So if I don’t want to use files and folders, what else can I try?
Something I’ve used in other organisation systems is keyword tagging, and that works much better for my brain. When I store something, I add a number of “tags” – one or more keywords that describe the document. Later, I can filter to find documents that have particular tags as a form of search.
I once heard tags described as a “search engine in reverse”, and it’s a nice image. I’m adding the keywords that I’ll likely use to search for something later. If I think I might look for something in three different ways, I can give it three different tags.
Consider the electricity bill again. Rather than putting it in a single folder, I could tag it with “home” and “utility bills” and “acme energy”, and I could find it later by searching for any one of those tags.
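To make the "search engine in reverse" idea concrete, here is a minimal Python sketch of a tag index. It is an illustration only, not docstore's actual code: the filenames, the `documents` dictionary and the `build_tag_index` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical documents: filename -> set of tags chosen at storage time.
documents = {
    "electricity-bill.pdf": {"home", "utility bills", "acme energy"},
    "water-bill.pdf": {"home", "utility bills"},
    "payslip.pdf": {"work"},
}


def build_tag_index(documents):
    """Map each tag to the set of documents that carry it."""
    index = defaultdict(set)
    for name, tags in documents.items():
        for tag in tags:
            index[tag].add(name)
    return index


tag_index = build_tag_index(documents)

# Any one of the three tags is enough to find the electricity bill.
print(sorted(tag_index["acme energy"]))
print(sorted(tag_index["home"]))
```

Because the bill is listed under every tag it was given, it doesn't matter which of the three keywords comes to mind later.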
It helps that I have a good example of tagging to emulate: tagging in the world of fandom. On bookmarking sites like Pinboard and Delicious, fans have created intricate systems of tags to describe fanfiction, and by combining tags you can make very specific queries. There are shared conventions to describe word count, the fandom, the pairing, the trope, and many other things besides – which means you can even search bookmarks that were tagged by somebody else. For specific examples, I really recommend Maciej Cegłowski’s talk Fan is a Tool-Using Animal.
I use tagging in my Pinboard account (including for fanfic), so I’m quite used to it, and I know I like it. I decided to use tagging as the basis for my PDF organisation.
At a minimum, I want a PDF organiser that:
This is similar to other tools I’ve built before – I’ve built a lot of variants of an image organiser that uses tags. I chose to write it in Python, because that’s the language I’m most familiar with, and it let me get started quickly, but you could implement this idea in a lot of languages. (If I was starting fresh today, I’d be tempted to write it in Rust.)
I built the initial prototype with the responder web framework. That was a year ago, and I got the core features working in a few hours. I’ve been adding polish and new features ever since.
I’ve recently switched to the more popular Flask, which is a great library for writing small web apps. (If you aren’t familiar with it, start with Miguel Grinberg’s Flask Mega-Tutorial.)
I have a bunch of libraries doing the heavy lifting, including:
The whole app is packaged in a Docker image, to make deployments easy. I can just as easily run it on my Linux web server as on my home Mac. If you have Docker installed, you can run it like so:
docker run \
  --publish 8072:8072 \
  --volume /path/to/documents:/documents \
  greengloves/docstore:latest
This starts the web app running on http://localhost:8072, and any files you upload will be saved to /path/to/documents.
If you’d like to read the source code, it’s all available on GitHub.
For simplicity, docstore only has a single screen. Here’s what it looks like, storing some of my ebooks:
Most of the screen is taken up with a list of documents. Each document has a one-line description, a thumbnail, and some metadata.
The thumbnails make it easy to identify a document at a glance – book covers are particularly good for this, but it works for letters too. Companies tend to use consistent letterheads, so I learn to spot particular patterns as I’m scrolling a list.
The metadata includes the date I stored something (not necessarily the date of the document itself – I scanned a lot of stuff long before I saved it in docstore), and a list of tags. If I click one of the tags, it filters the documents to ones that have that tag. Tags stack, so if I click “programming” and then “programming:python”, I’ll only see documents that have both of those tags.
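The stacking behaviour is an "AND" filter: a document is shown only if it has every selected tag. Here is a rough Python sketch of that logic, assuming each document is a dictionary with a `tags` list. The `filter_by_tags` name and the sample data are made up for illustration and aren't taken from the docstore source.

```python
def filter_by_tags(documents, selected_tags):
    """Return only the documents that carry every selected tag."""
    selected = set(selected_tags)
    # `selected <= tags` is a subset test: all selected tags must be present.
    return [doc for doc in documents if selected <= set(doc["tags"])]


# Hypothetical sample data.
documents = [
    {"title": "Python cookbook", "tags": ["programming", "programming:python"]},
    {"title": "Rust handbook", "tags": ["programming", "programming:rust"]},
    {"title": "Bread recipes", "tags": ["cooking"]},
]

# Clicking "programming" narrows the list to two documents...
step_one = filter_by_tags(documents, ["programming"])
# ...and then clicking "programming:python" narrows it to one.
step_two = filter_by_tags(documents, ["programming", "programming:python"])

print([doc["title"] for doc in step_one])
print([doc["title"] for doc in step_two])
```

With no tags selected, the empty set is a subset of everything, so the filter returns the full list.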
In the navbar, there are options to sort by title or by date:
The “Store document” button opens the form for adding a new file. It’s a standard web form:
Although I originally built this to handle scanned PDFs, I get a lot of correspondence electronically – for example, I get my bank statements from an online portal, not in the post. I want to keep all those documents alongside my scanned papers, so I store them in docstore too, and the source URL lets me track where I downloaded a file from.
The “Show tags” button shows a list of tags in the current view. Clicking any one of the tags will filter the documents to ones that have that tag:
This list is context-dependent: if I’ve already applied a tag query, it shows me the list of tags for documents that match my query. For example, if I selected the “programming” tag, I’d only see the tags used by files that are tagged with “programming”.
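One way to compute a context-dependent tag list like this is to filter the documents first, then count the tags on whatever is left. The sketch below does that with `collections.Counter`. The `tags_in_view` function and the sample documents are hypothetical, and docstore itself may do this differently.

```python
from collections import Counter


def tags_in_view(documents, query_tags=()):
    """Count the tags used by documents that match the current tag query."""
    query = set(query_tags)
    matching = [doc for doc in documents if query <= set(doc["tags"])]
    return Counter(tag for doc in matching for tag in doc["tags"])


# Hypothetical sample data.
documents = [
    {"title": "Python cookbook", "tags": ["programming", "programming:python"]},
    {"title": "Rust handbook", "tags": ["programming", "programming:rust"]},
    {"title": "Bread recipes", "tags": ["cooking"]},
]

# With no query, every tag in the collection is listed.
all_tags = tags_in_view(documents)
# With "programming" selected, "cooking" drops out of the list.
programming_tags = tags_in_view(documents, ["programming"])

print(sorted(all_tags))
print(sorted(programming_tags))
```

The counts are a bonus: they show how many documents each further click would leave on screen.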
When I get a piece of paper, this is what I do with it:
I try to scan everything the day it arrives, so I don’t build up a backlog.
I use semi-structured tags, with a common prefix to group similar tags. Here are some examples of what my tags look like:
bank:credit-card-4567
car:austin-WLG142E
health:optician
home:667-dark-avenue
payslips
providers:acme-energy
travel
utilities:water
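A shared prefix also makes tags easy to group in code. As a small sketch, the Python below splits each tag at the first colon and collects the tags under their prefix. The `group_by_prefix` helper is hypothetical, and filing unprefixed tags under an empty-string key is just one possible choice.

```python
from collections import defaultdict

tags = [
    "bank:credit-card-4567",
    "car:austin-WLG142E",
    "health:optician",
    "home:667-dark-avenue",
    "payslips",
    "providers:acme-energy",
    "travel",
    "utilities:water",
]


def group_by_prefix(tags):
    """Group tags by the text before the first colon."""
    groups = defaultdict(list)
    for tag in sorted(tags):
        prefix, separator, rest = tag.partition(":")
        if separator:
            groups[prefix].append(rest)
        else:
            # Tags with no prefix are collected under the empty string.
            groups[""].append(tag)
    return dict(groups)


groups = group_by_prefix(tags)
print(groups["bank"])
print(groups[""])
```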
I run several instances of docstore, each one for a different type of document:
At time of writing, I’ve got 1585 PDFs with 23,795 pages, and most of the original paper has been recycled. It’s a big saving!
Buy a document scanner, decide how you want to organise, start scanning!
If you want to buy a document scanner, I like my Canon ImageFORMULA and I’d be happy to buy another from the same line. I also trust recommendations from Wirecutter, who discuss the topic in more detail.
It’s worth thinking about how you’ll organise your scans before you start scanning your existing paper – whether you use keyword tagging like me, some files and folders, or something else. Depending on what you decide, it might be much easier to organise as you go along, rather than build up a big backlog, so sort that out early!
If you want to run docstore yourself, the code and deployment instructions are all on GitHub: https://github.com/alexwlchan/docstore
If you enjoyed this post, you might also want to read:
Designing better file organization around tags, not hierarchies, by Nayuki. This is a detailed essay about a design for a filesystem that’s based entirely around keyword tagging, not hierarchies. This essay informed some of the internal design decisions in docstore.
Fan is a Tool-Using Animal (video, transcript), by Maciej Cegłowski. This is a talk about the use of tagging and similar systems in fannish circles.
Situated Software, by David MacIver. Although I didn’t mention the term above, docstore feels like a good example of situated software.