Organizing documents for easy retrieval
Given the original topic of this blog (personal knowledge management), I should have addressed the issue of filing documents for easier retrieval a while ago. I haven’t done it yet because the topic is just so large (there are many options). To get around this, I’ll start small, with what I do personally, and grow the topic as time goes on.
By the way, the following are just simple tips I use. Some may seem obvious, but they lay the groundwork for (potential) future posts.
Motivation and (*ahem*) philosophical considerations
Since I started graduate studies, I’ve been accumulating a lot of documents, mostly PDFs of research papers (which, btw, I annotate heavily with PDF XChange Viewer, see this post). Yet it’s easy to forget where I’ve put one of them.
Now, for academic documents, there are dedicated solutions for document management, such as Zotero. These obviously offer lots of options for filing, searching (and citing).
Yet I have a tendency to prefer lightweight solutions, based on basic filesystem principles (filenames and directories). This simplifies incremental backups, never becomes obsolete, and will always be cross-platform.
Also, from a programmer’s perspective, such a structure is easy to use with scripts if the need ever arises (e.g. for backup scripts).
Using plain old filenames with “tags”
My current system is simply a hierarchy of documents in directories (duh, but bear with me). Yet I’m careful in the way I name the documents. A basic problem with hierarchies is that a given document can often be placed in multiple places. A common solution is to file it in one place that makes sense, even though there might be other classification options, and then use tags (keywords) for those other options. Then you can list documents by tag (see Virtual folders) so that the same document shows up in multiple places.
Concretely, I name my documents this way: “title, authors, date, tags .extension”. For example, “Learning representations by back-propagating errors, Rumelhart Hinton Williams, 1986, neural networks, machine learning.pdf”.
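To show how script-friendly this convention is, here is a minimal Python sketch that splits such a filename back into its fields. The names DocName and parse_doc_name are made up for illustration, and the sketch assumes the title itself contains no commas and that authors are separated by spaces, as in the example above.

```python
from pathlib import Path
from typing import NamedTuple


class DocName(NamedTuple):
    """Fields of a filename following 'title, authors, date, tags.extension'."""
    title: str
    authors: list[str]
    date: str
    tags: list[str]
    extension: str


def parse_doc_name(filename: str) -> DocName:
    """Split a filename into title, authors, date and tags.

    Assumption: the title contains no commas, so the first three
    comma-separated fields are title, authors and date, and every
    remaining field is a tag.
    """
    path = Path(filename)
    # Strip stray spaces around each field (e.g. a space before ".pdf").
    fields = [field.strip() for field in path.stem.split(",")]
    if len(fields) < 3:
        raise ValueError(f"expected at least title, authors, date: {filename!r}")
    title, authors, date, *tags = fields
    return DocName(title, authors.split(), date, tags, path.suffix)


doc = parse_doc_name(
    "Learning representations by back-propagating errors, "
    "Rumelhart Hinton Williams, 1986, neural networks, machine learning.pdf"
)
print(doc.tags)
```

A title with a comma in it would break this simple split, so in practice I’d either avoid commas in titles or pick another separator.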
Yet hierarchies are also highly intuitive, so most of the time I can locate the document by browsing the filesystem, not searching. So, the final solution is: 1) try hierarchy, if that doesn’t work, 2) search filenames for keywords, authors, dates, etc.
This might seem quite obvious, actually, but it requires planning. Notably, consistency is required for tags. If applicable, I try to use directory names as tags. For example, a document might be relevant to both “signal processing” and “machine learning”. So I’d use the tag “signal processing” if I filed it under the “machine learning” directory.
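Step 2 of the lookup (searching filenames) is what any file manager’s search box does, but it can also be sketched in a few lines of Python. The function name find_documents and the example root directory are hypothetical; the match is a plain case-insensitive substring test on the filename, nothing smarter.

```python
from pathlib import Path


def find_documents(root: str, *keywords: str) -> list[Path]:
    """Return files under root whose name contains every keyword.

    Matching is case-insensitive and looks only at the filename,
    not at the directory names or the file contents.
    """
    wanted = [keyword.lower() for keyword in keywords]
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and all(k in path.name.lower() for k in wanted)
    )


# Hypothetical usage: everything by Hinton tagged "machine learning".
for match in find_documents("Documents/papers", "hinton", "machine learning"):
    print(match)
```

Because tags, authors and dates all live in the filename, one substring search covers all of them, which is why consistent tag spelling matters so much.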
Virtual folders
Now the fun part of being consistent in tags and naming: if you’re using this convention, you can use the Virtual folder principle to group documents by tag. Documents that actually live in different directories then appear together, as if they were in one real directory. That way, in my “signal processing” directory, I can create a saved search which will automatically grab the document I put in the “machine learning” directory.
Basically, instead of typing a query each time, you just save it like a file, and the results now behave like a new directory. Some form of this is available on most OSes, see this Wikipedia page.
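To make the idea concrete, here is a rough Python imitation of a saved search. This is not how any OS implements virtual folders; it is only a sketch that stores a filename query in a small JSON file and re-runs it on demand. The names save_search and run_saved_search and the “.search.json” suffix are my own assumptions.

```python
import json
from pathlib import Path

# Hypothetical suffix for saved-search files, so they never match themselves.
SAVED_SEARCH_SUFFIX = ".search.json"


def save_search(query_file: str, root: str, keywords: list[str]) -> None:
    """Store a filename query on disk so it can be re-run later."""
    query = {"root": root, "keywords": keywords}
    Path(query_file).write_text(json.dumps(query), encoding="utf-8")


def run_saved_search(query_file: str) -> list[Path]:
    """Re-evaluate a saved query against the current state of the filesystem."""
    query = json.loads(Path(query_file).read_text(encoding="utf-8"))
    wanted = [keyword.lower() for keyword in query["keywords"]]
    return sorted(
        path
        for path in Path(query["root"]).rglob("*")
        if path.is_file()
        and not path.name.endswith(SAVED_SEARCH_SUFFIX)
        and all(k in path.name.lower() for k in wanted)
    )
```

The saved file holds only the query, never the results, so a document tagged “signal processing” that I file tomorrow under “machine learning” shows up the next time the search is run.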
Searching in the files themselves
I use filename search because it’s fast and very easy, and it makes “virtual folders” work really swiftly. It’s also possible to search inside the files, of course. I won’t say much about this, except to mention the essential point: for content search to be fast, files need to be indexed in advance.
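To illustrate what “indexed in advance” means, here is a toy inverted index in Python. It is only a sketch under strong assumptions: it reads plain .txt files (PDFs would first need a text-extraction step, which is left out), keeps everything in memory, and bears no relation to how the real desktop search tools work internally. The names build_index and search_index are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(root: str) -> dict[str, set[Path]]:
    """Map each lowercase word to the set of .txt files that contain it."""
    index: dict[str, set[Path]] = defaultdict(set)
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Each distinct word in the file points back to the file.
        for word in set(re.findall(r"\w+", text.lower())):
            index[word].add(path)
    return index


def search_index(index: dict[str, set[Path]], *words: str) -> set[Path]:
    """Return files containing every word, consulting only the index."""
    if not words:
        return set()
    hits = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*hits)
```

The slow part (reading every file) happens once in build_index; each query afterwards is just a few dictionary lookups and a set intersection, which is why indexed search feels instant.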
There are many programs which can do this, and some of them now come integrated with operating systems (Explorer does it on Windows, Spotlight for Mac OS, and there are a variety of options for Linux, notably Beagle). For more advanced functionality and to handle more information sources (e.g. emails, IMs…), Google Desktop must be mentioned.
Musings and relation to other posts
- The “lightweight” principle is starting to come back often in my posts. To me, it’s one strength of WikidPad: wiki entries are plain text files.
- The hierarchy-and-tags principle is the idea behind my WikidPad “saved search” extension.
