The general idea of the approach, as I used it*, is rather intuitive: separate documents into word frequencies, then relate words to documents as they ought to be related (yes, that statement could be clearer), then group documents by similarity. The details are rather complicated, though.
Separating documents into word frequencies means extracting the number of occurrences of each word from each document. Said and done.
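This step is simple enough to sketch in a few lines. The function below is not my actual code, just a minimal illustration of the counting step, using the standard library:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count the occurrences of each word in a document.

    Lowercases the text and treats runs of letters as words; a real
    pipeline would also handle stemming, stop words, etc.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

freqs = word_frequencies("The sun was out, and the beach was sunny.")
# freqs["the"] == 2, freqs["sunny"] == 1
```

Doing this for every document in the collection gives you the columns (or rows) of the matrix that LSA takes as input.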
The “relate words to documents as they ought to be related” part is tricky to explain. Well, in fact, it’d be best if I understood it completely myself, heh. It’s rather easy to implement, though (see the corresponding function in my code). It’s based on a technique called Latent Semantic Analysis (LSA), which takes as input a matrix of word frequencies, with words as rows and documents as columns (or vice versa), and outputs a matrix of the same dimensions, but adjusted.
I’m not sure of all the details involved (why it works, exactly), but here’s some sort of explanation of what it does:
Suppose you take a set of documents of 500 words each, all about a specific sunny vacation resort. Those documents tend to contain about the same words: “beach”, “sun”, etc. But since each document is slightly different from the others, the number of occurrences of each of those words varies from one document to the next. If you’re only interested in the topic, though, those differences are irrelevant to you. If, for example, one of the documents doesn’t contain a single occurrence of the word “sun”, but lots and lots of “sunny”, well, the difference is quite irrelevant indeed.
Now, the interesting part: applying LSA to the word frequencies observed in these documents would result in a new, adjusted frequency for the word “sun”, higher than its real, physical frequency. Doing this for all words in all documents of a set, what you obtain are distributions of frequencies that reflect more adequately a “desirable” distribution for a set of documents covering different topics. So if you have documents about sunny vacation resorts, skiing resorts, etc., the documents concerning similar topics would have more similar distributions of word frequencies after applying LSA. On the other hand, differences in word frequencies would be larger between documents concerning different topics.
This operation is performed by decomposing the original matrix into three intermediate ones, which can then be shrunk to a smaller size. From the shrunk matrices you can rebuild a matrix of the original size, but the shrinking removes some of the “noise”, meaning less of those undesirable differences I talked about above.
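The decomposition in question is the singular value decomposition (SVD), and the “shrinking” means keeping only the k largest singular values. Here’s a minimal sketch of that step with NumPy (again, illustrative rather than my actual code; the choice of k is up to you):

```python
import numpy as np

def lsa_smooth(term_doc, k):
    """Reduce the 'noise' in a term-document matrix via truncated SVD.

    Decomposes the matrix A into U, s, Vt (so that A = U @ diag(s) @ Vt),
    keeps only the k largest singular values, and multiplies the
    truncated pieces back together. The result has the same shape as
    the input, but rank at most k: the adjusted frequencies.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Feeding the rows (or columns) of the resulting matrix to the clustering step below, instead of the raw counts, is the whole point of the LSA pass.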
That’s what it’s supposed to do in most cases, anyway.
For something quite a bit more detailed, with a real mathematical example, see: An Introduction to Latent Semantic Analysis. My explanations for LSA are heavily inspired by a part of that document, by the way.
Now for the “then group documents by similarity” part. Here, what we want are clearly defined sets: at the end of the process, each document is a member of one and only one set. Following a suggestion of M. Gagnon, I’ve done it with a basic clustering algorithm known as K-means. I’ve used the python-cluster library to do this.
This algorithm takes a set of vectors in N-dimensional space (N being the number of words, here) and, given a desired number of groups K (hence K-means), associates each point with one of those groups, defining the groups in the process. This means you don’t know a priori what those groups will be, and you might end up with groups that have no relation to the topics you first thought about.
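To make the algorithm itself concrete, here is a bare-bones K-means in NumPy. This is not the python-cluster code I actually used, just a sketch of the loop: assign each point to its nearest centroid, move each centroid to the mean of its points, repeat.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-means clustering.

    Returns a label (0..k-1) for each point and the final centroids.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # start from k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # distance from every point to every centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

In our case each “point” is one document’s (LSA-adjusted) vector of word frequencies, so documents with similar frequency distributions end up in the same group.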
* The real work, being a master’s thesis, is quite a bit more detailed, as I said in the previous post.