Archive for October 2006

The approach, with some sort of explanation

The general idea of the approach, as I used it*, is rather intuitive: break documents down into word frequencies, then relate words to documents as they ought to be related (yes, this statement could be made clearer), then group documents by similarity. The details are rather complicated, though.

Breaking documents down into word frequencies means extracting the number of occurrences of each word from each document. Said and done.
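Just to make that concrete, here’s a minimal sketch of that counting step in Python (an illustration, not my actual lsa_cluster.py code):

    import re
    from collections import Counter

    def word_frequencies(text):
        """Count how many times each word appears in a piece of text."""
        words = re.findall(r"[a-z]+", text.lower())
        return Counter(words)

    documents = ["The sun and the beach", "A sunny beach by the sea"]
    frequencies = [word_frequencies(doc) for doc in documents]
    # frequencies[0] -> Counter({'the': 2, 'sun': 1, 'and': 1, 'beach': 1})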

The “relate words to documents as they ought to be related” part is tricky to explain. Well, in fact, it’d be best if I understood it completely myself, heh. It’s rather easy to implement, though (see the corresponding function in my code). It’s based on a technique called Latent Semantic Analysis (LSA), which takes as input a matrix with word frequencies as rows and documents as columns, or vice versa, and outputs a matrix of the same dimensions, but adjusted.
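To make that input format a bit more concrete, here’s how such a matrix could be assembled from the per-document counts above (a hypothetical sketch; the function and variable names are mine, not the ones in my code):

    import numpy as np

    def term_document_matrix(frequencies):
        """Build a words-by-documents matrix from a list of Counter objects."""
        vocabulary = sorted(set(word for freq in frequencies for word in freq))
        matrix = np.zeros((len(vocabulary), len(frequencies)))
        for column, freq in enumerate(frequencies):
            for row, word in enumerate(vocabulary):
                matrix[row, column] = freq[word]  # 0 if the word is absent
        return vocabulary, matrix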

I’m not sure of all the details involved (why it works), but here’s some sort of explanation of what it does:

Suppose you take a set of documents of 500 words about a specific sunny vacation resort. Those documents tend to contain roughly the same words: “beach”, “sun”, etc. But since each document is slightly different from the others, the number of occurrences of each of those words varies from one document to the next. If you’re only interested in the topic, though, those differences are irrelevant to you. If, for example, one of the documents doesn’t contain a single occurrence of the word “sun”, but lots and lots of “sunny”, well, the difference is quite irrelevant indeed.

Now, the interesting part: applying LSA to the word frequencies observed in these documents would result in a new, adjusted frequency for the word “sun”, higher than its actual, physical frequency. Doing this for all words in all documents of a set, what you obtain are frequency distributions that more adequately reflect a “desirable” distribution for a set of documents covering different topics. So if you have documents about sunny vacation resorts, ski resorts, etc., the documents concerning similar topics would have more similar distributions of word frequencies after applying LSA. On the other hand, differences in word frequencies would be larger for documents concerning different topics.

This operation is performed by factoring the original matrix into three intermediate ones (a singular value decomposition), which can then be shrunk to a smaller size. Multiplying the shrunken matrices back together gives you a matrix of the original dimensions, but the shrinking leaves the final matrix with less “noise”, meaning less of those undesirable differences I talked about above.

That’s what it’s supposed to do in most cases, anyway.
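In concrete terms, here’s a rough numpy sketch of that decompose-shrink-recompose step (again, an illustration rather than my actual code; the number of dimensions to keep, k, is chosen arbitrarily here):

    import numpy as np

    def lsa_adjust(matrix, k=2):
        """Reduce "noise" by keeping only the k largest singular values."""
        # Factor the words-by-documents matrix into three intermediate matrices.
        u, s, vt = np.linalg.svd(matrix, full_matrices=False)
        # "Shrink" them: zero out all but the first k singular values...
        s_shrunk = np.zeros_like(s)
        s_shrunk[:k] = s[:k]
        # ...and multiply back to get an adjusted matrix of the original size.
        return u @ np.diag(s_shrunk) @ vt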

For something quite a bit more detailed, with a real mathematical example, see: An Introduction to Latent Semantic Analysis. My explanation of LSA is heavily inspired by part of that document, by the way.

Now for the “then group documents by similarity” part. Here, what we want are clearly defined sets: at the end of the process, each document is a member of one and only one set. Following a suggestion of M. Gagnon, I’ve done it with a basic clustering algorithm known as K-means, using the python-cluster library.

This algorithm takes a set of vectors in N-dimensional space (N being the number of words, here) and, given a desired number of groups K (hence K-means), assigns each point to one of those groups, defining the groups in the process. This means you don’t know a priori what those groups will be, and you might end up with groups that have no relation to the topics you first had in mind.
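In case that sounds abstract, here’s a bare-bones sketch of the K-means idea (this is not the python-cluster API, just an illustration; the number of iterations is arbitrary):

    import random
    import numpy as np

    def k_means(points, k, iterations=20):
        """Assign each point (a row of `points`) to one of k groups."""
        points = np.asarray(points, dtype=float)
        # Start with k randomly chosen points as the group centers.
        centers = points[random.sample(range(len(points)), k)]
        for _ in range(iterations):
            # Assign each point to its nearest center.
            distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Move each center to the mean of the points assigned to it.
            for group in range(k):
                if np.any(labels == group):
                    centers[group] = points[labels == group].mean(axis=0)
        return labels

Each word-frequency vector would be one row of points, and the returned labels tell you which of the K groups each document ended up in.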

* The real work, being a master’s thesis, is quite a bit more detailed, as I’ve said in the previous post.

First steps

I’ve been searching to find out whether others had already done work on linguistic information extraction from source code. I used to think of this as rather specific (read: unexplored), but I now stand corrected:

Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code

This is a master’s thesis by Adrian Kuhn of the Software Composition Group at the University of Bern. When I found out about it, my reaction was along the lines of “*argh*, it’s already been done!”. My project supervisor thought otherwise and suggested I use the skeleton of the approach taken by the author* to begin my exploration. Which I did, with simplifications**.

So here’s the code, as of now, written in Python: lsa_cluster.py (GPL). Please note that I haven’t yet tested it on a real set of documents, only on a tiny (tiny (tiny (…))) set of tiny strings (nine strings of 10-15 words each). It worked perfectly for that case, though 😛 But its execution might take a while on a real set (hundreds of documents with thousands of distinct words). I have yet to assemble such a set to test it.

In the following weeks, I will try to post some general explanations of the topics involved.

* The real work by Adrian Kuhn, being a master’s thesis, is quite a bit more detailed, and is not simply a program. My code simply replicates the (very) general idea, as interpreted by me.

** In a comment in my Python source code, you’ll find a list of the most notable simplifications involved.

Extracting concepts from source code

About myself, briefly, to give some context: I’m an undergraduate student in software engineering at École Polytechnique de Montréal.

For the following seven months or so, I’ll be working on a small (read: 4-5 hours a week) extra-curricular research project concerning “automatic methods to extract linguistic information from computer programs”. Put differently, that means I’ll be writing (or trying to write) programs to extract meaningful keywords, and relationships among them, from source code and related documents. This project is supervised by Michel Gagnon, a professor in the department of Computer Engineering at École Polytechnique de Montréal. He’s also the one who suggested the topic.

On another note, I’ve been wanting to start blogging for a while, so I figured this was the perfect occasion to do so. I’ll therefore be blogging about this project and about interesting books, links and ideas I come across along the way. The audience I have in mind is people with some familiarity with programming, but not necessarily with the specific topics I’ll be exploring.