First steps

I’ve been looking into whether others have already worked on extracting linguistic information from source code. I used to think of this as a rather specific (read: unexplored) topic, but I now stand corrected:

Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code

This is a master’s thesis by Adrian Kuhn of the Software Composition Group at the University of Bern. When I found out about it, I reacted along the lines of “*argh*, it’s already been done!”. My project supervisor thought otherwise and suggested I use the skeleton of the author’s approach* as a starting point for my own exploration. Which I did, with simplifications**.

So here’s the code as it stands, written in Python: lsa_cluster.py (GPL). Please note that I haven’t yet tested it on a real set of documents, only on a tiny (tiny (tiny (…))) set of tiny strings (9 strings of 10-15 words each). It worked perfectly for that case, though :-P Its execution might take a while on a real set (hundreds of documents with thousands of distinct words), and I have yet to assemble such a set to test it.
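To give a rough idea of the kind of input such a program starts from, here is a minimal sketch of a term-document matrix and a cosine similarity between documents. It is not taken from lsa_cluster.py: the function names and the three sample strings are made up for illustration, and the step that makes LSA what it is (a truncated singular value decomposition of this matrix) is deliberately left out.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_similarities(matrix):
    """Compare every pair of documents (the columns of the matrix)."""
    if not matrix:
        return []
    n_docs = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_docs)]
    return [[cosine(columns[i], columns[j]) for j in range(n_docs)]
            for i in range(n_docs)]


# Hypothetical sample "documents", invented for this sketch.
docs = [
    "parse the source file",
    "parse the header file",
    "draw a red circle",
]
vocab, matrix = term_document_matrix(docs)
sims = document_similarities(matrix)
```

Here the first two strings share three of their four words, so they come out as similar, while the third shares none with either and gets a similarity of zero. On a real set, the matrix would have thousands of rows and hundreds of columns, which is where the running time starts to matter.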

In the coming weeks, I will try to post some general explanations of the topics involved.

* The real work by Adrian Kuhn, being a master’s thesis, is quite a bit more detailed and is not simply a program. My code merely replicates the (very) general idea, as I interpreted it.

** A comment in my Python source code lists the most notable simplifications involved.
