First steps
I’ve been looking into whether others have already worked on extracting linguistic information from source code. I used to think of this as a rather specific (read: unexplored) topic, but I now stand corrected:
Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code
This is a master’s thesis by Adrian Kuhn of the Software Composition Group at the University of Bern. When I found out about it, I reacted along the lines of “*argh*, it’s already been done!”. My project supervisor thought otherwise and suggested I use the skeleton of the author’s approach* to begin my exploration. Which I did, with simplifications**.
So here’s the code, as of now, written in Python: lsa_cluster.py (GPL). Please note that I haven’t yet tested it on a real set of documents, only on a tiny (tiny (tiny (…))) set of tiny strings (9 strings of 10-15 words each). It worked perfectly for that case, though.
But its execution might take a while on a real set (hundreds of documents with thousands of distinct words). I have yet to assemble such a set to test it.
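For readers new to the idea, here is a minimal sketch of the latent semantic analysis step that this kind of approach builds on: count terms per document, reduce the term-document matrix with a truncated SVD, then compare documents by cosine similarity. This is not the contents of lsa_cluster.py. The function names, the four sample strings, and the choice of rank 2 are all assumptions made for illustration, and the sketch assumes NumPy is installed.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw term-count matrix (rows = terms, columns = documents)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[index[word], j] += 1.0
    return vocab, matrix


def lsa_document_vectors(matrix, rank):
    """Project documents into a rank-k latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the `rank` largest singular values; each row of the
    # result is one document expressed in the reduced space.
    return (np.diag(s[:rank]) @ vt[:rank]).T


def cosine_similarities(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for empty documents
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical toy input: two file-related and two window-related strings.
docs = [
    "open file read buffer close file",
    "read file buffer write file",
    "draw window paint button",
    "paint window resize button",
]
vocab, matrix = term_document_matrix(docs)
vectors = lsa_document_vectors(matrix, rank=2)
sims = cosine_similarities(vectors)
print(np.round(sims, 2))
```

On this toy input, the two file-related strings end up close to each other and far from the two window-related ones, which is the kind of grouping a clustering step would then pick up. The full SVD is also where the running time goes on a real set with thousands of distinct words.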
In the following weeks, I will try to post some general explanations of the topics involved.
* The real work by Adrian Kuhn, being a master’s thesis, is considerably more detailed and is not simply a program. My code only replicates the (very) general idea, as interpreted by me.
** A comment in my Python source code lists the most notable simplifications involved.
