
Relationship between words and enclosing pieces-of-code

It’s been a while since I last posted; the reason is that I preferred to wait until I could work on the project for a few days in a row, which I did two weeks ago, and even then I needed some more time. While I still don’t have any working code, my ideas on how to proceed are getting much clearer. I’ll first explain what I want to produce and why, and then how I intend to produce it. This post will be long, but it needs to be.

In the last post, I said I intended to use the structure found in the software (function calls, class instances etc.) to induce structure in the domain concepts (“files”, “windows”…), as a few papers suggested. There are, of course, quite a lot of ways of doing this. Fundamental to all of them, I guess, is the way the words (or, more specifically, concepts) found in the source code relate to the piece of code they are found in.

For example, if the body of a function contains the word “window” (in a comment, a variable name, whatever), what does this mean about the function? Is it that the function uses a “window”, creates a “window”…? Of course there’s no obvious answer, even if we perform some (simple) analysis of the context in which the word is used. The real answer comes from “general understanding”, which is a pretty difficult thing to code.

There is, however, some information about this relation that we might gather from looking at the program structure alone. As I said earlier, if a function F1 calls a function F2, then we might say F1 needs F2. That’s not a relation of concepts, just a relation of functions. The goal is to end up knowing that F1 is strongly related to the concept “window” (for instance), and F2 to “file”, and, from the “needing” relation, induce that “window” needs “file”.
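As a toy illustration (a Python sketch with made-up names, not actual project code), the induction step could look like this: given function-level call edges and each function’s strongest concept, lift the relation to the concept level.

def induce_concept_needs(calls, concept_of):
    # calls: iterable of (caller, callee) function pairs
    # concept_of: maps a function name to its strongest concept, if any
    needs = set()
    for caller, callee in calls:
        c1, c2 = concept_of.get(caller), concept_of.get(callee)
        if c1 and c2 and c1 != c2:
            needs.add((c1, c2))
    return needs

# induce_concept_needs([("F1", "F2")], {"F1": "window", "F2": "file"})
# returns {("window", "file")}, read as: "window" needs "file"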

Still lots of ways to do it. Here’s my shot at it.

I concentrate on two types of relation between a concept and a “piece-of-code” (a piece-of-code being a function, class, file or directory):

  • A concept is represented by a piece-of-code
  • A concept is needed by a piece-of-code (that is, the piece-of-code needs one or a few other pieces-of-code which represent that concept)
  • (for the sake of completeness, I also need to mention the “no strong relation” possibility)

The reason I chose these two, and no more, is that they provide useful semantic information, and we can gather that information simply by analyzing the structure of the code. I considered using “relates to” instead of “is needed by”, but I feel I would be losing meaning in the abstraction, with no apparent justification.

Now, a word on why this distinction is useful. In the previous approach I considered (LSA and clustering, see previous posts), a word was either related to a piece-of-code or it was not. The relatedness was deduced from the presence or absence of said word in the piece-of-code under consideration. Afterwards, algorithms were applied to determine which words related more to which pieces-of-code. Frequencies were the key here. But you ended up with pieces-of-code-as-represented-by-XYZ-words, with no regard as to whether the concept was indeed descriptive of the piece-of-code, or simply very strongly related to it. My idea is to make such a distinction, and so have better data for creating an ontology.

Problem is: I don’t know if the linguistic information in the code (mainly variable names and comments) is usually clear enough to establish such a distinction in a rather simplistic, programmatic way. That’s where experimentation comes in, and I’ll describe how I intend to experiment first.

Now, to go any further, I need to introduce the concept of the body extension* of a piece-of-code. Usually, a function’s body is simply the code between its enclosing braces (in function foo(){ return bar; }, “return bar;” is the body). The body extension is the code not found in the body of the function itself, but in the code surrounding its relation “spots” (function calls, variable uses). So, if function foo() is called the following way:

function caller() {
    // We now call foo() to get bar
    bar = foo();
}

then the comment preceding the call is part of the body extension of foo(). That comment is also part of the body of caller(), though. The intent here is to add the frequencies of the words in the calling environment (the lines surrounding the call) to the word frequencies associated with the function foo(). Since those words are part of the body extension of some other piece-of-code, though, we could also decrease the importance of their link to the caller() function (by decreasing their frequency a bit, or something along those lines).
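To make this concrete, here is a minimal Python sketch of what I have in mind; the window size, the damping factor, and every name in it are placeholders for experimentation, not settled choices. It also folds in a naive identifier splitter, since expanded identifiers feed the frequencies anyway:

import re
from collections import Counter

LINES_ABOVE, LINES_BELOW = 2, 1  # window around a call site; arbitrary for now
DAMPING = 0.5                    # how much to weaken the caller's own link; arbitrary

def split_identifier(token):
    # Split camelCase and under_score identifiers into lowercase words.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [w.lower() for w in spaced.split()]

def word_counts(lines):
    # Word frequencies over a list of source lines (comments included as-is).
    counts = Counter()
    for line in lines:
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
            counts.update(split_identifier(token))
    return counts

def body_extension_counts(source_lines, call_line):
    # Frequencies for the lines surrounding one call site (1-based call_line).
    start = max(0, call_line - 1 - LINES_ABOVE)
    end = min(len(source_lines), call_line + LINES_BELOW)
    return word_counts(source_lines[start:end])

def merge(callee_freqs, caller_freqs, extension):
    # Credit the callee with the surrounding words, damp them in the caller.
    for word, n in extension.items():
        callee_freqs[word] = callee_freqs.get(word, 0.0) + n
        caller_freqs[word] = max(0.0, caller_freqs.get(word, 0.0) - DAMPING * n)

Whether the damping should be a subtraction, a multiplicative factor, or nothing at all is exactly the kind of thing I want to compare experimentally.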

Now, having “cleaned” the word frequencies that way, if we apply frequency-based algorithms (e.g. LSA and clustering, as mentioned before), words will tend to be associated less with “needing” pieces-of-code, and more with “representing” pieces-of-code. Then, if a word is very frequent in a piece-of-code but is not in a “representing” relation to it, it must be in a “needing” relation with it. I must say, though, that this is certainly not the cleanest nor the most interesting use of the body extension concept, but this post is getting longishly long, and I need to get down to the technical steps one day or the other. Note, also, that the whole point here is to see if this concept adds truly useful information and, following experimentation, to further guide the development of some kind of ontology extraction method.
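The decision rule itself is then trivial; a sketch, where the representing assignments and the threshold are assumed to come out of the LSA/clustering step:

FREQ_THRESHOLD = 5  # arbitrary cutoff, to be tuned experimentally

def classify(word, piece, freqs, represents):
    # freqs: (word, piece) -> cleaned frequency; represents: piece -> set of words
    if word in represents.get(piece, set()):
        return "represented-by"
    if freqs.get((word, piece), 0) >= FREQ_THRESHOLD:
        return "needed-by"
    return "no strong relation"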

So here are the steps, in bird’s-eye view:

  1. Extract information about pieces-of-code, that is, extract the hierarchy of directories/files/classes/functions, and have a way to track down line numbers corresponding to those pieces-of-code.
  2. Extract information about uses of pieces-of-code (function calls, class uses), and have a way to track down line numbers here, too.
  3. Create a copy of the code and process it by expanding identifiers (variable names, function names, etc.) into complete words as much as possible.
  4. Using the line numbers found in steps 1 and 2, and some definition of the body extension (say, 2 lines above and 1 below a call), extract word frequencies.
  5. Apply *something* to those frequencies. In my case it’ll probably be LSA and clustering, at first (a rough sketch follows this list). Once here, I can also experiment with different combinations of body/body extension frequencies.
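For step 5, here is roughly what the first experiment could look like, as a toy numpy sketch; the matrix is a made-up term-by-piece-of-code frequency table, and k is picked arbitrarily:

import numpy as np

terms = ["window", "file", "open", "draw"]   # rows: words
pieces = ["F1", "F2", "F3"]                  # columns: pieces-of-code
A = np.array([[4.0, 0.0, 1.0],               # toy frequencies from step 4
              [0.0, 5.0, 2.0],
              [1.0, 3.0, 0.0],
              [3.0, 0.0, 0.0]])

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]        # words in the latent space
piece_vecs = Vt[:k, :].T * s[:k]    # pieces-of-code in the same space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Strongest piece-of-code for each word; clustering the piece_vecs
# (k-means or hierarchical) would come next.
for i, t in enumerate(terms):
    best = max(range(len(pieces)), key=lambda j: cosine(term_vecs[i], piece_vecs[j]))
    print(t, "->", pieces[best])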

I intend to extract structure by modifying the source code of doxygen, a tool that generates documentation. The reason I chose this tool, and not some more specialized fact extractor or other reverse engineering software, is that doxygen keeps track of line numbers for function calls and the like, which is fundamental to the body extension idea. It’s not the cleanest solution (a byproduct of the operation is generating documentation!), but from what I gather, doxygen can extract structure from a few important languages, and does so in a robust manner, creating very useful data structures along the way (fully referenced function calls, extending from line #1 to line #2…).
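For the record, a possibly lighter route than patching doxygen itself would be to parse the XML it can already emit (GENERATE_XML = YES), since the location elements there carry body line ranges. A sketch, assuming that layout (note that bodyend can be -1 when doxygen doesn’t know it):

import xml.etree.ElementTree as ET

def function_body_ranges(xml_path):
    # Map function names to (file, bodystart, bodyend) from one doxygen XML file.
    ranges = {}
    for member in ET.parse(xml_path).iter("memberdef"):
        if member.get("kind") != "function":
            continue
        loc = member.find("location")
        if loc is None or loc.get("bodystart") is None:
            continue
        ranges[member.findtext("name")] = (
            loc.get("bodyfile") or loc.get("file"),
            int(loc.get("bodystart")),
            int(loc.get("bodyend") or -1),  # -1 when unknown
        )
    return ranges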

So that’s what I intend to do. I might also completely change my mind if a better idea comes along, but I think that to make further progress in creating an ontology, I need to clarify the relationship between words and pieces-of-code. Another important point is that the tools created along the way will be useful in my next steps, which matters since almost half of my time on this project has already been spent.

* I had first called it “extended body”, but as Michel noted, it makes more sense to say “body extension”.