Archive for the ‘UPIR Project’ Category.

*bump*

This blog still lives on… in my heart 😛 Someday I’ll post something about the end of these adventures.

UPDATE (2008/06/07): I realize I never actually posted anything about said end of adventures. The fact is, the project came to an end mostly due to time considerations. I had a summer internship which took up most of my time, so I could not put much more time into the project in the following months, and a lot more work would have been needed in experimentation and tweaking to find out whether or not the idea actually worked… so basically the project was shelved.

Technical foundations ready

It’s been two months since my last post. This is not due to decreased interest, mind you, simply to the fact that creating the technical basis for experimentation took rather long, and that in the meantime there wasn’t much progress to report on (I should seriously consider diversifying subjects on this blog, I know 😛 ).

So I went on with my plan: modifying the doxygen source code to extract information about the structure of the program. It wasn’t that complicated, but it took quite a while. The code isn’t commented much, which is rather surprising for a program that specializes in extracting metadata from comments. It works fine, though, and that’s what I was looking for.

As I said, the little hack I created extracts information about functions, function calls, and their corresponding line numbers. I also preprocess the source code (the target of analysis) to create a copy which contains only the words I want to consider.
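
Just to give an idea, here is a minimal sketch of the kind of preprocessing pass I mean (an illustration, not the actual code, and the example identifiers are made up): split identifiers on underscores and camelCase boundaries, and keep only word-like tokens.

    import re

    def extract_words(source_text):
        """Split source into word-like tokens: break identifiers on
        underscores and camelCase boundaries, drop everything else."""
        identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
        words = []
        for ident in identifiers:
            for part in ident.split("_"):
                # "openFileDialog" -> ["open", "File", "Dialog"]
                words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part))
        return [w.lower() for w in words if len(w) > 1]

    print(extract_words("bar = openFileDialog(window_title);"))
    # ['bar', 'open', 'file', 'dialog', 'window', 'title']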

With this information in hand, I will now be able to proceed with extraction of function bodies and body extensions.

There are a few limitations to this approach, though, and one that deserves mention is the analysis of code containing polymorphism (creating pointers to objects with knowledge of only the parent class or interface, without knowing the object’s precise type until runtime). If a call is made through such a pointer, it’s very difficult (or, in some cases, it plainly makes no sense) to extract links to the right classes/functions from the static source code — you may only know the parent class. That creates some noise in the data. For that reason, I’ll have to find a target source code for my experiments that doesn’t use polymorphism too much.
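
To illustrate with a hypothetical snippet (mine, not from any real target): looking at the call below, static analysis can only tell that render() calls some draw() declared on Shape; which subclass’s draw() actually runs is decided at runtime.

    class Shape:
        def draw(self):
            raise NotImplementedError

    class Circle(Shape):
        def draw(self):
            print("circle")

    class Square(Shape):
        def draw(self):
            print("square")

    def render(shape):
        # Statically, all we know is "render calls Shape.draw";
        # whether it's Circle.draw or Square.draw depends on the
        # object passed in at runtime.
        shape.draw()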

I now have to design detailed measures and statistics to be able to tell whether or not this approach* looks promising.

* At this point, I wouldn’t call it a “hypothesis”: it’s just a general path for further explorations. I’ll probably have to use hypotheses in the statistical sense, though.

Relationship between words and enclosing pieces-of-code

It’s been a while since I last posted, the reason being that I preferred to wait until I could work on the project for a few days in a row, which I did two weeks ago, though I still needed some time after that. While I still don’t have any working code, my ideas on how to proceed are getting much clearer. I’ll first explain what I want to produce and why, and then how I intend to produce it. This post will be long, but it needs to be.

In the last post, I said I intended to use the structure found in the software (function calls, class instances etc.) to induce structure in the domain concepts (“files”, “windows”…), as a few papers suggested. There are, of course, quite a lot of ways of doing this. Fundamental to all of them, I guess, is the way the words (or, more specifically, concepts) found in the source code relate to the piece of code they are found in.

For example, if the body of a function contains the word “window”, in a comment, variable name, whatever, what does this mean about the function? Is it that the function uses “window”, creates “window”…? Of course there’s no obvious answer, even if we perform some (simple) analysis of the context in which the word is used. The real answer comes from “general understanding”, which is a pretty difficult thing to code.

There is, however, some information about this relation that we might gather from looking at the program structure alone. As I said earlier, if a function F1 calls a function F2, then we might say F1 needs F2. That’s not a relation of concepts, just a relation of functions. The goal is to end up knowing that F1 is strongly related to the concept “window” (for instance), and F2 to “file”, and, from the “needing” relation, induce that “window” needs “file”.
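
Here is a minimal sketch of that induction step, with made-up data (the function-to-concept mapping is precisely what the rest of the approach is supposed to produce):

    # Hypothetical inputs: a function call graph, and a mapping from
    # functions to the concept each is most strongly related to.
    calls = [("F1", "F2")]                       # F1 calls F2
    concept_of = {"F1": "window", "F2": "file"}  # from word-frequency analysis

    def induce_concept_needs(calls, concept_of):
        """Lift the function-level "needs" relation to the concept level."""
        needs = set()
        for caller, callee in calls:
            c1, c2 = concept_of.get(caller), concept_of.get(callee)
            if c1 and c2 and c1 != c2:
                needs.add((c1, "needs", c2))
        return needs

    print(induce_concept_needs(calls, concept_of))
    # {('window', 'needs', 'file')}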

Still lots of ways to do it. Here’s my shot at it.

I concentrate on two types of concept-“piece-of-code” relation (“piece-of-code” being a function, class, file or directory):

  • A concept is represented by a piece-of-code
  • A concept is needed by a piece-of-code (that is, the piece-of-code needs one or a few other pieces-of-code which represent that concept)
  • (for the sake of completeness, I also need to mention the “no strong relation” possibility)

The reason I chose these two, and no more, is that they provide useful semantic information, and we can gather that information by simply analyzing the structure of the code. I considered using “relates to” instead of “needed”, but I feel I’m losing meaning in the abstraction, with no apparent justification.

Now, a word on why this distinction is useful. In the previous approach I considered (LSA and clustering, see previous posts), a word was either related to a piece-of-code, or it was not. The relatedness was deduced from the presence or absence of said word in the piece-of-code under consideration. Afterwards, algorithms were applied which determined which words related more to which pieces-of-code. Frequencies were the key here. But you ended up with pieces-of-code-as-represented-by-XYZ-words, with no regard as to whether the concept was indeed descriptive of the piece-of-code, or simply very strongly related to it. My idea is to be able to make such a distinction, to then have better data to use in creating an ontology.

Problem is: I don’t know if the linguistic information in the code (mainly variable names and comments) is usually sufficiently clear to establish such a distinction in a rather simplistic programmatic way. That’s where experimentation comes in, and I’ll first describe the way I intend to experiment.

Now, to go any further, I need to introduce the concept of the body extension* of a piece-of-code. Usually, the function body is simply the code between its enclosing braces (function foo(){ return bar; } -> “return bar;” is the body). The body extension, then, is the code not found in the body of the function, but found in the code surrounding its relation “spots” (function calls, variable uses). So, if function foo() is called the following way:

    function caller() {
        // We now call foo() to get bar
        bar = foo();
    }

then the comment preceding the call is part of the body extension of the function. That comment is also part of the body of caller(), though. The intent, here, is to add the frequencies of the words in the calling environment (the lines surrounding the call) to the word frequencies associated with the function foo(). Seeing that those words are part of the body extension of some other piece-of-code, though, we could also decrease the importance of the link to the caller() function (by decreasing their frequency a bit, or something).

Now, having “cleaned” the word frequencies that way, if we apply algorithms based on frequency (ex: LSA and clustering, as mentioned before), the words will tend to be associated less with “needing” pieces-of-code, and more with “representing” pieces-of-code. Then, if a word is very frequent in a piece of code, but is not in a “representing” relation to it, it must be that it is in a “needing” relation with it. I must say, though, that this is certainly not the cleanest nor most interesting use of the body extension concept, but this post is getting longishly long, and I need to get down to the technical steps one day or the other. Note, also, that the whole point here is to see if this concept adds truly useful information, and, following experimentation, to further guide the development of some kind of ontology extraction method.
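
To make that concrete, here is a minimal sketch of the harvesting part, with made-up inputs (real line numbers would come from the doxygen hack described below, and the window sizes match the ones I use in the steps below):

    from collections import Counter

    def body_extension_counts(lines, call_line, above=2, below=1):
        """Word counts for the lines surrounding a call site
        (the body extension window: 'above' lines before, 'below' after)."""
        start = max(0, call_line - above)
        end = min(len(lines), call_line + below + 1)
        text = " ".join(lines[start:end]).replace("(", " ").replace(")", " ")
        return Counter(t.lower() for t in text.split() if t.isalpha())

    # Hypothetical data: caller() calls foo() on line 2 (0-based).
    source = ["function caller() {",
              "    // We now call foo() to get bar",
              "    bar = foo();",
              "}"]

    counts = body_extension_counts(source, call_line=2)
    # These counts would be credited to foo(), and down-weighted
    # (by half, say) in caller(), whose body they belong to.
    print(counts)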

So here are the steps, in bird’s-eye view:

  1. Extract information about pieces-of-code, that is, extract the hierarchy of directories/files/classes/functions, and have a way to track down line numbers corresponding to those pieces-of-code.
  2. Extract information about uses of pieces-of-code (function calls, class uses), and have a way to track down line numbers here, too.
  3. Create a copy of the code, and process it by expanding identifiers (variable names, function names, etc.) into complete words as much as possible.
  4. Using the line numbers found in steps 1 and 2, and using some definition of the body extension (say, 2 lines above and 1 below the call), extract frequencies, as sketched above.
  5. Apply *something* to those frequencies. In my case it’ll probably be LSA and clustering, at first. Once here, I can also experiment with different combinations of body/body extension frequencies.

I intend to extract structure by modifying the source code of doxygen, a tool that generates documentation. The reason I chose this tool, and not some more specialized fact extractor or other reverse engineering software, is that doxygen keeps track of line numbers for function calls etc., which is fundamental to the body extension idea. It’s not the cleanest solution (since a byproduct of the operation is generating documentation!), but from what I gather, doxygen can extract structure from a few important languages, and does so in a robust manner, creating very useful data structures along the way (fully referenced function calls, extending from line #1 to line #2…).

So that’s what I intend to do. I might also completely change my mind if a better idea comes along, but I think that, to make further progress in creating an ontology, I need to clarify the relationship between words and pieces-of-code. Another consideration is that the tools created along the way will be useful in my next steps, which matters since almost half of my time on this project has already been spent.

* I had first called it “extended body”, but as Michel noted, it makes more sense to say “body extension”.

Ontology learning and ontology-based software engineering

About a week and a half ago, I briefly met with Michel. I had a feeling of what I was going to do next, but wanted to know if he had particular suggestions. He told me to go ahead if I felt I knew what to do, but, nevertheless, he suggested I look into “ontology-based software engineering”.

I was talking earlier about concept extraction from source code. What I meant, in fact, is ontology extraction. An ontology, in this context, is a “specification of a conceptualisation of a knowledge domain”. Stated another way, it represents concepts and links between them for a given topic. For example, an ontology about computers could have concepts “laptop”, “desktop” and “server” specified as children of the concept “computer”, “children of” being a type of relation.

They (ontologies) have many uses, providing a common base of exchange on a given topic, like a protocol. As they are very formal things, computer programs can analyze them and “reason” with the data they provide.
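
As a toy illustration of that formality, the computer example above can be written as a handful of triples that a program can query directly (a minimal sketch, not tied to any particular ontology language):

    # The "computer" example as (subject, relation, object) triples.
    ontology = [
        ("laptop", "child-of", "computer"),
        ("desktop", "child-of", "computer"),
        ("server", "child-of", "computer"),
    ]

    def children(concept, triples):
        """Trivial "reasoning": list the direct subconcepts of a concept."""
        return [s for s, rel, o in triples if rel == "child-of" and o == concept]

    print(children("computer", ontology))  # ['laptop', 'desktop', 'server']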

It turns out a lot of research has gone/is going into ontologies. Ontology-based software engineering is an important area of it. In fact, there’s a workshop on a strongly related topic being held at this very moment, the 2nd International Workshop on Semantic Web Enabled Software Engineering. Ontologies are a very important component of what is called the Semantic Web (see an intro here), and therefore a lot of the presentations are concerned with them.

(One paper is particularly interesting for those wanting to know how they might be of use in software engineering, by the way: Applications of Ontologies in Software Engineering.)

The idea of automatically extracting and learning ontologies from texts and other sources is not new (see a survey here). Doing it on software projects isn’t totally new either, but it looks rather fresh. A few papers I could find are directly concerned with it:

  1. Extracting Ontologies from Legacy Systems for Understanding and Re-Engineering
  2. Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Choices
  3. Extracting Ontologies from Software Documentation: a Semi-Automatic Method and its Evaluation

Software is a very structured thing. And with this structure comes meaning. If function F1 calls function F2, we could say F1 needs F2. That’s a relation, right there. If class C1 inherits from class C2, well, C1, as a concept, could be seen as a subconcept of C2. Some of those relations almost directly map to ontological relations. The paper OWL vs. Object Oriented Programming explores some of those mappings.
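
As a small illustration of the inheritance mapping (hypothetical classes, and Python rather than OWL, obviously), class hierarchies can be walked mechanically to emit IS-A relations:

    # Hypothetical class hierarchy standing in for real source code.
    class Window: pass
    class DialogWindow(Window): pass
    class FileDialog(DialogWindow): pass

    def isa_triples(classes):
        """Map OO inheritance to ontological IS-A relations."""
        triples = []
        for cls in classes:
            for base in cls.__bases__:
                if base is not object:
                    triples.append((cls.__name__, "IS-A", base.__name__))
        return triples

    print(isa_triples([Window, DialogWindow, FileDialog]))
    # [('DialogWindow', 'IS-A', 'Window'), ('FileDialog', 'IS-A', 'DialogWindow')]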

When learning ontologies, though, we’re more interested in high-level concepts: “graphical user interface” instead of “GUIClass”, or, worse, a bunch of functions with ten-feet-long names doing that particular job. So it isn’t as simple as creating a graph of function calls. As this paper puts it, we need to link the “structural component” to the “domain component”.

In that particular direction, some interesting work is presented in the second paper in the list above. Their approach is to merge words from multiple sources, structured (e.g. source code) and unstructured (e.g. mailing lists), and then do some filtering magic to select the main concepts forming the ontology (they explain it better in the paper 😛 ). Writing about their future work, they mention, as does that other paper above, exploiting inheritance for IS-A relationships.

* * *

Finally, coming back to my orientation problem. In the approach I talked about in previous posts, the emphasis was on word frequencies, much less on structure. As there seems to be a lot of potential in exploiting the structure of software to infer structure in the “domain component”, I think that’s where I’ll be heading. Next, I’ll probably post about the particular approach I’ll be taking.

The approach, with some sort of explanation

The general idea of the approach, as I used it*, is rather intuitive: break documents down into word frequencies, then relate words to documents as they ought to be related (yes, this statement could be made clearer), then group documents by similarity. The details are rather complicated, though.

Breaking documents down into word frequencies means extracting the number of occurrences of each word from each document. Said and done.
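
A minimal sketch of that step, with tiny made-up documents:

    from collections import Counter

    docs = ["the beach and the sun", "sun sun beach", "ski snow ski"]
    vocab = sorted({w for d in docs for w in d.split()})

    # Word-by-document frequency matrix: rows are words, columns documents.
    counts = [Counter(d.split()) for d in docs]
    matrix = [[c[w] for c in counts] for w in vocab]

    for word, row in zip(vocab, matrix):
        print(f"{word:6} {row}")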

The “relate words to documents as they ought to be related” part is tricky to explain. Well, in fact, it’d be best if I understood it completely myself, heh. It’s rather easy to implement, though (see the corresponding function in my code). It’s based on a technique called Latent Semantic Analysis (LSA), which takes as input a matrix with word frequencies as rows and documents as columns, or vice versa, and outputs a matrix of the same dimensions, but adjusted.

I’m not sure of all the details involved (why it works), but here’s some sort of explanation of what it does:

Suppose you take a set of 500-word documents about a specific sunny vacation resort. Those documents tend to contain about the same words: “beach”, “sun”, etc. But since each document is slightly different from the others, the number of occurrences of each of those words varies from one document to the next. If you’re only interested in the topic, though, those differences are irrelevant to you. If, for example, one of the documents doesn’t contain a single occurrence of the word “sun”, but lots and lots of “sunny”, well, the difference is quite irrelevant indeed.

Now, the interesting part: applying LSA to the word frequencies observed in these documents would result in a new and adjusted frequency for the word “sun”, higher than its actual, physical frequency. Doing this for all words in all documents of a set, what you obtain are distributions of frequencies that more adequately reflect a “desirable” distribution for a set of documents concerning different topics. So if you have documents about sunny vacation resorts, skiing resorts etc., the documents concerning similar topics would have more similar distributions of word frequencies after applying LSA. On the other hand, differences in word frequencies would be higher for documents concerning different topics.

This operation is performed by transforming the original matrix into three intermediate ones (a singular value decomposition), which can then be shrunk to a smaller size. After having shrunk them, you get a matrix of the original size back, but the process of shrinking the intermediate matrices results in a final matrix with less “noise”, meaning less of those undesirable differences I talked about above.
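
Here is a minimal sketch of that shrinking with numpy, reusing the toy matrix from the sketch above (k, the number of dimensions kept, is the knob to tune):

    import numpy as np

    # Toy word-by-document matrix from the earlier sketch
    # (rows: and, beach, ski, snow, sun, the; columns: 3 documents).
    A = np.array([[1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 2.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 2.0, 0.0],
                  [2.0, 0.0, 0.0]])

    # LSA: decompose into three matrices (SVD), keep only the k largest
    # singular values, and multiply back: same-sized matrix, less noise.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    A_adjusted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(A_adjusted, 2))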

That’s what it’s supposed to do in most cases, anyway.

For something quite a bit more detailed, with a real mathematical example, see: An Introduction to Latent Semantic Analysis. My explanations for LSA are heavily inspired by a part of that document, by the way.

Now for the “then group documents by similarity” part. Here, what we want are clearly defined sets: at the end of the process, one document is a member of one and only one set. Following a suggestion of M. Gagnon, I’ve done it with a basic clustering algorithm known as K-means, using the python-cluster library.

This algorithm takes a set of vectors in N-dimensional space (N being the number of words, here), and, given a desired number of groups K (hence K-means), associates each point with one of those groups, defining those groups in the process. This means you don’t know a priori what those groups will be, and you might end up with groups that have no relation to the topics you first thought about.
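
In my code I use the python-cluster library, but to show what the algorithm itself does, here is a minimal hand-rolled sketch (toy 2-D points instead of N-dimensional word vectors):

    import random

    def kmeans(points, k, iterations=20):
        """Minimal K-means: assign each point to the nearest of k centers,
        move each center to the mean of its points, repeat."""
        centers = random.sample(points, k)
        for _ in range(iterations):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centers[i])))
                groups[nearest].append(p)
            centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                       for i, g in enumerate(groups)]
        return groups

    # Two obvious clusters; K-means should recover them with k=2.
    pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
    print(kmeans(pts, k=2))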

* The real work, being a master’s thesis, is quite a bit more detailed, as I’ve said in the previous post.