Archive for November 2006

Ontology learning and ontology-based software engineering

About a week and a half ago, I briefly met with Michel. I had a feeling for what I was going to do next, but wanted to know if he had particular suggestions. He told me to go ahead if I felt I knew what to do but, nevertheless, suggested I look into “ontology-based software engineering”.

I was talking earlier about concept extraction from source code. What I meant, in fact, is ontology extraction. An ontology, in this context, is a “specification of a conceptualisation of a knowledge domain”. Stated another way, it represents concepts and links between them for a given topic. For example, an ontology about computers could have concepts “laptop”, “desktop” and “server” specified as children of the concept “computer”, “children of” being a type of relation.
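To make the computer example concrete, here is a minimal sketch in Python that stores an ontology as plain (subject, relation, object) triples. The relation name `child_of` and the helper `children_of` are my own illustrative choices, not part of any ontology standard or library.

```python
# A toy ontology as (subject, relation, object) triples.
# All names here are illustrative assumptions.
triples = [
    ("laptop", "child_of", "computer"),
    ("desktop", "child_of", "computer"),
    ("server", "child_of", "computer"),
]

def children_of(concept, triples):
    """Return the concepts directly declared as children of `concept`."""
    return sorted(
        subject
        for subject, relation, obj in triples
        if relation == "child_of" and obj == concept
    )

print(children_of("computer", triples))
```

Real ontologies are usually written in a dedicated language such as OWL, but the underlying idea of concepts connected by typed relations is the same.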

Ontologies have many uses, such as providing a common basis for exchange on a given topic, much like a protocol. As they are very formal things, computer programs can analyze them and “reason” with the data they provide.

It turns out a lot of research has gone, and is still going, into ontologies, and ontology-based software engineering is an important part of it. In fact, there’s a workshop on a strongly related topic being held at this very moment, the 2nd International Workshop on Semantic Web Enabled Software Engineering. Ontologies are a very important component of what is called the Semantic Web (see an intro here), and therefore a lot of the presentations are concerned with them.

(One paper is particularly interesting for those wanting to know how they might be of use in software engineering, by the way: Applications of Ontologies in Software Engineering.)

The idea of automatically extracting and learning ontologies from texts and other sources is not new (see a survey here). Doing it on software projects isn’t totally new either, but it looks rather fresh. A few papers I could find are directly concerned with it:

  1. Extracting Ontologies from Legacy Systems for Understanding and Re-Engineering
  2. Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Choices
  3. Extracting Ontologies from Software Documentation: a Semi-Automatic Method and its Evaluation

Software is a very structured thing, and with this structure comes meaning. If function F1 calls function F2, we could say F1 needs F2. That’s a relation, right there. If class C1 inherits from class C2, well, C1, as a concept, could be seen as a subconcept of C2. Some of those relations map almost directly to ontological relations. The paper OWL vs. Object Oriented Programming explores some of those mappings.
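As a rough illustration of reading such relations straight out of code, the sketch below uses Python's standard `ast` module to turn inheritance and function calls into triples. The relation names `subconcept_of` and `needs`, the function `extract_relations` and the sample source are assumptions of mine for the example; this is not the method of any of the papers mentioned here.

```python
import ast

# Tiny sample program mirroring the C1/C2 and F1/F2 example (hypothetical).
SOURCE = '''
class C2:
    pass

class C1(C2):
    pass

def f2():
    return 1

def f1():
    return f2() + 1
'''

def extract_relations(source):
    """Collect (subject, relation, object) triples from Python source text."""
    relations = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Inheritance: each base given as a plain name is a candidate
            # subconcept link.
            for base in node.bases:
                if isinstance(base, ast.Name):
                    relations.add((node.name, "subconcept_of", base.id))
        elif isinstance(node, ast.FunctionDef):
            # Calls: every call to a plain name found anywhere inside the
            # function (nested definitions included).
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    relations.add((node.name, "needs", inner.func.id))
    return relations

relations = extract_relations(SOURCE)
for triple in sorted(relations):
    print(triple)
```

Method calls such as `obj.method()` and dotted base classes are deliberately ignored to keep the sketch short; a real extractor would have to resolve them.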

When learning ontologies, though, we’re more interested in high-level concepts: “graphical user interface” instead of “GUIClass” or, worse, a bunch of functions with ten-foot-long names doing that particular job. So it isn’t as simple as creating a graph of function calls. As this paper puts it, we need to link the “structural component” to the “domain component”.

In that particular direction, some interesting work is presented in the second paper in the list above. Their approach is to merge words from multiple sources, structured (e.g. source code) and unstructured (e.g. mailing lists), and then do some filtering magic to select the main concepts forming the ontology (they explain it better in the paper 😛 ). Writing about their future work, they mention, like the other paper above, exploiting inheritance for IS-A relationships.
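To give a feel for the general idea of merging words from code and from free text, here is a deliberately naive sketch. It is not the paper's algorithm: the filter (keep words seen in both sources, drop a small stopword list, rank by combined count), the function names and the sample data are all assumptions I made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "is", "to", "in", "class"})

def split_identifier(name):
    """Split camelCase / snake_case identifiers into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [part.lower() for part in parts]

def words_from_text(text):
    """Extract lowercase alphabetic words from free text."""
    return re.findall(r"[a-z]+", text.lower())

def candidate_concepts(identifiers, messages):
    """Rank words that occur in both the code and the text sources."""
    code_counts = Counter(w for ident in identifiers for w in split_identifier(ident))
    text_counts = Counter(w for msg in messages for w in words_from_text(msg))
    merged = code_counts + text_counts
    kept = {
        word: count
        for word, count in merged.items()
        if word in code_counts and word in text_counts and word not in STOPWORDS
    }
    # Most frequent first, ties broken alphabetically.
    return sorted(kept, key=lambda word: (-kept[word], word))

identifiers = ["GUIClass", "drawWindow", "window_size"]
messages = ["The window does not redraw in the GUI", "Window size is wrong"]
print(candidate_concepts(identifiers, messages))
```

Even this crude filter shows why combining sources helps: words that appear only in identifiers or only in mailing-list chatter get dropped, while terms used in both places surface as candidate concepts.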

* * *

Finally, coming back to my orienteering problem. In the approach I described in previous posts, the emphasis was on word frequencies and much less on structure. As there seems to be a lot of potential in exploiting the structure of software to infer structure in the “domain component”, I think that’s where I’ll be heading. Next, I’ll probably post about what particular approach I’ll be taking.