It’s been two months since my last post. That’s not for lack of interest, mind you: building the technical basis for my experiments simply took a long time, and in the meantime there wasn’t much progress to report (I should seriously consider diversifying the subjects on this blog, I know 😛 ).
So I went on with my plan: modifying the Doxygen source code to extract information about the structure of a program. It wasn’t that complicated, but it took quite a while. The code isn’t commented much, which is rather surprising for a program that specializes in extracting metadata from comments. It works fine, though, and that’s what I was looking for.
As I said, the little hack I created extracts information about functions, function calls and the corresponding line numbers. I also preprocess the source code under analysis to create a copy that contains only the words I want to consider.
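To give an idea of what that preprocessing step could look like, here is a minimal sketch in Python. It is not the actual hack: the stop list, the `keep_words` name and the choice to keep identifiers only are all assumptions made for illustration, since which words get kept is not spelled out here. The one design point it tries to show is that each output line matches an input line, so line numbers reported by another tool would still point at the right place in the filtered copy.

```python
import re

# Hypothetical stop list: an assumption for this sketch, not the real one.
STOP_WORDS = {"int", "return", "if", "else", "for", "while", "void"}

# An identifier-like word: a letter or underscore, then letters, digits, underscores.
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def keep_words(source, stop_words=STOP_WORDS):
    """Return a copy of `source` holding only the retained words.

    Lines are filtered one by one, so line N of the result
    corresponds to line N of the input.
    """
    kept_lines = []
    for line in source.splitlines():
        words = [w for w in WORD.findall(line) if w not in stop_words]
        kept_lines.append(" ".join(words))
    return "\n".join(kept_lines)


sample = "int add(int a, int b) {\n    return a + b;\n}"
filtered = keep_words(sample)
print(filtered)
```

On the sample, the first line keeps `add a b`, the second keeps `a b`, and the closing brace becomes an empty line rather than disappearing.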
With this information in hand, I will now be able to proceed with extraction of function bodies and body extensions.
There are a few limitations to this approach, though. One that deserves mention is the analysis of code that relies on polymorphism, that is, code that handles objects through pointers typed as a parent class or interface, so that the object’s precise type is only known at runtime. If a call goes through such a pointer, it’s very difficult (or, in some cases, it plainly makes no sense) to extract links to the right classes and functions from the static source code: you may only know the parent class. That creates some noise in the data. For that reason, I’ll have to find a target code base for my experiments that doesn’t use polymorphism too much.
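The problem is easier to see on a toy example. The sketch below is in Python for brevity and is purely illustrative: the `Shape`, `Circle` and `Square` classes and the `candidate_targets` helper are made up, and the helper uses runtime introspection where a static tool would have to rebuild the class hierarchy from the source. The call `s.area()` in `total_area` can only be tied to the parent class when reading the code, while several implementations may actually run.

```python
from typing import Iterable


class Shape:
    def area(self) -> float:
        raise NotImplementedError


class Circle(Shape):
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return 3.14159 * self.radius ** 2


class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:
        return self.side ** 2


def total_area(shapes: Iterable[Shape]) -> float:
    # Reading the source, this call can only be linked to Shape.area;
    # which override really runs depends on the objects passed in.
    return sum(s.area() for s in shapes)


def candidate_targets(base: type, method: str) -> list:
    """List every implementation a call through `base` might reach."""
    targets = []
    stack = [base]
    while stack:
        cls = stack.pop()
        if method in vars(cls):  # defined or overridden in this class
            targets.append(f"{cls.__name__}.{method}")
        stack.extend(cls.__subclasses__())
    return sorted(targets)


print(candidate_targets(Shape, "area"))
```

A single call site ends up with three possible targets. Linking it to all of them, or only to the parent, both add the kind of noise described above.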
I now have to design detailed measures and statistics to be able to tell whether or not this approach* looks promising.
* At this point, I wouldn’t call it a “hypothesis”: it’s just a general path for further exploration. I’ll probably have to use hypotheses in the statistical sense, though.