fSignatures, a JavaScript library to check argument types and object interfaces

I’ve just written a small JavaScript library that allows you to wrap a function in another one, which will check whether the parameters and return value are of the right type. It’s a functionality I miss from statically typed languages. It also allows you to check whether a set of functions, along with specified signatures, are present in an object. This is similar to the concept of interface found in other languages, especially Java.

Some functionality may be missing. I’m trying to measure interest in the lib before investing more time in any particular direction.

Firefox: creating a shortcut to ScrapBook highlight operations with the Keyconfig extension

Summary: ScrapBook is a Firefox extension that allows one to save snippets from web pages locally, and to add notes and highlights to them. I wanted shortcuts to access the highlighting features. Using the Keyconfig extension, which allows you to create new keyboard shortcuts, I associated 4 keyboard commands to the 4 available highlight styles.

The bare steps

(The following, of course, assumes having the ScrapBook extension installed.)

1. I installed the Keyconfig extension and restarted Firefox. It’s available here, just click on keyconfig.xpi.

2. In Tools > Keyconfig, I first selected the “Keyconfig” option in the dropdown list at the top of the window.

3. For each highlight operation (there are 4 possible styles), I created a new key with the titles you may see in the screenshot (”scrapbook highlight 1″). The code associated with each is:

    sbPageEditor.highlight(1)

and I replaced “1″ by 2, 3 and 4 for the other operations. Also, I checked the “global” checkbox a the top of the “key editor” window.

4. After having defined that operation, I assigned it a keyboard shortcut , trying not to conflict with existing shortcuts for other operations.

5. I also configured two other operations which I find useful, “save” and “undo”, which are associated with the code

    sbPageEditor.saveOrCapture()

and

    sbPageEditor.undo()

respectively.

The underlying explanations

The “hard part” part (well, the non-obvious part) here was to find the code to activate the highlighting operations. That may be interesting to you if you want to repeat the process for other operations, perhaps in other extensions. There are many ways to go about doing this, one involving the DOM inspector in Firefox which is faster than the one I used. Mine involves exploring the extension source code, which I find instructive.

Firefox has a very extensible architecture. One key ingredient in this extensibility is XUL, the language used to describe components in the user interface. It’s real simple to grasp the basic principles, especially if you know HTML and JavaScript. Basically, extensions are programmed in XUL and JavaScript.

Firefox extensions may usually be found in your user directory (under Linux, by default, it’s in .mozilla/firefox/…/extensions, in your home directory). Their source code is right there for you to look at. Each extension is in its own directory, usually named with easy-to-remember strings like “{53A03D43-5363-4669-8190-99061B2DEBA5}” (here the Scrapbook extension directory name).

The code for ScrapBook is in there, in a JAR file (chrome/scrapbook.jar), which is just another extension for a Zip file, so you may open it and uncompress it with an unzipping utility. The file that interests us here is in this JAR file; it’s content/scrapbook/overlay.xul. This is the file that describes the way ScrapBook attaches itself to the main Firefox window, I believe.

Amongst other elements described in overlay.xul, you’ll find the toolbar which displays the highlight buttons (and “undo”/”save”). To find the line, search for “ScrapBookHighlighter”. The code for the operations is in the “oncommand” attribute.

To find the line and file of interest, which might seem non trivial given the number of files in there, I used a global search on files for the term “highlighter”.

Using xsel, xbindkeys and xmacro to insert common strings (date, name, etc.) in Linux

Goal: insert a date string (e.g. “2008/06/29″) in almost whatever text area/graphical program I’m using currently with a single keyboard shortcut.

Basic principle: copy the date string to the clipboard and emulate the Ctrl-V key combination, which activates the Paste operation in most programs.

Limitations:

  • The current program must support Ctrl-V as the Paste shortcut (won’t work on the command line, for example).
  • The current clipboard data is erased, replaced by the string inserted.

The bare steps I took

I’m using Ubuntu, but I guess this should work with most distributions, by adapting the installation instructions.

1. I installed the necessary utilities.

    sudo apt-get install xsel xbindkeys xbindkeys-config xmacro

2. I created a default xbindkeys configuration file.

    xbindkeys --defaults > ~/.xbindkeysrc

3. I wrote this shell script, which I saved on my local disk.

4. I configured xbindkeys using xbindkeys-config to launch the script when the Windows-D shortcut is pressed.

a. I loaded “xbindkeys-config” by typing that on the command line.

b. I created a new shortcut and associated it with the script.

c. I associated it with the Windows-D shortcut by pressing “Get Key” and then pressing Windows-D.

d. I saved and exited.

e. I arranged it so the “xbindkeys” command would run on every logon, which can be done by adding the line “xbindkeys” to a logon script (/home/…/.bash_profile, for example).

The underlying explanations

I often find it useful to insert the date in personal notes I take. So often, in fact, that it’s quite handy to automate the insertion. On Windows, there’s this handy app and scripting language called AutoIT which may be used to automate common tasks. But under Linux, quick googling doesn’t reveal any self-evident choice for an all-encompassing scripting language, so I went looking into more application-specific options.

The first key element to my solution, xsel, is a program that allows one to control the X selection and clipboard from the command line. The second utility, xbindkeys (and its configuration GUI, xbindkeys-config), as you guessed, allows you to associate keyboard shortcuts to commands. Finally, xmacro is a program that allows you to emulate specific keyboard key events, like key presses, and mouse events.

I therefore associated (using xbindkeys) a keyboard shortcut to a script that copies the date string to the clipboard (using xsel) and emulates the Paste keyboard shortcut (using xmacro). That’s a pretty complicated solution, but all in all it didn’t take so much time to set up. I commented the script so you may get a better understanding of the parameters used.

I could have used other shortcuts that Windows-D, but the Windows key isn’t used under Ubuntu, so this was a great occasion to capitalize on keyboard real estate.

As an alternative approach, which doesn’t replace the clipboard data, you can use xmacro to insert every single character in the date string, one at a time. I began by doing this, since it’s much simpler. The problem I ran into is instability: some programs need a delay between keypresses, otherwise they mix up the letters, and the insertion seems slow and sometimes misses letters. That limitation prompted me to try this solution.

If anyone has a suggestion as to how to improve this solution, feel free to post a comment about it.

Further reading and references

*bump*

This blog still lives on… in my heart :P Someday I’ll post something about the end of these adventures.

UPDATE (2008/06/07): I realize I never actually posted anything about said end of adventures. The fact is, the project came to an end mostly due to time considerations. I had a summer internship which used most of my time, so I could not put much more of it on the project in the following months, and a lot more work would have been needed in experimentation and tweaking to find out whether or not the idea actually worked… so basically the project was shelved.

Technical foundations ready

It’s been two months since my last post. This is not due to decreased interest, mind you, simply to the fact that creating the technical basis to experiment was rather long, and that in the meantime there wasn’t much progress to report on (I should seriously consider diversifying subjects on that blog, I know :P ).

So I went on with my plan: modifying the doxygen source code to extract information about the structure of the program. It wasn’t that complicated, but it took quite a while. The code isn’t commented that much, which is rather surprising for a program that specializes in extracting metadata from comments. It works fine, though, and that’s what I was looking for.

As I said, the little hack I created extracts information about function calls, functions and corresponding lines. I also preprocess the source code (target of analysis) to create a copy which contains only the words which I wan’t to consider.

With this information in hand, I will now be able to proceed with extraction of function bodies and body extensions.

There are a few limitations to this approach, though, and one that deserves mention is analysis of code containing polymorphism (creating pointers to objects with knowledge of only the parent class or interface, and not knowing precisely the object’s type until runtime). If a call is based on a pointer making use of polymorphism, it’s very difficult (or it just plainly makes no sense, in some cases) to extract links to the right classes/functions from the static source code — you may only know the parent class. That creates some noise in the data. For that reason, I’ll have to find a target source code for my experiments that doesn’t use polymorphism too much.

I now have to design detailed measures and statistics to be able to tell wether or not this approach* looks promising.

* At this point, I wouldn’t call it an “hypothesis”: it’s just a general path for further explorations. I’ll probably have to use hypothesises in the statistical sense, though.

Relationship between words and enclosing pieces-of-code

It’s been a while since I last posted, the reason being that I preferred to wait to be able to work on the project for a few days in a row, which I did two weeks ago, but I still needed some time. While I still don’t have any working code, my ideas on how to proceed are getting much clearer. I’ll first explain what I want to produce, why, and then how I intend to produce it. This post will be long, but it needs to be.

In the last post, I said I intended to use the structure found in the software (function calls, class instances etc.) to induce structure in the domain concepts (”files”, “windows”…), as a few papers suggested. There are, of course, quite a lot of ways of doing this. Fundamental to all of them, I guess, is the way the words (or, more specifically, concepts) found in the source code relate to the piece of code they are found in.

For example, if the body of a function contains the word “window”, in a comment, variable name, whatever, what does this mean about the function? Is is that the function uses “window”, creates “window”…? Of course there’s no obvious answer, even if we perform some (simple) analysis of the context in which the word is used. The real answer comes from “general understanding”, which is a pretty difficult thing to code.

There is, however, some information about this relation that we might gather from looking at the program structure alone. As I said earlier, if a function F1 calls a function F2, then we might say F1 needs F2. That’s not a relation of concepts, just a relation of functions. The goal is to end up knowing that F1 is strongly related to the concept “window” (for instance), and F2 to “file”, and, from the “needing” relation, induce that “window” needs “file”.

Still lots of ways to do it. Here’s my shot at it.

I concentrate on two types of concept-”piece-of-code” relation (”piece-of-code” being a function, class, file or directory):

  • A concept is represented by a piece-of-code
  • A concept is needed by a piece-of-code (that is, the piece-of-code needs another (or a few other) piece-of-code which represents that concepts)
  • (for sake of completeness, I also need to mention the “no strong relation” possibility)

The reason I chose these two, and no more, is that they provide useful semantic information, and we can gather that information by simply analyzing the structure of the code. I considered using “relates to” instead of “needed”, but I feel I’m losing meaning in the abstraction, with no apparent justification.

Now, a word on why this distinction is useful. In the previous approach I considered (LSA and clustering, see previous posts), a word was either related with a piece-of-code, or it was not. The relatedness was deduced from the presence of absence of said word from the piece-of-code under consideration. Afterwards, algorithms were applied which determined which words related more to which pieces-of-code. Frequencies were the key here. But you ended up with pieces-of-code-as-reprensented-by-XYZ-words, with no regards as to wether the concept was indeed descriptive of the piece-of-code, or simply very strongly related to it. My idea would be to be able to make such a distinction, to then have better data to be used in creating an ontology.

Problem is: I don’t know is if the linguistic information in the code (variable names, comments, mainly) is usually sufficiently clear to establish such a distinction in a rather simplistic programmatic way. That’s where experimentation comes in, and I’ll describe the way I intend to experiment first.

Now, to go any further, I need to introduce the concept of body extension* of a piece-of-code. Usually, the function body is simply the code between its enclosing braces (function foo(){ return bar; } -> “return bar;” is the body). Now, the body extension is the code not found in the body of the function, but found in the code surrounding it’s relations “spots” (function call, variable use). So, if function foo() is called the following way:

function caller() {

// We now call foo() to get bar

bar = foo();

}

then the comment preceding the call is part of the body extension of the function. That comment is also part of the body of of caller(), though. The intent, here, is to add the frequencies of the words in the calling environment (lines surrounding call) to the word frequencies associated with the function foo(). Seeing that those words are part of the body extension of some other piece-of-code, though, we could also decrease the importance of the link to the caller() function (by decreasing their frequency a bit, or something).

Now, having “cleaned” the word frequencies that way, if we apply algorithms based on frequency (ex: LSA and clustering, as mentionned before), the words will tend to be associated less with “needing” pieces-of-code, and more with “representing” pieces-of-code. Then, if a word is very frequent in a piece of code, but is not in “representing” relation to it, it must be that it is in “needing” relation with it. I must say, though, that this is certainly not the cleanest nor most interesting use of the body extension concept, but this post is getting longishly long, and I need to get down to the technical steps one day or the other. Note, also, that the whole point here is to see if this concept adds truly useful information, and, following experimentation, further guide development of some kind of ontology extraction method.

So here are the steps, in birds-eye view:

  1. Extract information about pieces-of-code, that is, extract the hierarchy of directories/files/classes/functions, and have a way to track down line numbers corresponding to those pieces-of-code.
  2. Extract information about pieces-of-code uses (function calls, class uses), and have a way to track down line numbers, here, too.
  3. Create a copy of the code, and process it by extending identifier (variable names, function names, etc.) to complete words as much as possible.
  4. Using line numbers found in steps 1 and 2, and using some definition for the body extension (say, 2 lines above and 1 below call), extract frequencies.
  5. Apply *something* to those frequencies. In my case it’ll probably be LSA and clustering, at first. Once here, I can also experiment with different combinations of body/body extension frequencies.

I intend to extract structure by modifying the source code of doxygen, a tool that generates documentation. The reason I chose this tool, and not some more specialized fact extractor or other reverse engineering software, is that doxygen keeps track of line numbers for function calls etc., fundamental to the body extension thing. It’s not the cleanest solution (since a byproduct of the operation is generating documentation!), but from what I gather, doxygen can extract structure from a few important languages, and does so in a robust manner, creating very useful data structure along the way (fully referenced function calls, extending from line #1 to line #2…).

So that’s what I intend to do. I might also completely change my mind if a better idea comes along, but I think to make further progress in creating an ontology, I need to clarify the relationship between words and pieces-of-code. Another important point is that the tools created along the way will be useful in my next steps, and that’s important since almost half of my time on that project has already been spent.

* I had first called it “extended body”, but as Michel noted, it makes more sense to say “body extension”.

Ontology learning and ontology-based software engineering

About a week and a half ago, I briefly met with Michel. I had a feeling of what I was going to do next, but wanted to know if he had particular suggestions. He told me to go ahead if I felt I knew what do, but, neverthless, he suggested I look into “ontology-based software engineering”.

I was talking earlier about concept extraction from source code. What I meant, in fact, is ontology extraction. An ontology, in this context, is a “specification of a conceptualisation of a knowledge domain”. Stated another way, it represents concepts and links between them for a given topic. For example, an ontology about computers could have concepts “laptop”, “desktop” and “server” specified as children of the concept “computer”, “children of” being a type of relation.

They (ontologies) have many uses, providing a common base of exchange on a given topic, like a protocol. As they are very formal things, computer programs can analyze them and “reason” with the data they provide.

It turns out a lot of research has been/is going into ontologies. Ontology-based software engineering is an important area of it. In fact, there’s a workshop on a strongly related topic being held at this very moment, the 2nd International Workshop on Semantic Web Enabled Software Engineering. Ontologies are a very important component of what is called the Semantic Web (see an intro here), and therefore a lot of the presentations are concerned with them.

(One paper is particularly interesting for those wanting to know how they might be of use in software engineering, by the way: Applications of Ontologies in Software Engineering.)

The idea of automatically extracting and learning ontologies from texts and other sources is not new (see a survey here). Doing it on software projects isn’t totally new either, but it looks rather fresh. A few papers I could find are directly concerned with it:

  1. Extracting Ontologies from Legacy Systems for Understanding and Re-Engineering
  2. Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Choices
  3. Extracting Ontologies from Software Documentation: a Semi-Automatic Method and its Evaluation

Software is a very structured thing. And with this structure comes meaning. If function F1 calls function F2, we could say F1 needs F2. That’s a relation, right there. If class C1 inherits from class C2, well, C1, as a concept, could be seen as a subconcept of C2. Some of those relation almost directly map to ontological relations. The paper OWL vs. Object Oriented Programming explores some of those mappings.

When learning ontologies, though, we’re more interested in high level concepts, “graphical user interface” instead of “GUIClass”, or, worse, a bunch of functions with ten-feet-long names doing that particular job. So it isn’t as simple as creating a graph of function calls. As this paper puts it, we need to link the “structural component” to the “domain component”.

In that particular direction, some interesting work is presented in the second paper in the list above. Their approach is to merge words from multiple sources, structured (eg. source code) and unstructured (eg. mailing lists) and then do some filtering magic to select the main concepts forming the ontology (they explain it better in the paper :-P ). Writing about their future work, they mention, as that other paper above, exploiting inheritance for IS-A relationships.

* * *

Finally, coming back to my orienteering problem. In that approach I talked about in previous posts, the accent was on word frequencies, much less on structure. As there seems to be a lot of potential in exploiting the structure of software to infer structure in the “domain component”, I think that’s where I’ll be heading. Next, I’ll probably post about what particular approach I’ll be taking.

The approach, with some sort of explanation

The general idea of the approach, as I used it*, is rather intuitive: separate documents in word frequencies, then relate words to documents as they ought to be related (yes, this statement could be made clearer), then group documents by similarity. The detail is rather complicated, though.

Separating documents in word frequencies means extracting the number of occurences of each word from each document. Said and done.

The “relate words to documents as they ought to be related” part is tricky to explain. Well, in fact, it’d be best if I understood it completely myself, heh. It’s rather easy to implement, though (see the corresponding function in my code). It’s based on a technique called Latent Semantic Analysis (LSA), which takes as input a matrix of word frequencies as rows and documents as colums, or vice versa, and outputs a matrix of the same dimensions, but adjusted.

I’m not sure of all the details involved – why it works –, but here’s some sort of explanation of what it does:

Suppose you take a set of documents of 500 words about a specific sunny vacation resort. Those documents tend to contain about the same words: “beach”, “sun”, etc. But each document being slightly different from one another, the number of occurrences of each of those words vary from one document to the next. If you’re only interested in the topic, those differences are irrelevant to you, though. If, for example, one of the documents doesn’t contain a single occurrence of the word “sun”, but lots and lots of “sunny”, well, the difference is quite irrelevant indeed.

Now, the interesting part: applying LSA to the word frequencies observed in these documents would result in a new and adjusted frequency for the word “sun”, higher than it’s real and physical frequency. Doing this for all words in all documents of a set, what you obtain is distributions of frequencies that reflects more adequately a “desirable” distribution for a set of documents concerning different topics. So if you have documents about sunny vacation resorts, skiing resorts etc., the documents concerning similar topics would have more similar distribution of word frequencies after applying LSA. On the other hand, differences in word frequencies would be higher for documents concerning different topics.

This operation is performed by transforming the original matrix in three intermediate ones, which can then be shrunk to smaller size. After having shrunk them, you get a matrix of the original size back, but the process of shrinking the intermediate matrices results in a final one with less “noise”, meaning less of those undesirable differences I talked about above.

That’s what it’s supposed to do in most cases, anyway.

For something quite a bit more detailed, with a real mathematical example, see: An Introduction to Latent Semantic Analysis. My explanations for LSA are heavily inspired by a part of that document, by the way.

Now for the “then group documents by similarity” part. Here, what we want are clearly defined sets: one document is a member of one and only one set, at the end of the process. Following a suggestion of M. Gagnon, I’ve done it with a basic clustering algorithm known as K-means. I’ve used the python-cluster library to do this.

This algorithm takes a set of vectors in N-dimensional space (N being the number of words, here), and, given a desired number of groups K (hence K-means), associates each point to one of those groups, defining those groups in the process. This means you don’t know a priori what those groups will be, and might end up with groups that have no relation to the topics you first thought about.

* The real work, being a master thesis, is quite a bit more detailed, as I’ve said in the previous post.

First steps

I’ve been searching to know if others had done some work on linguistic information extraction from source code. I used to think of this as rather specific (read: unexplored), but I now stand corrected:

Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code

This a master thesis done by Adrian Kuhn of the Software Composition Group at the University of Bern. When I found out about this, I had reacted along the lines of “*argh*, it’s already been done!”. My project supervisor thought otherwise and suggested I use the skeleton of the approach taken by the author* to begin my exploration. Which I did, with simplifications**.

So here’s the code, as of now, written in Python: lsa_cluster.py (GPL). Please note that I haven’t yet tested it on a real set of documents, only on a tiny (tiny (tiny (…))) set of tiny strings (=9 strings of 10-15 words). It worked perfectly for that case, though :-P But its execution might take a while on a real set (hundreds of documents with thousands of distinct words). I have yet to assemble such a set to test it.

In the following weeks, I will try to post some general explanations of the topics involved.

* The real work by Adrian Kuhn, being a master thesis, is quite a bit more detailed, and is not simply a program. My code simply replicates the (very) general idea, as interpreted by me.

** In a comment, in my Python source code, you’ll find a list of the most notable simplifications involved.

Extracting concepts from source code

About myself, briefly, to give some context: I’m an undergraduate student in software engineering at École Polytechnique de Montréal.

For the following seven months or so, I’ll be working on a small (read: 4-5 hours a week) extra-curricular research project concerning “automatic methods to extract linguistic information from computer programs”. Stating it differently, that means I’ll be writing (or trying to write) programs to extract meaningful keywords, and relationships among them, from source code and related documents. This project is supervised by Michel Gagnon, a professor in the department of Computer Engineering at École Polytechnique de Montréal. He’s also the one who suggested the topic.

On another note, it has been a while since I wanted to start blogging, so I figured that this was the perfect occasion to do so. I’ll therefore be blogging about this project and interesting books, links and ideas I fall upon along the way. The audience I’ll have in mind is people with some familiarity with programming, but not necessarily with the specific topics I’ll be exploring.