Weekly study log (Oct 2, 2009)

This is a “weekly report” for the lab I study in, mostly intended for other lab members. See the first one for further explanations.

Readings

This week I read papers related to a term project proposed to me by Yoshua for my machine learning course.

  • Jason Weston, Frédéric Ratle, Ronan Collobert: Deep learning via semi-supervised embedding. ICML 2008: 1168-1175
  • Hossein Mobahi, Ronan Collobert, Jason Weston: Deep learning from temporal coherence in video. ICML 2009: 93
  • Ranzato, M. A., Huang, F. J., Boureau, Y. L., and Lecun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1-8.

For the first two, I understand the essence pretty well (Yoshua had summarized it for Razvan and me), but they mention many background and comparison techniques I’m not familiar with. The “sparse features” paper is also pretty clear, but in all three cases I’ll need to dig deeper to really be able to implement any of this (I skipped some sections to concentrate on the gist of the methods, so that I can write about them in my project description).

I also read the first half of this paper, which gave me an idea of the various NLP tasks the authors tackle simultaneously (I then set it aside to concentrate on the papers above):

  • Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In William W. Cohen, Andrew McCallum and Sam T. Roweis (eds.), ICML, ACM, pages 160-167.

General course/curriculum work

  • My teammate and I made progress on our Shannon’s game assignment for the NLP course; the foundations are pretty solid. Still, more work is needed for fancier frequency-based models.
  • Finished the Machine learning homework.
  • Finished assembling the material for my scholarship applications. I still need to write some text about my earlier projects etc., but the rest is there.

Plan for next week

I plan on continuing work on the NLP course project. I must also elaborate a plan for the Machine learning term project and start exploring the underlying codebase (Theano and some code examples I’ll ask Razvan about).


Weekly study log: MLP continued

This is a “weekly report” for the lab I study in, mostly intended for other lab members. See the first one for further explanations.

Lab-related projects

As mentioned in the last post, for my MLP experiments I had started working on classification of the Iris dataset. Performance was pretty poor, so Pierre-Antoine suggested I use a softmax activation function for the outputs. I did that this week, using the proper gradient (he sent me a document about it). It worked almost right away, with training/test error very close to what I obtained with other methods (histogram-based, Bayes classifier with multivariate Gaussians) in lab work for IFT6390.
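To make the "proper gradient" remark concrete, here is a small sketch, not the code used for these experiments. It assumes the softmax outputs are trained with a cross-entropy loss against 1-in-k targets (the post does not say which loss was used). Under that assumption, the gradient of the loss with respect to each output activation is simply the predicted probability minus the target. The function names and the activation values are made up for illustration, and the analytic gradient is checked against a finite-difference estimate.

```python
import math


def softmax(activations):
    """Turn raw output activations into probabilities that sum to 1."""
    shift = max(activations)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(activations, target):
    """Cross-entropy loss for a 1-in-k target (assumed loss, see lead-in)."""
    probs = softmax(activations)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def output_gradient(activations, target):
    """Analytic gradient of the loss w.r.t. each activation: y_k - t_k."""
    return [p - t for p, t in zip(softmax(activations), target)]


def numerical_gradient(activations, target, eps=1e-6):
    """Central finite differences, used only to check the formula above."""
    grads = []
    for k in range(len(activations)):
        up = list(activations)
        down = list(activations)
        up[k] += eps
        down[k] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads


# Made-up activations for a 3-class problem, with class 2 as the target.
activations = [0.8, 0.4, -1.0]
target = [0.0, 0.0, 1.0]
analytic = output_gradient(activations, target)
numeric = numerical_gradient(activations, target)
```

The two gradients agree to within numerical precision, which is a cheap sanity check worth running before plugging a new output layer into backpropagation.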

I also started implementing other tricks found in Efficient backprop (LeCun), but I had little time and I realized I’d better work on visualization and/or find better ways to see for myself why the tricks work (why using softmax produced such results, for example). Blind implementation is not going to help me much for learning purposes. I’ll have to think about it.

Readings

  • I read most of “Extracting and composing robust features with denoising autoencoders” by Pascal Vincent et al. I spent quite a while pondering some of the alternative mathematical perspectives (information theory, stochastic operator), trying to understand them correctly. I asked Pascal a few questions directly, and after some more brain cycles I think it’s pretty clear.
  • I read the first half of “Loss Functions for Discriminative Training of Energy-Based Models” by LeCun and Huang. My desire was to understand what energy-based models are, and the advantage offered by that perspective on learning.
  • I read rather thoroughly some LISA project descriptions sent to me by Yoshua.

General course work

  • I began the first homework for the Machine learning class. I’m using it as an occasion to learn to work with LaTeX (I know basic math syntax, but not much more).
  • We’ve got a work assignment in the NLP class where we need to predict letters one by one in SMS text (Shannon’s game) by developing a suitable language model. I started work on this with a teammate.
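As an illustration of what a "suitable language model" for Shannon's game might look like, here is a minimal character n-gram sketch. It is not the assignment's solution: the class name, the trigram order, the backoff scheme and the tiny training string are all assumptions made for the example. The model counts which character follows each context, then ranks guesses for the next letter, falling back to shorter contexts for characters the longest context has never seen.

```python
from collections import Counter, defaultdict


class CharNgramModel:
    """Hypothetical character n-gram model with simple backoff."""

    def __init__(self, order=3):
        self.order = order
        # Maps a context string to a Counter of the characters following it.
        self.counts = defaultdict(Counter)

    def train(self, text):
        n = self.order - 1
        padded = " " * n + text  # pad so the first letters have a context
        for i in range(n, len(padded)):
            context = padded[i - n:i]
            # Record the character under every suffix of the context,
            # including the empty one (plain letter frequencies).
            for start in range(len(context) + 1):
                self.counts[context[start:]][padded[i]] += 1

    def ranked_guesses(self, history):
        """Return candidate next characters, most likely first."""
        n = self.order - 1
        padded = " " * n + history
        context = padded[len(padded) - n:]
        guesses = []
        # Longest context first, then shorter ones for unseen characters.
        for start in range(len(context) + 1):
            followers = self.counts.get(context[start:], Counter())
            for char, _ in followers.most_common():
                if char not in guesses:
                    guesses.append(char)
        return guesses


model = CharNgramModel(order=3)
model.train("see you soon, see you later")
guesses = model.ranked_guesses("see yo")
```

In the game, the quality of a model would then be judged by how early the true next letter appears in the ranked list, averaged over a held-out text.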

Plan for next week

In terms of projects, I’ll be working mostly on the Shannon’s game program, trying to get as far as I can on this front. Papers to read: to be done.

Tests with basic multilayer perceptron

I’ve recently begun an MSc in computer science at Université de Montréal. I’ll be working in the LISA lab, which is concerned with machine learning. I’ll be producing weekly reports and I’ll be using this blog as a conduit for them, as I was doing a few years ago for UPIR (see the first posts of the blog). These being reports, the audience I’ll have in mind is people already acquainted with the machine learning concepts involved.

*  *  *

Over the summer I read a bit of the main reference book we’re using for our machine learning course (Pattern recognition and machine learning, by Christopher Bishop), but I hadn’t actually implemented anything. Yoshua Bengio, my thesis supervisor, therefore suggested I do some experiments, starting off with multilayer perceptrons (MLP) and backpropagation, a very common approach in machine learning.

I’ve implemented the core of the algorithm described in chapter 5 of Bishop for a two-layer (one hidden layer) MLP. Everything was pretty straightforward, except perhaps for the handling of bias weights. I’ve been using Python, NumPy and Matplotlib, which are used in the lab and courses here.
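For readers who want something concrete, here is a minimal sketch of such a network: one input, one tanh hidden layer, one linear output, trained by stochastic gradient descent on a sine wave. It is not the code used for these experiments. It uses only the standard library to stay self-contained, and the function name, default values and the choice to keep biases as separate parameters (rather than as weights on a constant input) are all illustrative assumptions.

```python
import math
import random


def train_sine_mlp(num_hidden=10, learning_rate=0.02, epochs=300,
                   init_std=0.5, seed=0):
    """Train a 1-input, 1-output MLP on sin(x); return (initial, final) MSE."""
    rng = random.Random(seed)
    # Training set: 50 points over one period of the sine wave.
    xs = [2 * math.pi * i / 49 - math.pi for i in range(50)]
    data = [(x, math.sin(x)) for x in xs]

    # One weight and one bias per hidden unit, plus one output bias.
    w1 = [rng.gauss(0, init_std) for _ in range(num_hidden)]
    b1 = [0.0] * num_hidden
    w2 = [rng.gauss(0, init_std) for _ in range(num_hidden)]
    b2 = 0.0

    def predict(x):
        hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
        return sum(w * h for w, h in zip(w2, hidden)) + b2, hidden

    def mean_squared_error():
        return sum((predict(x)[0] - t) ** 2 for x, t in data) / len(data)

    initial_error = mean_squared_error()
    for _ in range(epochs):
        rng.shuffle(data)
        for x, t in data:
            y, hidden = predict(x)
            delta_out = y - t  # dE/dy for E = 0.5 * (y - t)^2, linear output
            for j in range(num_hidden):
                # Backpropagate through tanh: d tanh(a)/da = 1 - tanh(a)^2.
                delta_hidden = delta_out * w2[j] * (1 - hidden[j] ** 2)
                w2[j] -= learning_rate * delta_out * hidden[j]
                w1[j] -= learning_rate * delta_hidden * x
                b1[j] -= learning_rate * delta_hidden
            b2 -= learning_rate * delta_out
    return initial_error, mean_squared_error()


initial_error, final_error = train_sine_mlp()
```

Treating the biases as their own parameters means each one is updated with the unit's error term alone, which is the same as giving every unit an extra input fixed at 1.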

I first trained the network to reproduce a sine wave over about one period, a regression problem. The network therefore had one input and one output. As predicted by Yoshua, performance was pretty poor at first, even though I could see it was working to some degree (error went down with training), because some hyperparameters needed tweaking, notably:

  • the number of hidden units,
  • the learning rate,
  • the number of training steps,
  • the type of activation function to use for hidden units,
  • the distribution of the initial random network weights.

At first I tweaked the hyperparameters by hand, but I quickly realized this would take eons, as performance for a given set of hyperparameters varied from one training to the next. So I wrote a class named HyperparameterVariator which, in conjunction with a simple loop, generated new sets of random hyperparameters given “Hyperparameters parameters” objects (e.g. the NUM_HIDDENS hyperparameter can vary from 5 to 30 etc.). It ran for about 40 minutes; I made it generate an HTML report with graphs to be able to see what predictions looked like for a given set of hyperparameters.
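The general idea is a random search over hyperparameters, which can be sketched as follows. This is not the HyperparameterVariator class itself: the function names are hypothetical, every range except the 5 to 30 range for NUM_HIDDENS is invented, LOGISTIC as the alternative activation is an assumption, and `toy_score` is a stand-in for actually training a network and measuring its error.

```python
import random

# Hypothetical search ranges; only NUM_HIDDENS (5 to 30) comes from the post.
RANGES = {
    "NUM_HIDDENS": ("int", 5, 30),
    "LEARNING_RATE_NEG_EXPONENT": ("float", 1.0, 4.0),
    "TRAINING_STEPS": ("int", 50, 500),
    "WEIGHTS_ORIGINAL_DISTR_STD": ("float", 0.1, 1.0),
    "ACTIVATION_FUNCTION": ("choice", ["TANH", "LOGISTIC"]),
}


def sample_hyperparameters(rng, ranges=RANGES):
    """Draw one random value for each hyperparameter."""
    sample = {}
    for name, spec in ranges.items():
        kind = spec[0]
        if kind == "int":
            sample[name] = rng.randint(spec[1], spec[2])
        elif kind == "float":
            sample[name] = rng.uniform(spec[1], spec[2])
        else:
            sample[name] = rng.choice(spec[1])
    return sample


def random_search(train_and_score, num_sets=30, runs_per_set=5, seed=0):
    """Try num_sets random settings, averaging the error over several runs."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_sets):
        params = sample_hyperparameters(rng)
        errors = [train_and_score(params, rng) for _ in range(runs_per_set)]
        results.append((sum(errors) / len(errors), params))
    # Lowest mean error wins; the key avoids comparing dicts on ties.
    return min(results, key=lambda result: result[0])


def toy_score(params, rng):
    """Stand-in for real training: a noisy error, lowest near 22 hidden units."""
    return abs(params["NUM_HIDDENS"] - 22) / 22 + 0.1 * rng.random()


best_error, best_params = random_search(toy_score)
```

Averaging over several runs per setting matters because, as noted above, the error for a fixed set of hyperparameters changes from one training run to the next.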

I ran the loop 30 times, each time trying a set of hyperparameters for 5 training runs. I was lucky: I found a set of hyperparameters which consistently produced very little error on regression for the sine wave. For example, here’s the curve (non-probabilistic output) for the best set of weights generated (blue is the output, green is the original wave):

Output vs original sine wave

Of course, for other hyperparameters, there were hundreds of curves which bore only a very dubious resemblance to the original sine wave! At least I had one set which gave good results:

LEARNING_RATE_NEG_EXPONENT: 2.86065622478
TRAINING_STEPS: 349.0
NUM_HIDDENS: 22.0
WEIGHTS_ORIGINAL_DISTR_STD: 0.9608633875
ACTIVATION_FUNCTION: TANH

I therefore modified HyperparameterVariator to allow for randomization of one parameter at a time, the others keeping the “safe” values found above. Yet, even after running this a few times, the parameters above remained the best. This probably means I was lucky in that first totally random run!
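Varying one parameter at a time around a known-good baseline can be sketched like this. Again, this is not the actual HyperparameterVariator code: the baseline values are the ones listed above, rounded, while the function name and all ranges other than 5 to 30 for NUM_HIDDENS are assumptions.

```python
import random

# "Safe" values from the list above, rounded for readability.
SAFE = {
    "LEARNING_RATE_NEG_EXPONENT": 2.86,
    "TRAINING_STEPS": 349,
    "NUM_HIDDENS": 22,
    "WEIGHTS_ORIGINAL_DISTR_STD": 0.96,
}

# Hypothetical (low, high) ranges; only NUM_HIDDENS comes from the post.
RANGES = {
    "LEARNING_RATE_NEG_EXPONENT": (1.0, 4.0),
    "TRAINING_STEPS": (50, 500),
    "NUM_HIDDENS": (5, 30),
    "WEIGHTS_ORIGINAL_DISTR_STD": (0.1, 1.0),
}


def vary_one(name, rng, base=SAFE, ranges=RANGES):
    """Return a copy of the baseline with a single hyperparameter resampled."""
    low, high = ranges[name]
    candidate = dict(base)
    if isinstance(base[name], int):
        candidate[name] = rng.randint(low, high)
    else:
        candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(0)
candidate = vary_one("NUM_HIDDENS", rng)
```

Each candidate would then be trained and scored like any other setting, which makes it easy to see how sensitive the error is to each hyperparameter separately.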

With the sine problem producing satisfying results, I moved on to a classification problem. Pierre-Antoine Manzagol, another grad student in the lab, suggested I use the Iris dataset, which is a classic in the field.

This problem was less straightforward, since I needed to choose how to represent the output. I chose a 1-in-k scheme (e.g. (0,0,1)), with 3 outputs activated by a logistic function (and the corresponding gradient). After training, I simply picked the class with the maximum output as the winner.
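The encoding and decision rule can be sketched as follows; the function names and the activation values are made up for illustration, and this is not the code used for the experiments.

```python
import math


def one_hot(label, num_classes):
    """1-in-k target vector, e.g. label 2 of 3 classes -> [0.0, 0.0, 1.0]."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict_class(output_activations):
    """Apply the logistic function to each output and pick the largest."""
    outputs = [logistic(a) for a in output_activations]
    winner = max(range(len(outputs)), key=outputs.__getitem__)
    return winner, outputs


target = one_hot(2, 3)
# Made-up output activations for one Iris example.
winner, outputs = predict_class([0.8, 0.4, -1.0])
```

Note that each logistic output is computed independently, so nothing forces the three values to sum to 1: here the first two come out at roughly 0.69 and 0.60.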

Once again, performance varied a lot from one training run to the next depending on the hyperparameters, but here too I was able to reduce the error count using HyperparameterVariator. In the end the error count varied from 4 to ~25 on the training set of 150 data points.

Yet I hit a problem with confidence: too often the output for the next-best class was too close to that of the winner (say 0.7 and 0.6). I tried normalizing the inputs, but although it made confidence a bit better, it also increased my error count (I didn’t try reoptimizing the hyperparameters, though).

I asked Pierre-Antoine for some hints, and he suggested I use the softmax activation function, saying it’s very common to do this for classification problems. It makes a lot of sense since it involves the notion of maximum output in the training itself. That’s what I’ll be trying next.
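As a quick sketch of the idea (with made-up activation values): softmax exponentiates each output activation and divides by the sum, so the outputs compete with one another and always add up to 1.

```python
import math


def softmax(activations):
    """Normalized exponentials of the output activations."""
    shift = max(activations)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up output activations for a 3-class problem.
probs = softmax([0.8, 0.4, -1.0])
winner = probs.index(max(probs))
```

Because the outputs are coupled through the normalization, raising the score of one class necessarily lowers the others, which is the sense in which the maximum output takes part in training.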