Tests with a basic multilayer perceptron
I’ve recently begun an MSc in computer science at Université de Montréal. I’ll be working in the LISA lab, which focuses on machine learning. I’ll be producing weekly reports and using this blog as a conduit for them, as I did a few years ago for UPIR (see the first posts of the blog). Since these are reports, the audience I have in mind is people already acquainted with the machine learning concepts involved.
* * *
Over the summer I did some reading in the main reference book for our machine learning course (Pattern Recognition and Machine Learning, by Christopher Bishop), but I hadn’t actually implemented anything. Yoshua Bengio, my thesis supervisor, therefore suggested I run some experiments, starting with multilayer perceptrons (MLP) and backpropagation, a very common approach in machine learning.
I implemented the core of the algorithm described in chapter 5 of Bishop for a two-layer MLP (that is, one hidden layer). Everything was pretty straightforward, except perhaps for the handling of bias weights. I’ve been using Python, Numpy and Matplotlib, which are used in the lab and in courses here.
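For readers who want something concrete, here is a minimal sketch of such a network. It is not the code used for these experiments: the function names are made up for illustration, and it assumes tanh hidden units, a linear output, a sum-of-squares error and full-batch gradient descent. Bias weights are handled by appending a constant 1 to the inputs of each layer, which is one common way of doing it.

```python
import numpy as np

def add_bias(a):
    """Append a column of ones so each bias is just one more weight."""
    return np.hstack([a, np.ones((a.shape[0], 1))])

def init_weights(n_in, n_hidden, n_out, std, rng):
    """Draw initial weights from a zero-mean Gaussian (the last row holds biases)."""
    w1 = rng.normal(0.0, std, size=(n_in + 1, n_hidden))
    w2 = rng.normal(0.0, std, size=(n_hidden + 1, n_out))
    return w1, w2

def forward(x, w1, w2):
    """Return hidden activations z and network outputs y."""
    z = np.tanh(add_bias(x) @ w1)   # hidden layer
    y = add_bias(z) @ w2            # linear output (regression)
    return z, y

def sum_of_squares(x, t, w1, w2):
    """Error E = 1/2 * sum over points and outputs of (y - t)^2."""
    _, y = forward(x, w1, w2)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(x, t, w1, w2):
    """Gradients of the sum-of-squares error with respect to w1 and w2."""
    z, y = forward(x, w1, w2)
    delta_out = y - t                                      # output-layer deltas
    # Propagate back through w2 (skipping its bias row); tanh'(a) = 1 - z^2.
    delta_hid = (1.0 - z ** 2) * (delta_out @ w2[:-1].T)
    grad_w2 = add_bias(z).T @ delta_out
    grad_w1 = add_bias(x).T @ delta_hid
    return grad_w1, grad_w2

def train(x, t, w1, w2, learning_rate, steps):
    """Plain full-batch gradient descent (an assumption of this sketch)."""
    for _ in range(steps):
        grad_w1, grad_w2 = backprop(x, t, w1, w2)
        w1 = w1 - learning_rate * grad_w1
        w2 = w2 - learning_rate * grad_w2
    return w1, w2
```

A call would look like `w1, w2 = train(x, t, *init_weights(1, 10, 1, 0.5, rng), learning_rate=1e-3, steps=200)`, with `x` and `t` given as column arrays of shape `(n_points, 1)`.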
I first trained the network to reproduce a sine wave over about one period, which is a regression problem. The network therefore had one input and one output. As Yoshua had predicted, performance was pretty poor at first, even though I could see that it was working to some degree (the error went down with training). Several hyperparameters needed tweaking, notably:
- the number of hidden units,
- the learning rate,
- the number of training steps,
- the type of activation function to use for hidden units,
- the distribution of the initial random network weights.
At first I tweaked the hyperparameters by hand, but I quickly realized this would take eons, as performance for a given set of hyperparameters varied from one training run to the next. So I wrote a class named HyperparameterVariator which, in conjunction with a simple loop, generated new sets of random hyperparameters given “Hyperparameters parameters” objects (e.g. the NUM_HIDDENS hyperparameter can vary from 5 to 30, etc.). It ran for about 40 minutes; I made it generate an HTML report with graphs so I could see what the predictions looked like for a given set of hyperparameters.
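The idea can be sketched in a few lines. This is a hypothetical reconstruction, not the actual HyperparameterVariator class: the function names and the `RANGES` table are invented for illustration, and apart from the 5 to 30 range for NUM_HIDDENS every range below is made up. The `evaluate` callback stands for whatever trains one network and returns its error.

```python
import numpy as np

# Hypothetical ranges; only the NUM_HIDDENS range (5 to 30) comes from the post.
RANGES = {
    "NUM_HIDDENS": ("int", 5, 30),
    "LEARNING_RATE_NEG_EXPONENT": ("float", 1.0, 4.0),
    "TRAINING_STEPS": ("int", 50, 500),
    "WEIGHTS_ORIGINAL_DISTR_STD": ("float", 0.1, 1.0),
    "ACTIVATION_FUNCTION": ("choice", ["TANH", "LOGISTIC"]),
}

def sample_hyperparameters(ranges, rng):
    """Draw one random set of hyperparameters from the given ranges."""
    params = {}
    for name, spec in ranges.items():
        kind = spec[0]
        if kind == "int":
            params[name] = int(rng.integers(spec[1], spec[2] + 1))  # inclusive upper bound
        elif kind == "float":
            params[name] = float(rng.uniform(spec[1], spec[2]))
        else:
            params[name] = str(rng.choice(spec[1]))
    return params

def random_search(evaluate, ranges, num_sets, runs_per_set, rng):
    """Try num_sets random sets, averaging the error over several training runs each."""
    results = []
    for _ in range(num_sets):
        params = sample_hyperparameters(ranges, rng)
        errors = [evaluate(params, rng) for _ in range(runs_per_set)]
        results.append((float(np.mean(errors)), params))
    results.sort(key=lambda pair: pair[0])  # best (lowest mean error) first
    return results
```

Averaging over several runs per set matters here because, as noted above, the error for a fixed set of hyperparameters changes from one training run to the next.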
I ran the loop 30 times, each time trying a set of hyperparameters for 5 training runs. I was lucky: I found a set of hyperparameters which consistently produced very little error on the sine wave regression. Here’s the curve (non-probabilistic output) for the best set of weights generated, for example (blue is the output, green is the original wave):

Of course, for other hyperparameters, there were hundreds of curves which bore only a very dubious resemblance to the original sine wave! At least I had one set which gave good results:
- LEARNING_RATE_NEG_EXPONENT: 2.86065622478
- TRAINING_STEPS: 349.0
- NUM_HIDDENS: 22.0
- WEIGHTS_ORIGINAL_DISTR_STD: 0.9608633875
- ACTIVATION_FUNCTION: TANH
I therefore modified HyperparameterVariator to allow randomizing one hyperparameter at a time, with the others keeping the “safe” values found above. Yet even after running this a few times, the values above remained the best. This probably means I was lucky in that first totally random run!
The sine problem having produced some satisfying results, I moved on to a classification problem. Pierre-Antoine Manzagol, another grad student in the lab, suggested I use the Iris dataset, which is a classic in the field.
This problem was less straightforward, since I needed to choose how to represent the output. I chose a 1-in-k scheme (e.g. (0,0,1)), with 3 outputs activated by a logistic function (with the gradient adjusted accordingly). After training, I simply chose the largest output as the winning class.
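As an illustration of this output representation, here is a small sketch of the 1-in-k encoding and the "largest output wins" decision. The helper names are invented for this example, and class labels are assumed to be integers 0, 1, 2.

```python
import numpy as np

def one_hot(labels, num_classes):
    """1-in-k targets: row i is all zeros except a 1 at column labels[i]."""
    labels = np.asarray(labels)
    targets = np.zeros((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

def logistic(a):
    """Logistic sigmoid, applied independently to each output unit."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_class(outputs):
    """Pick the index of the largest output in each row."""
    return np.argmax(outputs, axis=1)

def count_errors(outputs, labels):
    """Number of points whose winning output is not the true class."""
    return int(np.sum(predict_class(outputs) != np.asarray(labels)))
```

Note that with independent logistic outputs nothing forces the three values to sum to 1, so they are not a proper probability distribution over the classes.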
Once again, performance varied a lot from one training run to the next depending on the hyperparameters, but here too I was able to reduce the error count using HyperparameterVariator. In the end the error count varied from 4 to about 25 on the training set of 150 data points.
Yet I hit a problem with confidence: too often the next-best choice of class was too close in probability to the best one (say 0.7 and 0.6). I tried normalizing the inputs, but although it made confidence a bit better, it also increased my error count (I didn’t try reoptimizing the hyperparameters, though).
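The post does not say which normalization was used. One common choice, assumed here purely for illustration, is to standardize each input feature to zero mean and unit variance. The sketch below also includes a hypothetical `confidence_margin` helper that measures the gap between the best and next-best outputs, which is the quantity at issue above.

```python
import numpy as np

def standardize(x, eps=1e-12):
    """Scale each column to zero mean and unit variance (assumed normalization)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Guard against constant features, whose standard deviation is zero.
    return (x - mean) / np.maximum(std, eps), mean, std

def confidence_margin(outputs):
    """Gap between the largest and second-largest output in each row."""
    ordered = np.sort(outputs, axis=1)
    return ordered[:, -1] - ordered[:, -2]
```

The returned `mean` and `std` should be reused to transform any new data, so that it is scaled the same way as the training set.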
I asked Pierre-Antoine for some hints, and he suggested I use the softmax activation function, saying it’s very common to do this for classification problems. It makes a lot of sense, since it brings the notion of maximum output into the training itself. That’s what I’ll be trying next.
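For reference, a minimal sketch of what a softmax output layer computes, written with the usual max-subtraction trick for numerical stability. The function names are chosen for this example. It assumes the standard pairing of softmax with a cross-entropy error on 1-in-k targets, in which case the output-layer delta used in backpropagation reduces to y - t.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax: positive outputs that sum to 1 across the classes."""
    shifted = a - a.max(axis=1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, t, eps=1e-12):
    """Cross-entropy error for 1-in-k targets t and softmax outputs y."""
    return -np.sum(t * np.log(y + eps))

def output_delta(y, t):
    """Output-layer delta for softmax combined with cross-entropy."""
    return y - t
```

Because the outputs are coupled through the normalization, raising one class probability necessarily lowers the others, unlike with independent logistic units.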
