Convolutional Neural Networks in PyLearn2 – Experiences with the Framework

Over the past few days I have been working on experiments in large-scale Visual Object Classification (VOC) for Content-Based Image Retrieval (CBIR). Given the widely reported high success rates of Convolutional Neural Networks (CNNs), I tried to recreate Alex Krizhevsky's Convolutional Neural Network for my case study.

In my work, I focus on performance impacts, considering the High-Performance Computing (HPC) challenge within the topic. Because of this, and because of the time constraints of my work, I decided to build my case study upon the CIFAR-10 and CIFAR-100 challenges. I therefore do not focus on high success rates, also because my research question differs from the topics of common Visual Object Classification challenges.

As a first hint, I may also add that the CIFAR-10 and CIFAR-100 datasets are relatively small (50,000 training samples). So although I talk about success rates later, I would like the reader to keep one thing in mind: Convolutional Neural Networks learn from the data, so a large pool of image samples is certainly beneficial for achieving high success rates.

As a starting point, I chose PyLearn2 as the implementation platform. Its framework design allows a quick and flexible implementation of a large variety of machine learning algorithms – from common regression tasks up to top-notch Deep Learning models. As the work focuses on HPC challenges, we come to the first caveat: installation.

1. Installing PyLearn2 on cluster environments

As you probably know, HPC and big-data servers do not commonly allow the installation of additional Python modules (such as Theano and PyLearn2) in their environment, due to the lack of root access. I ran into this problem earlier last year and started seriously compiling my own OS brew on the CentOS 6 cluster. I HIGHLY RECOMMEND AGAINST DOING THAT! What I did not know back then: there is a solution – virtualenv. The package allows a "use locally from source" setup. One just needs to pull the sources (I used version 12.1.0, which works on current CentOS 6 clusters), follow the instructions, and a local installation environment is ready. Once you initialize (activate) the environment, you can run pip/setuptools commands without superuser access to install numpy, scipy, Theano and PyLearn2. It still uses the performance libraries such as BLAS, LAPACK, ATLAS and Intel MKL, which are commonly available on the cluster.
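As a quick sanity check that the local installation really picked up the cluster's optimized libraries, a few lines of Python inside the activated virtualenv go a long way (a minimal sketch; the exact output depends on your cluster and your .theanorc):

# run inside the activated virtualenv to verify the numerical backends
import numpy
import theano

numpy.__config__.show()               # shows which BLAS/LAPACK numpy was linked against
print(theano.config.blas.ldflags)     # linker flags Theano uses for its BLAS calls
print(theano.config.device)           # 'cpu' or 'gpu', depending on your configuration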

Another, more recent possibility would be pylearn2-vagrant. Although this looks like a solution, I tried it out and found drawbacks. Apart from the fact that Vagrant and VirtualBox installations are necessary (and not available on clusters), the main drawback is that the virtual box image is a disc image – disc in the sense of "read-only". Hence, one can run networks, but as soon as the virtual box shuts down, all scripts created in the box are gone. I call that "suboptimal".

2. Prepare the Data

As my research focus requires custom datasets, I need to adapt the given datasets. This may be something other users want to do as well. It involves several steps.

a) Download the base dataset – use the download scripts provided by pylearn2 in the "/scripts/datasets" subfolder.

b) Sample and store your data – this part is up to you. I stayed with the one-batch, one-file layout of CIFAR-100, keeping the pickled dictionary keys 'data' and 'labels'. Also keep in mind to store your own brew of the test dataset, and consider providing a separate validation set. And, of course, if you want to make sense of the data later, store the label dictionary in the meta file. A minimal sketch of this layout follows below.
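For illustration, this is roughly how such a batch file can be written (a sketch only; the file name, sample count and label values are placeholders, not my actual preprocessing):

import cPickle
import numpy

n_samples = 1000                                                   # size of the subsampled training set
data = numpy.zeros((n_samples, 32 * 32 * 3), dtype=numpy.uint8)    # flattened 32x32 RGB images, one row per sample
labels = [0] * n_samples                                           # one integer class label per row

batch = {'data': data, 'labels': labels}
with open('my_train_batch.pkl', 'wb') as f:
    cPickle.dump(batch, f, protocol=cPickle.HIGHEST_PROTOCOL)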

c) Create your own dataset class – I copied the original layout of CIFAR-100, renamed the class, and left everything else pretty much as it is – AND RAN INTO A PROBLEM: CIFAR-100 does not work out of the box with CNNs in pylearn2. When the learning starts, the monitoring stage of the Stochastic Gradient Descent breaks with an error like: " ... expected IndexSpace to be 2D, got instead: <list of labels>". The problem here is that pylearn2 really expects the y-set to be 2D (although we usually use a 1D vector to describe the multiclass classification output). The original part in cifar100.py looks like this:

X = np.cast['float32'](X)
y = np.asarray(obj['fine_labels'])

My solution:

X = numpy.cast['float32'](X)
y = numpy.zeros((X.shape[0], 1), dtype=numpy.uint8)
y[0:X.shape[0], 0] = numpy.asarray(obj['labels']).astype(numpy.uint8)

This makes "y" appear to pylearn2 as a 2D dataset, and the classification runs as you expect it to. The trick is taken over from the CIFAR-10 dataset class.

d) [optional] Make a preprocessed, batched dataset – if your dataset is small (as in this case) and you want to actively enlarge the sample set, you could, for example, create 28x28x3 samples from the 32x32x3 input images. If you let the dataset patcher from pylearn2 do that, it will extract the 28x28 patches from the 32x32 fields at random positions – think of it as clipped transformations of the images 😉 (a plain-numpy sketch of the idea follows below). Sounds good so far, but that is where the good part ends for me. I did not succeed in creating this prepared dataset. Or, to put it in other words: I did not manage to create a working set. The creation of the pickled subset works just as it should (use "/scripts/datasets/make_cifar100_patches.py" as a base layout for adaptation), BUT I did not manage to successfully load it in the network itself. Again, I did not care about horrible success rates, as I measure score differences, but if you care about the score itself, you may not get around this step.
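To make the patching idea concrete, here is a plain-numpy version of random 28x28 cropping. It only illustrates the principle; it is not the pylearn2 preprocessor itself, and the function name is made up:

import numpy

def random_patches(images, patch_size=28, patches_per_image=4, rng=None):
    # images: array of shape (n, 32, 32, 3); returns (n * patches_per_image, 28, 28, 3)
    rng = rng if rng is not None else numpy.random.RandomState(42)
    n, height, width, channels = images.shape
    out = numpy.empty((n * patches_per_image, patch_size, patch_size, channels),
                      dtype=images.dtype)
    k = 0
    for img in images:
        for _ in range(patches_per_image):
            # pick a random top-left corner and cut out the window
            top = rng.randint(0, height - patch_size + 1)
            left = rng.randint(0, width - patch_size + 1)
            out[k] = img[top:top + patch_size, left:left + patch_size, :]
            k += 1
    return out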

Well, after c) [or d)], everything is in place to use the dataset. Be happy and have a cup of tea to celebrate when you reach this stage.

3. The experiment itself – yaml it is!

pylearn2 comes with a (for a computer scientist) very easy way to describe the network: yaml files. From where I'm standing, this is one of the biggest plus points of pylearn2 – accessible use in comparison to other CNN libraries. In addition to the yaml file, one can take inspiration from the convolutional_network tutorial folder and create a support Python script that loads the yaml file and fills in some parameters (a sketch of such a script follows below). I hereby append my yaml file. As I said, it resembles an AlexNet. The parameters are not the same, because higher numbers were not manageable to crunch in time – and also because of ...
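A sketch of the kind of support script I mean, modelled on the convolutional_network tutorial; the file name 'conv_cifar.yaml' and the placeholder names are made up for illustration and have to match the %(...)s placeholders in your own yaml file:

from pylearn2.config import yaml_parse

with open('conv_cifar.yaml', 'r') as f:
    yaml_template = f.read()

# fill the %(...)s placeholders in the yaml with concrete hyperparameters
hyper_params = {'batch_size': 100,
                'output_channels_h0': 32,
                'max_epochs': 50,
                'save_path': '.'}

train = yaml_parse.load(yaml_template % hyper_params)
train.main_loop()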

4. Small datasets, dataset scores, massive overfitting, and parameter gamble …

So, the score. Well, I subsampled the small CIFAR-10, because that was my objective. My base is hence 1,000 sample images; CIFAR comes with 50,000 at most. That is not an awful lot, particularly not for an AlexNet, which runs on the ILSVRC challenge and ImageNet with over a million training samples. A little math background: the AlexNet defines something around 60,000,000 free parameters (don't be picky with me about the exact number), and we have 1,000 (out of 50,000) input samples of size 32x32x3. We call this "massive overfitting", meaning that we have too little data for too many parameters to learn. Therefore, in the CIFAR case, it also makes no sense (I think) to replicate the full AlexNet, because even my reduced version is massively over-fit.
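For the sake of argument, the back-of-the-envelope numbers look roughly like this (the parameter count is an order-of-magnitude figure, not an exact one):

n_params = 60e6                        # rough free-parameter count of the full AlexNet
n_samples = 1000                       # my subsampled training set
n_values = n_samples * 32 * 32 * 3     # raw input values available for learning

print(n_values)                        # about 3 million input values ...
print(n_params / n_samples)            # ... versus roughly 60,000 parameters per training image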

My numbers are quite horrible: misclassification at around 89%. I know this is bad – again, look at the numbers above and it is easy to see why. And again, I did not fully care, because my objective is a different one.

Ideas how to actually get good numbers? A) Fight overfitting with many-patch sampling, windowing and mirroring – experiment with the preprocessing. Always keep in mind: it is "learning from the data", so the more data one gives the model, the more it will pay off. Another point, B): study the parameters of your model and your network in depth. Understand which parameters have what effect, try out different cost modifiers (such as a good dropout), and if you use plain CNNs, learn about Gradient Descent. Gradient Descent gives heavily different results depending on weight initialisation, learning rate and momentum (which carries a fraction of the previous weight update over into the current one). Again, this post is no course on Machine Learning (go to coursera and pick one, if necessary), but these parameters can easily make a 30-40% score difference. Nevertheless, it will not help as much as A), fighting overfitting ... A sketch of the knobs I mean follows below.
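In Python rather than yaml, just to name the knobs; the values are illustrative, not my final configuration, and the exact class arguments depend on your pylearn2 version:

from pylearn2.training_algorithms.sgd import SGD
from pylearn2.training_algorithms.learning_rule import Momentum
from pylearn2.costs.mlp.dropout import Dropout
from pylearn2.termination_criteria import EpochCounter

algorithm = SGD(
    learning_rate=0.01,                            # too high and training diverges, too low and it crawls
    batch_size=100,
    learning_rule=Momentum(init_momentum=0.5),     # fraction of the previous update carried over
    cost=Dropout(default_input_include_prob=0.8),  # dropout as a cost modifier against overfitting
    termination_criterion=EpochCounter(max_epochs=50),
)
# the same block can equally well be written directly in the yaml file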

I hope this helps as a small guideline. A general remark: when adapting your classes and debugging, the network usually fails right at the beginning, with some errors in the iterator. Follow the traceback to see which class entered the iterator. From my experience, most of the adaptation happens when setting up the monitoring of the SGD (stochastic gradient descent), which is often where the error originates. A failure in the monitoring is also a good indicator that something in the dataset class setup is making a mess.
