Two important strands of our product development strategy are to use the best simulators to develop and test algorithms and to understand what the competition has to offer. Recently, Michael Bate, Artificial Learning’s Chief Mathematician, combined both by attending an excellent 2-day workshop on practical deep learning offered by Nvidia in partnership with Persontyle. Here is his light-hearted report…
Random Thoughts on Deep Learning
One way to see many of the historic sights of London is from the top of an open-air bus. The cost of around £20 excludes an umbrella. Alternatively, catch a number 11 bus from Chelsea to Bank and sit upstairs in the dry for about £2. Do it in the rush hour and you can admire the view – actually mostly building sites – for ninety minutes. But if you just want to get to the workshop on time, take the Tube.
I’ve missed breakfast, but I’m ready to start Day 1 of the Fundamentals of Practical Deep Learning, a two-day workshop sponsored by Nvidia. Our lecturer, Tapani Raiko, has a mission: to make the singularity a reality. The singularity in question is the event when machines recursively self-improve, engendering a run-away effect – in other words the creation of superintelligence. To prepare for this event, which is not welcomed by everyone, one can study singularitarianism at the Singularity University. Let’s see if, after two days, Tapani brings the moment closer.
By name-dropping Hinton, Bengio and LeCun – the three giants of Deep Learning – Tapani scores a hat-trick in the first five minutes, as without them we wouldn’t be here! And without seeing Hinton’s Google TechTalk in 2009 I might not be here. Basing his talks on their 2015 paper in Nature is a great start. You can catch up with all three of them on the Talking Machines podcast.
If you don’t know much about your audience, presenting overviews can be awkward but, despite its technical nature, presenting an overview of deep learning has two advantages. First, its history is short – less than a decade. Second, it can be illustrated by pictures, diagrams, relatively few equations and an abundance of everyday activities such as object recognition and natural language processing. It is no coincidence that deep learning has succeeded at tasks that come easily to humans. Neural networks are to minds what aeroplanes are to birds: one inspires the other, one shows the other what is possible, and both are empirical sciences. Aerodynamics has been researched for over one hundred years. We are at a stage with deep learning comparable to the earliest attempts at flight.
No free lunch?
I’m hungry. Food is very good, but conversation a bit awkward. IT folk are introverted types by reputation. In answer to the psychometric question, “On Saturday evening do you party or read a book?” many, including me, prefer the latter. But look at it another way: sometimes at the cost of knowing something about everything, we know a lot about something, so asking someone their interest in deep learning should have been an ice-breaker. It wasn’t. A start-up employee is understandably reluctant to engage in free exchange when I innocently enquire who his clients are. Sorry. Fortunately, there is no such reticence among academics discussing, coincidentally, the no free lunch theorem. Briefly, this states that there is no universally good learning algorithm; no one size fits all. This may be unwelcome to singularitarians, if true, but it is good news for deep learning consultants.
Doctor, doctor I’m confounded
This afternoon is about model selection. I didn’t know that all those methods could be grouped under the heading of regularisation, but if regularisation is the art of reducing overfitting, then that is where they belong. But then the list omits the most important regulariser of all: the data.
All models are wrong, some are useful, as George Box said in 1978. Do all learned models invariably suffer high bias (underfit) or high variance (overfit), or both? Take a simple example: do I have hypertension (high blood pressure)? Regular readings on a home machine averaged 140/90, but with high variance, even within a period of ten minutes. After taking his own readings three times, the doctor declared my BP to be normal at 127/85. Relief gave way to puzzlement: why the discrepancy? Maybe one or both machines are faulty; maybe I experienced white-coat syndrome; maybe it depends on the time of day. What is my true BP? Certainly my model is underfitted: it needs more parameters, it needs more data, including from other BP machines. It is also overfitted, since the doctor records only the most favourable one of several readings within the space of five minutes. To confuse matters further, when asked why machines might differ in their readings, he informed me that some cheap machines don’t correctly measure the BP of people with a deep pulse. When I asked if I have a deep pulse, given the discrepancy in the readings, he said probably. So, assuming the doctor and BP machine are both correct, I can classify BP machines by reading off my own blood pressure. More probably George Box is right.
CPU v GPU
Alison is going to tell us about the benefits of parallel computation. The seminal paper on the philosophy of connectionism, Distributed Representations, provides a key to the significance of massive parallelism in implementing best-fit searches – just what is needed for efficient deep learning. The closer the hardware physically resembles the network architecture and its asynchronous parallel processing, the faster the throughput. Nvidia, the world’s leading supplier of general-purpose graphics processing units, is positioning itself to dominate the market for deep learning applications with its Maxwell chip architecture and library of primitives, cuDNN. Since I’m leaning towards Theano as my platform of choice, it’s interesting to read this comment from Bengio:
“What’s great about cuDNN is that it achieves cutting-edge performance while working easily within Theano’s memory management architecture.”
In fact, so appealing is the Nvidia-cuDNN-Python-Theano solution, I’m even prepared to rewrite my MATLAB algorithms in Python. Furthermore, much of the code I need will soon be written up in a forthcoming textbook of deep learning from Bengio et al.
The workshop on Caffe is not going well. Most people are faster than I am at following instructions, and time pressure makes it worse. Still, I get the general idea, and will complete the task this evening.
On the way to work on day 2, by Tube, I read that Marvin Minsky died on 24 January 2016. MIT News describes him as “the father of AI”; some will link him with the XOR affair, and some even blame him, in part, for the AI winter. This is unfair. All he did, with Seymour Papert, was provide a formal proof that a single-layer perceptron cannot model the XOR function. Convincing oneself of this informally is not hard, nor is it hard to show that a two-layer perceptron can learn to model XOR.
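For the curious, here is a minimal sketch of that informal argument in plain Python. The weights in `two_layer_xor` are set by hand rather than learned, and the grid search over a single threshold unit is only an illustration, not a proof; all the function and variable names are my own.

```python
from itertools import product


def step(z):
    """Threshold activation: fire (1) only when the weighted sum is positive."""
    return 1 if z > 0 else 0


def single_layer(x1, x2, w1, w2, b):
    """A single threshold unit acting directly on the two inputs."""
    return step(w1 * x1 + w2 * x2 + b)


def two_layer_xor(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit; hand-set weights."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # OR but not AND, which is XOR


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]

two_layer_outputs = [two_layer_xor(a, b) for a, b in INPUTS]

# Try every weight and bias from -3.0 to 3.0 in steps of 0.5 for the single unit.
grid = [i / 2 for i in range(-6, 7)]
single_layer_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if [single_layer(a, c, w1, w2, b) for a, c in INPUTS] == XOR
]

print("two-layer outputs:", two_layer_outputs)
print("single-layer solutions found on the grid:", len(single_layer_solutions))
```

The search comes back empty because XOR is not linearly separable: no single straight line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.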
Minsky is described as a cognitive scientist specialising in AI; in other words, he studied the mind as if it were a data processing machine which, given his atheism and his faith in scientific reductionism, he believed it to be. I agree. The speed with which Newell and Simon’s physical symbol system hypothesis has been abandoned and replaced by connectionism owes much to Minsky’s research. Perhaps he is another giant without whom we would not be here learning about deep learning.
Mean cats and dogs
“What,” asks Tapani, “does the average of all these cats, on the left, and dogs, on the right, look like?” Imagine the trains of thought cascading through the layers of the neural network that is our mind during Tapani’s exquisite pause. Does it have eyes, or ears? Would it be ugly or beautiful? Would it have a face at all? My train of thought starts with a scan of cartoon characters from my childhood, before eventually settling on a Yoda-like creature. I can’t help noticing that some images on either side resemble Donald Trump, but that could be bias. The answer is much less startling – just a grey blurry blob; still, I got the colour right with Yoda. This recalls a theory of Hinton’s expounded during his 2012 online course in Neural Networks for Machine Learning that, despite their current success, Convolutional Neural Networks (CNNs) are doomed. In particular they will never be good at face recognition. Why? This is because pooling in CNNs averages input over a local area, which helps to generalise shapes and textures but loses the precise spatial relationships between higher level features needed for face recognition such as eyes and mouths. More generally, CNNs are poor at extrapolating: after seeing a new shape, they cannot recognise it from a different viewpoint. To overcome this, they must be trained on many different orientations, leading to huge training sets.
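The pooling point can be made concrete with a toy example. The sketch below is my own illustration, not anything shown in the workshop or in Hinton’s course: it applies 2×2 average pooling to two tiny “images” whose features sit in different positions inside each pooling window. The function and variable names are hypothetical.

```python
def average_pool_2x2(image):
    """Average each non-overlapping 2x2 block of a grid with even dimensions."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Two different arrangements of the same two "features" (the 1s).
feature_original = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
feature_shifted = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]

pooled_original = average_pool_2x2(feature_original)
pooled_shifted = average_pool_2x2(feature_shifted)

print(pooled_original)
print(pooled_shifted)
```

Both inputs pool to the same 2×2 grid, so the layer above can no longer tell exactly where each feature was. That tolerance is what helps with generalisation, and it is also the loss of spatial precision Hinton objects to.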
Just a minute!
Now it’s Pyry’s turn and he’s talking about Recurrent Neural Networks (RNNs). He invites us to talk to our neighbour for one minute on, for instance, algorithms for training RNNs. Although my neighbour and I both know a bit about the subject, Paul Merton could have made a better fist of it than either of us. The answer seems to be that only the backpropagation algorithm is used to train an RNN, though even this suffers from saturation. But all is not lost: Pyry has some useful tricks of the trade.
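As I understand the saturation problem, it can be seen in a toy scalar RNN, h_t = tanh(w · h_{t-1}), with no inputs at all. The sketch below is my own, under that simplifying assumption, and is not one of Pyry’s examples: it multiplies together the per-step derivatives that backpropagation through time would chain. The function name and the values of `w` and `h0` are arbitrary choices for illustration.

```python
import math


def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)**2)
        grad *= w * (1.0 - h * h)
    return grad


# With a large recurrent weight the unit is driven into the flat part of tanh.
saturated_5 = gradient_through_time(w=3.0, h0=0.5, steps=5)
saturated_20 = gradient_through_time(w=3.0, h0=0.5, steps=20)

print("gradient after 5 steps: ", saturated_5)
print("gradient after 20 steps:", saturated_20)
```

Once the unit saturates, each step multiplies the gradient by a small factor, so the error signal reaching early time steps shrinks towards zero as the sequence gets longer.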
It’s a pity we aren’t hearing more about Restricted Boltzmann Machines (RBMs) and their role in deep unsupervised learning. To be fair, if they are not yet practical then there is no point in including them here, even though they are beautiful machines: they have an aesthetically pleasing symmetry and an efficient learning algorithm (contrastive divergence), and they can approximate any finite distribution. I think of them as the Lego bricks of deep learning and one day you’ll be able to buy a box of them from Maplin.
More workshop. Yesterday was a trip into the AWS cloud. Unfortunately, I neglected to log out last night and so my two hours were up before I could complete the Caffe workshop. Rather than trying to beat the clock, I hack on benignly.
It’s all over and we’re off to the pub for a complimentary drink. I find myself at the top table, talking to the presenters and organisers. They are curious to know my background and why I’m here asking awkward questions. As co-founder of Artificial Learning, developing hardware implementations of deep learning algorithms (Machine Learning on a Chip), it might seem we are in competition with Nvidia. Perhaps we are on the same course to a deeply learned future but a different tack. Nvidia’s chip designs are digital and exact while ours are analogue and stochastic and for all that, we argue, fundamentally more efficient. My objective in the workshop is to identify the best high-performance platform for simulating learning algorithms. If I win the raffle for the Titan X that would be nice.
After these two days is the singularity any closer? Perhaps, if there is another Marvin Minsky in the audience. Well, it will come when it comes, and the later it comes, the better prepared we will be. So perhaps I don’t share Tapani’s mission, but I thank him, Alison and Pyry for their enjoyable set of talks. Did I achieve my objectives? Yes. With the added bonus of a trip around London from the top of a number 11 bus.