RECOGNIZING DRAWINGS: DEEP LEARNING VERSUS BPS
In this video, we compare Gamalon’s new Bayesian Program Synthesis (BPS) technology with state-of-the-art deep learning while playing Pictionary: we draw something, and the system must guess what we drew.
We show that the Gamalon BPS system learns from only a few examples, not millions. It can learn using a tablet processor, not hundreds of servers. It learns right away while we play with it, not over weeks or months. And it learns from just one person, not from thousands. Someday soon you might even have your own private machine intelligence running on your mobile device!
You have pictures in your imagination, but it is difficult to show your imagination to other people. We imagined this app during one of our Gamalon company hackathons, to make it easier to show other people what you meant to draw, instead of what you did draw.
A collaborative drawing system of this kind would quickly learn from all of the people using it, and rapidly become surprisingly helpful. It could offer autocomplete suggestions for your sketches, help fill in details or surrounding context, or clean up and enhance your drawings. If you are designing a building or a machine, it could do a hierarchical 3-D parts search and find similar parts to fit your needs. If you are creating a business document, full-featured bar charts and pie charts could instantly pop into existence just by sketching them. With its knowledge of how parts work together, the system could even add the laws of physics to this sketching world, so that anything you draw instantly becomes animated and interactive.
Unlike deep learning, which learns by adjusting millions of numerical parameters, the BPS system learns by (re)writing human-readable code, so we can examine and edit the new concepts that it learns. If one person taught the system something we don’t want it to know, we can simply remove the code that we don’t like.
Going beyond this drawing application, we are starting to teach the system to read, first by building up letters, then words, and then sentences. Language is a much more complex setting, but as with drawing, we expect that the system will learn more and more complex concepts made out of simpler ones. Who knows where it can take us?
BPS USES FAR FEWER TRAINING EXAMPLES THAN DEEP LEARNING
Recognizing abbreviations is a problem that comes up a lot in enterprise data, and it is essentially a machine translation task, similar to translating from, say, English to French. In this task, the system sees an abbreviation like “MA” and then must guess “Massachusetts”. We show that compared to state-of-the-art deep learning, Gamalon’s Bayesian Program Synthesis (BPS) requires vastly fewer training examples.
In the figure above, the horizontal axis is the number of examples that we provide during training. One training example gives the machine the right answer for one abbreviation, e.g. “Ave.” should be “Avenue”.
After both systems have seen a number of training examples, we let them run their learning algorithms for as long as they need to (e.g. TensorFlow ran for 40 minutes on a single core, Gamalon BPS ran for a hundred seconds), and then we give both systems a pop quiz. We provide 100 new abbreviations that the systems have never seen before, e.g. “MIT”, and they must try to guess the correct long-form phrase, i.e. “Massachusetts Institute of Technology”. The vertical axis in the figure above is the percentage that they get right on this quiz. Machine learning experts call this the “hold-out predictive accuracy.”
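The pop quiz described above can be sketched in a few lines of Python. This is only an illustration of how hold-out predictive accuracy is scored, not the code behind the figure: the names `toy_predict`, `VOCABULARY`, and `QUIZ` are hypothetical, and the "system" here is a trivial initials-matching rule standing in for either BPS or the seq2seq network.

```python
def initials(phrase):
    """Capitalized first letters of a phrase, e.g. "Boston University" -> "BU"."""
    return "".join(word[0] for word in phrase.split() if word[0].isupper())

# Hypothetical list of long forms the toy system may guess from.
VOCABULARY = [
    "Massachusetts",
    "Avenue",
    "Boston University",
    "Massachusetts Institute of Technology",
]

def toy_predict(abbreviation):
    """Stand-in predictor: return the first phrase whose initials match."""
    for phrase in VOCABULARY:
        if initials(phrase) == abbreviation:
            return phrase
    return ""  # no guess

def holdout_accuracy(predict, quiz):
    """Percentage of quiz abbreviations expanded exactly right."""
    correct = sum(1 for abbreviation, expected in quiz
                  if predict(abbreviation) == expected)
    return 100.0 * correct / len(quiz)

# Hypothetical quiz of (abbreviation, correct long form) pairs.
QUIZ = [
    ("MIT", "Massachusetts Institute of Technology"),
    ("BU", "Boston University"),
    ("Ave.", "Avenue"),
    ("MA", "Massachusetts"),
]

print(holdout_accuracy(toy_predict, QUIZ))  # 50.0
```

The toy rule gets “MIT” and “BU” right but misses “Ave.” and “MA”, so it scores 50%. Any real system would be scored the same way: by passing its own prediction function in place of `toy_predict`.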
We tested two systems. Conferring with some friends at Google DeepMind, we designed a recurrent neural network (RNN), specifically a sequence-to-sequence long short-term memory (seq2seq LSTM) network, using Google TensorFlow. We compared this against Gamalon’s Bayesian Program Synthesis (BPS) system. We see that after just a handful of training examples, the BPS system is already as accurate as the deep learning system will be after it has seen 500 training examples. After 500 training examples, the BPS system is well over 90% accuracy, while the deep learning system is still only as accurate as the BPS system was after just a handful of examples.
Let’s extend the horizontal axis out to 6,000 training examples. We see that eventually, after about 2,000 training examples, the deep learning system has seen enough training examples to catch up to the accuracy of the BPS system.
But, if we make the problem more difficult by moving to abbreviations of 2-word phrases, e.g. “BU” from “Boston University”, then the deep learning system now needs 6,000 training examples to catch up to the accuracy of the BPS system.
As we continue to increase the length of the phrases that we are translating (horizontal axis in the figure above), the deep learning system requires a fast-growing quantity of training data to achieve 99% accuracy. By contrast, in this example, the labeled training data needs of the BPS system grow nearly linearly to achieve the same 99% accuracy.
This comparison is not, in fact, specific to this machine translation task. For any kind of machine learning problem, as the problem gets more difficult, Gamalon’s BPS technology will require a great deal less training data than deep learning. The underlying explanation comes from a property that we call model capacity. Deep learning starts with a very wide-capacity model and narrows it through training, by tuning the parameters. In the end, it still has a large number of (tuned) parameters, the neural weights. By contrast, BPS starts with much narrower-capacity model subcomponents, and uses these to build up a model. In the end, Bayes Occam’s Razor will prefer the BPS model that fits more accurately with less training data and less training computation.
DEEP LEARNING IS PTOLEMY. BPS IS COPERNICUS.
Aristotle created one of the earliest models of the solar system, with the Earth (blue circle) at the center of the solar system, and with Mars (red circle) and the Sun (yellow circle) orbiting around the Earth on circular paths. The predictions from this model (black line in top cells of the video) do not fit the actual observed motion of Mars in the night sky very well. If we plot Mars’ location (declination) in the night sky versus time (red line in top cells of the video), Mars does not cycle up and down in a smooth wave, but exhibits “retrograde motion”; basically it wiggles.
Ptolemy refined this Earth-centered model by adding epicycles: smaller circles whose centers ride along the larger circular paths. Just like deep learning, given enough training data and enough epicycles, Ptolemy’s model will fit the observed motion and also predict the future motion of Mars just as accurately as our modern model of the solar system. But even though it can predict the data with perfect accuracy, you still would not teach Ptolemy’s model to your children… Why not?
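To see why “enough epicycles” can fit anything, note that a stack of circles turning on circles is mathematically a Fourier series. The Python sketch below is an illustration under stated assumptions: the path is a hypothetical square traced in the complex plane (not Mars data), and `square_path`, `dft`, `reconstruct`, and `max_error` are names made up for this example.

```python
import cmath
import math

def square_path(n):
    """n points evenly spaced around a unit square centered on the origin
    (a hypothetical, deliberately un-circular path)."""
    corners = [-0.5 - 0.5j, 0.5 - 0.5j, 0.5 + 0.5j, -0.5 + 0.5j, -0.5 - 0.5j]
    points = []
    for i in range(n):
        s = 4.0 * i / n                 # position along the perimeter, 0..4
        side, u = int(s), s - int(s)    # which side, and how far along it
        points.append(corners[side] + u * (corners[side + 1] - corners[side]))
    return points

def dft(samples):
    """Discrete Fourier transform. Coefficient k is one circle (epicycle)
    of radius abs(c_k) that turns k times per period."""
    n = len(samples)
    return [sum(z * cmath.exp(-2j * math.pi * k * t / n)
                for t, z in enumerate(samples)) / n
            for k in range(n)]

def reconstruct(coeffs, keep):
    """Rebuild the path using only the `keep` largest circles."""
    n = len(coeffs)
    largest = sorted(range(n), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in largest)
            for t in range(n)]

def max_error(path, approx):
    """Largest distance between a true point and its approximation."""
    return max(abs(a - b) for a, b in zip(path, approx))

path = square_path(64)
coeffs = dft(path)
for keep in (1, 8, 64):
    print(keep, max_error(path, reconstruct(coeffs, keep)))
```

With one circle the square is approximated by a plain circle; with all 64 circles the sampled path is reproduced to within floating-point rounding. The fit becomes perfect, but the 64 radii and phases say nothing about why the path is a square.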
If our modern theory of the solar system with the Sun at the center isn’t actually more accurate for predicting the angle that Mars makes in the night sky, what makes it “right”? One answer is called Bayes Occam’s Razor. The models of Aristotle, Copernicus, and Kepler are falsifiable scientific theories: they are either wrong or right, and they are “narrow” in model capacity (expect more on a formal definition of model capacity in future posts). By contrast, Ptolemy’s model, like deep learning, is so flexible that it is neither wrong nor right; it can explain anything as long as we adapt its huge number of free parameters to make it fit the data.
In science, we always prefer simpler falsifiable models over more complex models that can be neither confirmed nor denied. For example, ghosts and miracles can explain just about any observed phenomenon, but as scientists we prefer narrower explanations. In machine learning we should do the same: we and our machines should be scientists in this way.
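Bayes Occam’s Razor can be made concrete with a textbook toy problem that has nothing to do with planets or with the BPS system itself. Assume we observe a sequence of coin flips and compare two models: a narrow one with no free parameters (“the coin is fair”) and a wide one with one free parameter (“the coin has some unknown bias”, with a uniform prior over that bias). The function names below are hypothetical; the formulas are the standard marginal likelihoods for this setup.

```python
from math import factorial

def evidence_narrow(flips):
    """Probability of one specific sequence of flips under a fair coin.
    The model has no free parameters, so there is nothing to average over."""
    return 0.5 ** flips

def evidence_wide(heads, flips):
    """Probability of the same sequence when the bias is unknown,
    averaged over a uniform prior on [0, 1]:
    integral of p^heads * (1 - p)^tails dp = heads! * tails! / (flips + 1)!"""
    tails = flips - heads
    return factorial(heads) * factorial(tails) / factorial(flips + 1)

def bayes_factor(heads, flips):
    """Evidence ratio, narrow over wide. Above 1 favors the narrow model."""
    return evidence_narrow(flips) / evidence_wide(heads, flips)

print(bayes_factor(5, 10))  # data a fair coin predicts well
print(bayes_factor(9, 10))  # data a fair coin predicts badly
```

With 5 heads in 10 flips the ratio is about 2.7 in favor of the narrow model, even though the wide model can fit those flips equally well by setting its bias to 0.5. The wide model pays for spreading its predictions over every possible bias. With 9 heads in 10 flips the ratio drops to about 0.11: the narrow model made a sharp prediction, the data contradicted it, and it is rejected. That is what it means for a narrow model to be falsifiable.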
With deep learning, we are effectively adopting a Ptolemaic approach to machine learning. Wide-capacity ML such as deep learning, when deployed in real-world applications, can be dangerous and biased, because there is nothing to prevent it from learning things that are not true.
By contrast, in Bayesian Program Synthesis (BPS), the goal is to automate the scientific method, by rapidly iterating across falsifiable models, and actively seeking narrow model capacity while we also fit the data. Not only does this allow BPS to learn from far less training data, use far less computation, be highly personalized, and interact with people in real-time, it also provides a foundation for developing more ethical AI.
TEACHING ML/AI WITH BOTH DATA AND RULES
The world of machine intelligence is full of different approaches, contradictory ideas, and trendy systems. Let’s take a short walk through some of the big ideas to show you where Bayesian Program Synthesis (BPS) fits in.
Deep learning, neural networks, and regression are all examples of machine learning systems. They learn from labeled training examples. Essentially, a human points at, say, a picture of a chair, and says “chair.” Then the human must repeat this at least 10,000 times with 10,000 different chairs in order to teach the machine what a chair looks like. And then the human must repeat the process for kittens, marmalade, binoculars, and everything else in existence.
There is no way to improve on this process by explaining things to a machine learning system. “No actually that is a stool… and not a chair, because it doesn’t have a back” just won’t work. Machine learning systems cannot accept any kind of instructions to help them learn new concepts – they only learn from data. If a deep learning system were to drive your self-driving car off the road, there would be no way to find out why it did that, nor to give it instructions to avoid similar mistakes in the future. You just have to train it on more data and then hope that it doesn’t drive your car off the road again.
At the other extreme end of the spectrum of approaches for creating thinking machines are rules-based systems, AI expert systems, and Bayesian networks. Humans program in rules (often called a “model”), so that there is relatively little need for training data. Such model-based systems are not able to learn a new rule or a new idea on their own in the way that a machine learning system can. Instead you have to explicitly teach them everything they know. That’s a bit boring. It also makes these systems expensive to build and maintain, because every new rule needs to be explicitly programmed in by a programmer or domain expert.
AI expert systems usually require 10,000 or 100,000 or a million rules! “Some chairs have 4 legs. A rare chair has 5 legs. A chair that has no back is a chair and not a stool if it started out as a chair and someone broke the back off by accident. If they did that on purpose, maybe it is a stool now. Occasionally a chair has a back that is fully reclined, but it is still a chair and not a bed…” You can see that it gets complicated fast.
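Here is roughly what a few of those chair rules look like once a programmer has to write them down. This is a made-up Python sketch, not code from any real expert system: the `Furniture` fields and the `classify` function are hypothetical, and only three of the rules quoted above are encoded.

```python
from dataclasses import dataclass

@dataclass
class Furniture:
    """Hypothetical description of one piece of furniture."""
    has_back: bool
    back_broken_off_by_accident: bool = False
    fully_reclined: bool = False

def classify(item):
    """Apply the hand-written rules in order; the first match wins."""
    if item.has_back and item.fully_reclined:
        return "chair"   # reclined all the way, but still not a bed
    if item.has_back:
        return "chair"
    if item.back_broken_off_by_accident:
        return "chair"   # it started out as a chair
    return "stool"       # no back and no special history

print(classify(Furniture(has_back=False)))
print(classify(Furniture(has_back=False, back_broken_off_by_accident=True)))
```

Every new exception (a back removed on purpose, a five-legged chair, a bench) means another field and another branch that someone has to write and maintain by hand.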
Rules-based AI systems and data-based machine learning both have disadvantages. Gamalon has created a new approach that we call Bayesian Program Synthesis (BPS). BPS learns generative Bayesian models from data.
We look forward to telling you more about it!
TEDx BOSTON: WHEN MACHINES HAVE IDEAS
Our CEO, Ben Vigoda, gave a talk at TEDx Boston 2016 called “When Machines Have Ideas” that describes why building “stories” (i.e. Bayesian generative models) into machine intelligence systems can be very powerful.
TALKING MACHINES INTERVIEW WITH BEN VIGODA
Listen to Katherine Gorman interview our CEO, Ben Vigoda, on Talking Machines.