When we look at a physical scene, such as a photo of a workshop, we can easily extract all kinds of information about the objects it depicts. We can guess that the tools on a wall hang from hooks, that one of the big wrenches probably weighs more than a roll of duct tape, or that items on the edge of the workbench are more likely to fall off than tools further back.

While we have developed powerful technologies that employ some elements of artificial intelligence, none are intelligent enough to interpret these visual scenes as well as people do. But Joshua Tenenbaum, Professor of Computational Cognitive Science, aims to develop technology that can.

At the Dean’s Breakfast on April 21, 2016, Tenenbaum explained how he uses recent insights from developmental psychology about the way we learn to look at and interpret visual scenes to engineer those capabilities in a machine.

In our early infancy, we begin to develop powerful models for how the world works called our “common sense core,” which makes use of our past experience and observations to respond to our environment, make predictions about what will happen next, and plan for the future. Even our smartest robots lack these capacities, and can only accomplish the tasks they have been programmed for. They can’t adapt to novel situations or unexpected occurrences. For example, the assembly line robot Baxter is designed to be trained quickly by human workers to do rote tasks – which is impressive – but if Baxter drops an object, it’s stumped. In contrast, a toddler playing with blocks fixes mistakes as a matter of course, finding and picking up dropped blocks and rebuilding fallen towers, perhaps in new, more stable ways.

Tenenbaum aims to build models that work like our own common sense core by using the computational tools of probabilistic programs, which build on Bayesian networks. A Bayesian network is a probabilistic graphical model that can serve as a general-purpose language for representing the structure of the world, and as a basis for making inferences and decisions under uncertainty. However, the rules governing real-life situations are often too complex to be represented by a simple graph or an equation. Instead, Tenenbaum develops programs that make predictions based on a set of rules for how the world works while taking uncertainty into account.
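To make the idea of a Bayesian network concrete, here is a minimal Python sketch built on the workshop scene from the opening. It is an illustration only, not Tenenbaum's model: the three variables, the graph structure, and every probability are hypothetical numbers chosen for the example. The network says that a tool may sit near the edge of the bench, that the bench may get bumped, and that both affect whether the tool falls. Given that a tool fell, the code infers how likely it is that the tool was near the edge.

```python
from itertools import product

# Hypothetical prior probabilities (illustrative numbers, not from the talk).
P_NEAR_EDGE = 0.2  # the tool sits near the edge of the bench
P_BUMPED = 0.1     # someone bumps the bench

# Hypothetical conditional table: P(falls | near_edge, bumped).
P_FALLS = {
    (True, True): 0.9,
    (True, False): 0.3,
    (False, True): 0.2,
    (False, False): 0.01,
}


def joint(near_edge, bumped, falls):
    """Joint probability of one full assignment, factored along the graph."""
    p = P_NEAR_EDGE if near_edge else 1 - P_NEAR_EDGE
    p *= P_BUMPED if bumped else 1 - P_BUMPED
    p_fall = P_FALLS[(near_edge, bumped)]
    p *= p_fall if falls else 1 - p_fall
    return p


def posterior_near_edge_given_fell():
    """P(near_edge | falls), summing the joint over the unobserved 'bumped'."""
    numerator = sum(joint(True, b, True) for b in (True, False))
    evidence = sum(
        joint(n, b, True) for n, b in product((True, False), repeat=2)
    )
    return numerator / evidence


if __name__ == "__main__":
    print(round(posterior_near_edge_given_fell(), 3))
```

With these made-up numbers, the posterior comes out near 0.76, well above the 0.2 prior: seeing the tool fall makes "it was near the edge" much more believable. Enumerating every case like this only works for tiny graphs, which is one reason richer probabilistic programs are attractive for real scenes.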

Tenenbaum described one experiment in which he built a probabilistic model of our intuitive or “common” sense of how physics works that uses graphic simulations of several wooden blocks stacked in various configurations. People were shown the block towers and were asked to judge how stable they were. Because some configurations of blocks can look deceptively stable or unstable to most people, the test subjects were good – but not perfect – at guessing how stable each tower was. The intuitive physics model was also asked to rate the stability of the block towers. Tenenbaum then compared the responses of the model to those of people, finding that the intuitive physics engine approximated people’s guesses very well, regardless of whether those guesses reflected real, Newtonian physics or not. Tenenbaum and other scientists have applied the same probabilistic approach to other lines of inquiry about the block problem with success.
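One way to picture an intuitive physics model of this kind is as a simulation run many times over slightly different guesses about where the blocks are. The Python sketch below is a deliberately simplified stand-in, not the model from the experiment: it assumes a single column of equal-width, equal-mass blocks in two dimensions, a basic center-of-mass rule instead of a real physics engine, and Gaussian noise on perceived positions. The names `is_stable` and `probability_of_falling` and all parameter values are hypothetical.

```python
import random

BLOCK_WIDTH = 1.0  # all blocks share one width in this toy setup


def is_stable(centers):
    """True if, at every level, the blocks above balance on the block below.

    centers[0] is the horizontal center of the bottom block; each later
    entry sits directly on the previous one. Equal masses are assumed.
    """
    for i in range(len(centers) - 1):
        above = centers[i + 1:]
        center_of_mass = sum(above) / len(above)
        if abs(center_of_mass - centers[i]) > BLOCK_WIDTH / 2:
            return False
    return True


def probability_of_falling(centers, noise=0.1, samples=2000, seed=0):
    """Fraction of noisy 'mental simulations' in which the tower falls."""
    rng = random.Random(seed)
    falls = 0
    for _ in range(samples):
        # Each sample perturbs the block positions to mimic uncertain perception.
        perceived = [c + rng.gauss(0.0, noise) for c in centers]
        if not is_stable(perceived):
            falls += 1
    return falls / samples


if __name__ == "__main__":
    print(probability_of_falling([0.0, 0.0, 0.0]))  # neatly stacked tower
    print(probability_of_falling([0.0, 0.3, 0.6]))  # leaning but stable tower
```

The leaning tower passes the exact stability check, yet a sizeable share of the noisy simulations predict a fall. That graded answer is the kind of output that can be compared with human judgments, including the cases where people's guesses depart from what Newtonian physics says.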

Tenenbaum’s intuitive physics model has a distinct advantage over conventional approaches to machine learning. Most of the conventional approaches rely on pattern recognition, the same technology that underlies Google image search and face recognition in Facebook. For example, Facebook is working on its own pattern recognition program for the block tower problem. The approach is successful for a few blocks, but runs into problems when the number of blocks increases or when you want it to make other kinds of predictions about the blocks, for instance how much of the tower will fall or which way it will fall. The problem is that pattern recognition programs are data-hungry. Each new line of inquiry requires massive amounts of data to make good predictions.

Unlike pattern recognition programs, humans don’t need massive amounts of new data every time they need to interpret their environment in a new way. You can ask most people to predict which way a block tower will fall, and they will make a decent guess, even if they’ve never had experience with that exact scenario before. People can build models quickly, often using just a few examples – or even just one – to make useful generalizations about how the world works, and apply them flexibly to new situations. This is called one-shot learning.

In a different experiment, Tenenbaum was able to use a probabilistic approach to develop character recognition software that is capable of one-shot learning. In this experiment, Tenenbaum used an “Omniglot” data set that borrowed more than 1,600 handwritten characters from 50 different alphabets. A computer vision program based on Bayesian Program Learning (BPL) was developed to match or redraw characters from the Omniglot set after being shown only one or a few examples. In one of several tasks, both human subjects and BPL were shown a character and were asked to draw nine examples of that character. Then the samples were subjected to a Turing test: the samples were shown to judges, who had to guess whether the characters were generated by humans or computers. In order for BPL to pass this Turing test, it wouldn’t be enough for BPL to draw one character the exact same way nine times, since no one writes a letter exactly the same way every time. BPL would have to generate some variation, while still reproducing a recognizable version of the character. BPL was able to fool the judges, who identified which character sets were generated by machines and which by humans about as accurately as if they had picked at random.
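What it means for judges to perform "about as accurately as" chance can be checked with a simple binomial calculation. The Python sketch below is a generic statistical illustration, not the analysis used in the experiment, and the judge counts in the usage example are hypothetical rather than results from the study. It asks: if a judge were purely guessing between "human" and "machine", how surprising would the observed number of correct answers be?

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k correct answers in n trials when guessing."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(correct, trials, chance=0.5):
    """Probability, under pure guessing, of an outcome at least as unlikely
    as the observed one (exact two-sided binomial test)."""
    observed = binomial_pmf(correct, trials, chance)
    return sum(
        binomial_pmf(k, trials, chance)
        for k in range(trials + 1)
        # Small tolerance so outcomes tied with the observed one are included.
        if binomial_pmf(k, trials, chance) <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical judges, not data from the study.
    print(two_sided_p_value(52, 100))  # close to chance: large p-value
    print(two_sided_p_value(70, 100))  # clearly better than chance: tiny p-value
```

A large p-value, as for the hypothetical judge with 52 correct answers out of 100, means the result is indistinguishable from coin flipping, which is what passing this kind of Turing test looks like. A judge who could reliably tell human from machine drawings would instead produce a very small p-value.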

In the future, Tenenbaum hopes to expand the capabilities of his probabilistic programs, enabling them to make inferences more quickly and to accomplish more complex learning tasks, as well as to integrate them with neural networks and implement them in robots.