Oh joy – Alphabet Inc.’s AI subsidiary DeepMind has taken another step forward in understanding the world the way humans do.
The company that has previously made headlines for building AI systems that learn to play video games — correction, that crush humans at video games — has now built a system better able to “see” and understand a space. DeepMind’s scientists have created an artificial vision system that can take something like a handful of two-dimensional photos and from them construct a 3D model of a scene.
London-based DeepMind published details about the new system, called the Generative Query Network, in the journal Science today. The company also walked through some of the details on its own blog, explaining how the system can take images of a scene and render 3D views of it from different viewpoints.
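At a high level, a GQN-style model has two halves: a representation network that pools a few (image, camera-viewpoint) observations into a single scene vector, and a generation network that renders a predicted image of that scene from a query viewpoint it never observed. The toy sketch below illustrates only that data flow — the layer shapes, random weights, and 4x4 “images” are all hypothetical stand-ins, nothing like DeepMind’s actual deep convolutional and recurrent architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two trained networks (random weights here;
# the real GQN learns these end-to-end from data).
W_enc = rng.standard_normal((16 + 3, 8))   # (flattened 4x4 image + pose) -> embedding
W_dec = rng.standard_normal((8 + 3, 16))   # (scene repr + query pose) -> 4x4 image

def represent(images, viewpoints):
    """Aggregate (image, viewpoint) observations into one scene vector.

    Summing per-observation embeddings makes the representation invariant
    to the order in which the snapshots were taken.
    """
    r = np.zeros(8)
    for img, vp in zip(images, viewpoints):
        r += np.tanh(np.concatenate([img.ravel(), vp]) @ W_enc)
    return r

def render(scene_repr, query_viewpoint):
    """Predict a 4x4 image of the scene from an arbitrary viewpoint."""
    out = np.tanh(np.concatenate([scene_repr, query_viewpoint]) @ W_dec)
    return out.reshape(4, 4)

# Two 4x4 "snapshots" of a scene, each tagged with a camera pose (x, y, yaw).
images = [rng.random((4, 4)), rng.random((4, 4))]
viewpoints = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.5])]

r = represent(images, viewpoints)
prediction = render(r, np.array([0.5, 1.0, -0.5]))  # a pose never observed
print(prediction.shape)  # (4, 4)
```

The key idea the sketch preserves is the split: everything the model knows about the scene must be squeezed through the scene vector `r`, which is what forces it to capture geometry, lighting and layout rather than memorizing pixels.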
“Today,” DeepMind notes via the blog, “state-of-the-art visual recognition systems are trained using large datasets of annotated images produced by humans. Acquiring this data is a costly and time-consuming process, requiring individuals to label every aspect of every object in each scene in the dataset. As a result, often only a small subset of a scene’s overall contents is captured, which limits the artificial vision systems trained on that data. As we develop more complex machines that operate in the real world, we want them to fully understand their surroundings: where is the nearest surface to sit on? What material is the sofa made of? Which light source is creating all the shadows? Where is the light switch likely to be?”
What it sounds like — to someone, admittedly, who’s not a scientist — is that this is almost a way of encouraging an imagination of sorts in an AI system. A machine like this approaches so many things fresh, without a learned experience or body of knowledge to draw from or even use to help make guesses about the world. It needs to be taught how to imagine, how to make guesses based on what it “sees,” and this seems like a way of doing that. (A potentially scary way of doing that, depending on how you feel about machines learning and getting more human-like.)
In a statement accompanying today’s release of the research, the paper’s lead author Ali Eslami notes that one of the findings is that deep networks are in fact able to learn about things like perspective and lighting without any human engineering.
“The proposed approach,” according to DeepMind, “does not require domain-specific engineering or time-consuming labelling of the contents of scenes, allowing the same model to be applied to a range of different environments. It also learns a powerful neural renderer that is capable of producing accurate images of scenes from new viewpoints.
“While there is still much more research to be done before our approach is ready to be deployed in practice, we believe this work is a sizable step towards fully autonomous scene understanding.”