MIT's new AI can teach itself to control robots by watching the world through their eyes — it only needs a single camera

A soft robotic hand curls its fingers to grasp a wooden pencil.
(Image credit: MIT CSAIL)

Scientists at MIT have developed a novel vision-based artificial intelligence (AI) system that can teach itself how to control virtually any robot without the use of sensors or pretraining.

The system gathers data about a given robot’s architecture using cameras, in much the same way that humans use their eyes to learn about themselves as they move.

This allows the AI controller to develop a self-learning model for operating any robot — essentially giving machines a humanlike sense of physical self-awareness.

Researchers achieved this breakthrough by creating a new control paradigm that uses a camera’s video stream to construct a robot’s "visuomotor Jacobian field," a learned mapping that links the machine’s visible 3D points to its actuators.

The AI model can then predict precise motor movements. This makes it possible to turn non-traditional robot architectures, such as soft robots and machines built from flexible materials, into autonomous units with only a few hours of training.
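At its core, the Jacobian field boils down to a simple relationship: for every visible 3D point on the robot, the model predicts how that point will shift for a small change in each actuator command. The Python snippet below is a minimal, illustrative sketch of that relationship only; the jacobian_field function, the array shapes and the toy numbers are assumptions made for illustration, not the MIT team's code.

```python
import numpy as np

# Toy "visuomotor Jacobian field": for every visible 3D point on the robot,
# return a 3 x K matrix describing how that point moves per unit change in
# each of the K actuator commands. A learned model would predict this from
# camera images; here it is a random stand-in purely for illustration.
def jacobian_field(points: np.ndarray, num_actuators: int = 4) -> np.ndarray:
    n = points.shape[0]
    rng = np.random.default_rng(0)
    return rng.normal(scale=0.01, size=(n, 3, num_actuators))  # shape (N, 3, K)

# Visible 3D points on the robot's body (N points, xyz coordinates).
points = np.array([[0.10, 0.02, 0.30],
                   [0.12, 0.05, 0.28]])

# A small change in the actuator commands (K = 4 actuators).
delta_command = np.array([0.5, -0.2, 0.0, 0.1])

# Predicted motion of each visible point: delta_p is roughly J(p) @ delta_u.
J = jacobian_field(points)
predicted_motion = J @ delta_command          # shape (N, 3)
predicted_points = points + predicted_motion  # where each point should end up
print(predicted_points)
```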

"Think about how you learn to control your fingers: you wiggle, you observe, you adapt,” explained Sizhe Lester Li, a PhD student at MIT CSAIL and lead researcher on the project, in a press release. “That’s what our system does. It experiments with random actions and figures out which controls move which parts of the robot."

Related: Scientists burned, poked and sliced their way through new robotic skin that can 'feel everything'

Typical robotics solutions rely on precision engineering to create machines to exact specifications that can be controlled using pre-trained systems. These can require expensive sensors and AI models developed with hundreds or thousands of hours of fine-tuning in order to anticipate every possible permutation of movement. Gripping objects with handlike appendages, for example, remains a difficult challenge in the arenas of both machine engineering and AI system control.

Understanding the world around you

Using the "Jacobian field" mapping camera solution, in contrast, provides a low-cost, high-fidelity solution to the challenge of automating robot systems.

The team published its findings June 25 in the journal Nature. In the paper, the researchers said the work was designed to imitate the way the human brain learns to control machines.

Humans, the researchers noted, learn to reconstruct 3D configurations and predict motion as a function of control from vision alone. According to the paper, “people can learn to pick and place objects within minutes” when controlling robots with a video game controller, and “the only sensors we require are our eyes.”

The system’s framework was developed using two to three hours of multi-view video, captured by 12 consumer-grade RGB-D cameras, of a robot executing randomly generated commands.
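In practice, that training set amounts to a log of random commands paired with synchronized frames from every camera. The loop below is a rough, hypothetical sketch of how such data could be gathered; the robot and camera interfaces are invented placeholders, since the paper's actual tooling is not described here.

```python
import time

def collect_training_data(robot, cameras, duration_s=3 * 3600, rate_hz=5):
    """Record (command, multi-view frames) pairs while the robot executes
    randomly generated commands. robot.random_command(), robot.apply() and
    cam.capture() are assumed interfaces, not the MIT team's API."""
    dataset = []
    steps = int(duration_s * rate_hz)
    for _ in range(steps):
        command = robot.random_command()             # random actuation within safe limits
        robot.apply(command)
        frames = [cam.capture() for cam in cameras]  # e.g. 12 RGB-D views
        dataset.append({"command": command, "frames": frames, "t": time.time()})
        time.sleep(1.0 / rate_hz)
    return dataset
```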

This framework is made up of two key components. The first is a deep-learning model that essentially allows the robot to determine where it and its appendages are in three-dimensional space, which lets it predict how its position will change as specific movement commands are executed. The second is a machine-learning program that translates generic movement commands into code a robot can understand and execute.
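With a model like that in hand, closed-loop control amounts to inverting the prediction: given where the visible points should move, solve for the actuator command that best produces that motion. The sketch below shows one standard way to do that with linear least squares; the function name, shapes and solver choice are assumptions for illustration, not the published system.

```python
import numpy as np

def command_for_target_motion(J: np.ndarray, desired_motion: np.ndarray) -> np.ndarray:
    """Solve for the actuator command that best produces the desired motion of
    the robot's visible 3D points, given a predicted Jacobian field.

    J              -- (N, 3, K) Jacobian: point motion per unit actuator change
    desired_motion -- (N, 3) target displacement for each visible point
    Returns a length-K command found by linear least squares."""
    n, _, k = J.shape
    A = J.reshape(n * 3, k)            # stack all points into one linear system
    b = desired_motion.reshape(n * 3)
    command, *_ = np.linalg.lstsq(A, b, rcond=None)
    return command
```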

The team tested the new training and control paradigm by benchmarking its effectiveness against traditional camera-based control methods. The Jacobian field solution surpassed those existing 2D control systems in accuracy — especially when the team introduced visual occlusion that caused the older methods to enter a fail state. Machines using the team’s method, however, successfully created navigable 3D maps even when scenes were partially occluded with random clutter.

Once the scientists developed the framework, they applied it to various robots with widely varying architectures. The end result was a control program that requires no further human intervention to train and operate robots using only a single video camera.

Tristan is a U.S.-based science and technology journalist. He covers artificial intelligence (AI), theoretical physics, and cutting-edge technology stories.

His work has been published in numerous outlets including Mother Jones, The Stack, The Next Web, and Undark Magazine.

Prior to journalism, Tristan served in the U.S. Navy for 10 years as a programmer and engineer. When he isn’t writing, he enjoys gaming with his wife and studying military history.
