Figure 02: Observations, questions, and speculations

Yesterday, Figure revealed the second generation of its humanoid robot, and the demonstration is stunning. In two years, the startup has made impressive progress and is catching up with some of the most prestigious robotics companies.

Figure 02 has a more capable body, improved hardware, and many new features. However, Figure will have to do much more than demos to prove its new robot is promising, and the video leaves a lot of room for speculation. Here is what we know, and what we can guess, about the new robot.

Robot upgrade

The most impressive part of the announcement is the robot itself, with a sleeker design, better battery life, and hands with more degrees of freedom. The robot will reportedly work for up to 20 hours on a single charge. The packaging has also been upgraded to hide the wiring and reduce safety risks.

Figure 02 has six cameras placed in the head, front torso, and back torso, giving it a 360-degree view of its surroundings. The robot’s gait is still a bit clunky compared to Tesla’s Optimus and Boston Dynamics’ new Atlas, but it is still remarkably good.

With advances in vision-language models (VLMs) and vision-language-action models (VLAs), much of the discussion around robotics has shifted to the machine learning layer. But pure robotics engineering and control remains one of the hardest challenges in the field. And in this regard, Figure has done a great job.

Where is the model?

According to the demonstration video, Figure 02 is equipped with three times more compute power than its predecessor. It also runs its own onboard VLM and speech recognition models. However, the details are a bit confusing.

The original Figure robot reportedly used GPT-4 on the OpenAI cloud. In an X thread, Figure founder Brett Adcock posted a diagram of how the new robot works, which includes “Onboard mics + speakers connected to custom AI models trained in partnership with OpenAI.” Another post in the thread states that the onboard VLM “enables semantic grounding and fast common-sense visual reasoning from robot cameras.”

However, OpenAI is not in the business of deploying on-device models, and the diagram makes it clear that the reasoning and planning are done by OpenAI models before sending commands to the robot’s control system. 
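
The diagram suggests a split along these lines: a cloud model handles reasoning and planning, and the robot’s onboard stack turns the resulting commands into motion. The sketch below is purely illustrative; every name in it (Observation, Command, plan_in_cloud, OnboardController) is hypothetical, and the “planner” is a trivial rule standing in for a remote model call.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    transcript: str  # onboard speech recognition output
    camera_frames: list = field(default_factory=list)  # frames from the six cameras

@dataclass
class Command:
    skill: str    # high-level skill, e.g. "pick" or "place"
    target: str   # object or location the skill applies to

def plan_in_cloud(obs: Observation) -> list[Command]:
    """Stand-in for cloud-side reasoning/planning. A real system would
    send the observation to a remote model and parse its response."""
    if "pick up" in obs.transcript:
        return [Command("pick", "sheet_metal"), Command("place", "fixture")]
    return []

class OnboardController:
    """Stand-in for the local control stack that converts high-level
    commands into joint-level motions."""
    def execute(self, cmd: Command) -> None:
        print(f"executing: {cmd.skill} -> {cmd.target}")

obs = Observation(transcript="pick up the part")
controller = OnboardController()
for cmd in plan_in_cloud(obs):   # reasoning happens off-robot...
    controller.execute(cmd)      # ...execution happens on-robot
```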

It will be interesting to see if there is a division of labor between cloud-based and on-device models. The arrangement of the input sensors also raises questions. There is an open discussion on where to fuse different modalities: does Figure use early fusion, where all modalities are tokenized and embedded together, or late fusion, where vision and language are processed separately and then blended?
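
To make the distinction concrete, here is a toy PyTorch sketch of both strategies. The token counts, dimensions, and modules are arbitrary and are not a guess at Figure’s actual architecture.

```python
import torch
import torch.nn as nn

d = 64                                  # shared embedding width
img_tokens = torch.randn(1, 16, d)      # 16 vision tokens (e.g. image patches)
txt_tokens = torch.randn(1, 8, d)       # 8 language tokens

# Early fusion: concatenate all modalities into one token sequence and let
# a single transformer attend across them from the first layer.
early_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2)
fused_early = early_encoder(torch.cat([img_tokens, txt_tokens], dim=1))

# Late fusion: encode each modality separately, then blend the pooled
# features at the end (here with a simple linear layer).
vision_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 1)
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 1)
blend = nn.Linear(2 * d, d)

v = vision_encoder(img_tokens).mean(dim=1)   # pooled vision feature
t = text_encoder(txt_tokens).mean(dim=1)     # pooled language feature
fused_late = blend(torch.cat([v, t], dim=-1))

print(fused_early.shape, fused_late.shape)   # (1, 24, 64) and (1, 64)
```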

Learning on the job

How does Figure 02 learn to perform new tasks? The video shows the robot working in a BMW factory with the labels “100% autonomous neural network learned placement” and “self-correcting learned behavior.”

Another diagram in the X thread suggests that the company has a DataOps and MLOps pipeline that continuously gathers new data, trains new models, and deploys them to the fleet of robots.
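
At a schematic level, such a loop could look like the sketch below: collect episodes from the fleet, train a new policy offline, and push it back to the robots. Everything here is a placeholder, not a description of Figure’s actual pipeline.

```python
import random

class Robot:
    """Placeholder for one robot in the fleet."""
    def __init__(self, name: str):
        self.name = name
        self.policy_version = 0

    def run_and_log(self) -> dict:
        # Pretend the robot performed tasks and logged the outcome.
        return {"robot": self.name, "success": random.random() > 0.2}

def train_policy(episodes: list, version: int) -> int:
    # Stand-in for offline training on newly collected episodes.
    print(f"training policy v{version + 1} on {len(episodes)} episodes")
    return version + 1

fleet = [Robot(f"figure-{i}") for i in range(3)]
version = 0
for cycle in range(2):                              # two turns of the loop
    episodes = [r.run_and_log() for r in fleet]     # 1. gather fleet data
    version = train_policy(episodes, version)       # 2. train a new model
    for r in fleet:
        r.policy_version = version                  # 3. deploy to the fleet
```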

However, this raises more questions about the partnership between OpenAI and Figure and what kind of model-training support OpenAI is providing. More details on this front could hint at OpenAI’s future plans for robotics.

I’m also interested to know if the robot uses any kind of in-context learning to dynamically adjust its behavior. One of the important features of LLMs and VLMs is their ability to use examples and observations in their prompts to correct their responses. The models could use automated prompting techniques to analyze the outcomes of their actions and adjust the robot’s commands.
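
As a rough illustration, an outcome-driven prompting loop might look like the sketch below, where query_vlm is a placeholder for whatever model endpoint the robot actually calls.

```python
def query_vlm(prompt: str) -> str:
    """Placeholder: a real system would send the prompt (plus camera
    frames) to a vision-language model and return its answer."""
    return "shift grip 5 mm left and retry"

history: list[tuple[str, str]] = []   # (action, observed outcome) pairs

def next_command(task: str, last_action: str, outcome: str) -> str:
    """Keep past attempts in the prompt so the model can self-correct."""
    history.append((last_action, outcome))
    examples = "\n".join(f"action: {a} -> outcome: {o}" for a, o in history)
    prompt = (f"Task: {task}\n"
              f"Previous attempts:\n{examples}\n"
              "Propose a corrected next action.")
    return query_vlm(prompt)

print(next_command("place part on fixture",
                   "lower part at target pose",
                   "part slipped off the edge"))
```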

Exciting work, lots of questions. I look forward to seeing how it all pans out.
