RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

CVPR 2025

1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2Beijing Academy of Artificial Intelligence; 3Institute of Automation, Chinese Academy of Sciences; 4Institute of Information Engineering, Chinese Academy of Sciences; 5The University of Hong Kong; 6School of Artificial Intelligence, University of Chinese Academy of Sciences. *Equal contribution. Project leaders. Corresponding author.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise because current MLLMs lack three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance these robotic brain capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we develop RoboBrain, an MLLM-based model that combines robotic and general multimodal data, employs a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.

RoboBrain Overview

[Figure: teaser]
RoboBrain integrates three key robotic capabilities for long-horizon manipulation tasks: planning, affordance perception, and trajectory prediction. Trained on our ShareRobot dataset through a well-designed multi-stage process, RoboBrain achieves state-of-the-art performance on multiple robotic benchmarks, realizing a cognitive leap from abstract instruction understanding to concrete action expression.

RoboBrain Model

[Figure: pipeline]
The pipeline of RoboBrain. Images and videos are fed into the model to pre-train a foundational robotic brain. We then fine-tune RoboBrain via A-LoRA and T-LoRA to develop affordance perception and trajectory prediction skills. In practical applications, the model first generates a detailed plan, then splits it into sub-task descriptions to execute specific robotic tasks.
[Figure: training configuration]
Detailed configuration for each training stage of RoboBrain.
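To make this plan-then-specialize flow concrete, below is a minimal inference sketch assuming a HuggingFace-style checkpoint with PEFT LoRA adapters. All paths, adapter names, and prompts are hypothetical illustrations, not the released RoboBrain API.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE = "path/to/robobrain-base"                  # hypothetical checkpoint path
processor = AutoProcessor.from_pretrained(BASE)
model = AutoModelForVision2Seq.from_pretrained(BASE)

# One shared backbone, two switchable LoRA adapters (A-LoRA / T-LoRA).
model = PeftModel.from_pretrained(model, "path/to/A-LoRA", adapter_name="affordance")
model.load_adapter("path/to/T-LoRA", adapter_name="trajectory")

def ask(image, prompt, adapter=None):
    """Run one multimodal query, optionally through a LoRA adapter."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    if adapter is None:
        with model.disable_adapter():            # plain backbone for planning
            out = model.generate(**inputs, max_new_tokens=256)
    else:
        model.set_adapter(adapter)
        out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

def run_task(image, instruction):
    """Plan first, then query affordance and trajectory for each sub-task."""
    plan = ask(image, f"Decompose the task into sub-tasks: {instruction}")
    for step in plan.splitlines():               # assume one sub-task per line
        box = ask(image, f"Predict the affordance for: {step}", "affordance")
        path = ask(image, f"Predict the trajectory for: {step}", "trajectory")
        yield step, box, path
```

Keeping a single backbone with switchable adapters, rather than three separate models, means only one set of base weights needs to stay in memory while the skill-specific heads are toggled per query.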

Data Distribution

In our multi-stage training strategy, the proportional distribution and compositional structure of the training data are pivotal to RoboBrain's performance. The accompanying figure visualizes the training data.
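As a minimal sketch of how such a mixture can be consumed during training, the snippet below draws data sources by weighted sampling; the weights and source names are placeholders, not the actual proportions used for RoboBrain.

```python
import random

# Illustrative only: these mixture weights are invented placeholders,
# not RoboBrain's actual training-data proportions.
MIXTURE = {
    "general_multimodal": 0.5,   # e.g. generic image-text instruction data
    "robotic_planning":   0.3,   # ShareRobot planning QA
    "affordance":         0.1,   # ShareRobot affordance labels
    "trajectory":         0.1,   # ShareRobot trajectory labels
}

def sample_source(rng=random):
    """Draw one data source according to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

batch_sources = [sample_source() for _ in range(32)]  # sources for one batch
```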



Evaluation Results

[Figure: benchmark results]
RoboBrain consistently outperforms all baseline models across three robotic task planning benchmarks, demonstrating its robust capability in handling complex, long-horizon manipulation scenarios.

Visualization

[Figure: planning with real-time feedback]
This visualization illustrates that RoboBrain can interpret human instructions and visual observations to generate action plans and assess progress from real-time image feedback. Furthermore, it predicts trajectories for each step and identifies the corresponding affordances.
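The assess-and-retry behavior described here can be sketched as a simple closed loop. `query_model`, `execute_step`, and the success-check prompt below are hypothetical stand-ins; the paper does not prescribe this exact interface.

```python
def query_model(image, prompt):
    """Placeholder for an MLLM call (see the pipeline sketch above)."""
    return "yes"

def execute_step(step):
    """Placeholder for a robot-specific low-level controller."""
    pass

def closed_loop(get_frame, instruction, max_retries=2):
    """Execute each sub-task, re-checking success from fresh observations."""
    plan = query_model(get_frame(), f"Decompose the task: {instruction}")
    for step in plan.splitlines():
        for _ in range(max_retries + 1):
            execute_step(step)
            verdict = query_model(get_frame(),
                                  f"Was '{step}' completed? Answer yes or no.")
            if verdict.strip().lower().startswith("yes"):
                break                        # success: move to the next step
```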
[Figure: additional embodied planning]
Additional embodied planning results generated by RoboBrain. We present planning results for four distinct robotic manipulation tasks: "Water plants", "Put the pot in the drawer", "Cluster blocks of the same color into different corners", and "Clean the desk"; the first three are successful cases.
[Figure: affordance predictions]
Additional visualizations of diverse affordance areas. The text below each subfigure indicates the task instructions, while the red bounding boxes represent the affordance areas predicted by the RoboBrain model. The visualizations in the first three rows demonstrate that our RoboBrain model effectively identifies reasonable affordance areas based on human instructions and visual information. The fourth row presents several failure cases, which may stem from the model's lack of ability to perceive and localize in noisy environments. This limitation could be attributed to the absence of such scenarios in the training data used during Stage 4.
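For readers who want to reproduce this kind of overlay, here is a small rendering sketch. The [x1, y1, x2, y2] corner format for the predicted box is an assumption for illustration, not a documented output format.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def show_affordance(image_path, box, instruction):
    """Overlay a predicted affordance box (red) on the observation image."""
    img = Image.open(image_path)
    fig, ax = plt.subplots()
    ax.imshow(img)
    x1, y1, x2, y2 = box                  # corner format is an assumption
    ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                   edgecolor="red", fill=False, linewidth=2))
    ax.set_title(instruction)             # task instruction below/above figure
    ax.axis("off")
    plt.show()
```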
[Figure: 2D trajectory predictions]
Additional visualizations of diverse 2D trajectories. The red-to-purple gradient curves represent the ground truth, while the green-to-blue gradient curves indicate the predicted trajectories. The visualizations in the first two rows demonstrate that our RoboBrain model effectively generates end-effector manipulation curves based on the robot's observations and task instructions. The third row shows that RoboBrain is not merely fitting trajectories but also exhibits the ability to generate more reasonable and feasible curves.
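The gradient color scheme of this figure can be reproduced with a short matplotlib sketch; the trajectories below are synthetic stand-ins for the model's outputs.

```python
import numpy as np
import matplotlib.pyplot as plt

def draw_trajectory(ax, pts, cmap):
    """Plot a 2D waypoint sequence with a color gradient along its length."""
    t = np.linspace(0.0, 1.0, len(pts))
    ax.plot(pts[:, 0], pts[:, 1], color="lightgray", lw=1, zorder=0)
    ax.scatter(pts[:, 0], pts[:, 1], c=t, cmap=cmap, s=30, zorder=1)

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(scale=0.1, size=(20, 2)), axis=0)  # synthetic GT
pred = gt + rng.normal(scale=0.03, size=(20, 2))             # synthetic prediction

fig, ax = plt.subplots()
draw_trajectory(ax, gt, "rainbow_r")    # red -> purple: ground truth
draw_trajectory(ax, pred, "winter_r")   # green -> blue: prediction
plt.show()
```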

ShareRobot

[Figure: dataset diversity]
The diversity of our ShareRobot dataset. Our dataset involves (a) 23 original datasets, (b) 12 embodiments, and (c) 107 types of atomic tasks; panel (c) also presents the distribution of the top 20 most frequent atomic actions in ShareRobot.
[Figure: dataset pipeline]
The generation process of our ShareRobot dataset. Our dataset labels multi-dimensional information, including task planning, object affordance, and end-effector trajectories. Task planning is first annotated as atomic tasks and then augmented by constructing question-answer pairs. Affordances and trajectories are labeled on the images according to the specific instructions.
[Figure: prompts for Gemini]
Additional visualizations of prompts for Gemini. The prompts encapsulate the task description for robotic arm action recognition, the components of the target, and the desired response format. Additionally, an example is included to help Gemini understand the specific task.
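As an invented illustration of that prompt structure (task description, target components, response format, plus one example), a sketch might look like the following; the wording is not the actual prompt used with Gemini.

```python
# Hypothetical annotation prompt following the structure described above;
# not the actual Gemini prompt from the ShareRobot pipeline.
ANNOTATION_PROMPT = """\
You are annotating robotic-arm videos.
Task: recognize the atomic action performed by the arm in each clip.
Target components: the action verb, the manipulated object, and the goal location.
Response format: JSON like {"action": ..., "object": ..., "location": ...}.

Example:
Clip: the arm lifts a red cup and places it on the shelf.
Answer: {"action": "place", "object": "red cup", "location": "shelf"}
"""
```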
[Figure: question templates]
Templates of 10 question types. We define 10 question types for planning, each with 5 different templates, to ensure diversity in ShareRobot's question formulations.
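A toy sketch of this template-based question construction is shown below; the templates, question types, and sample task are invented for illustration and differ from the actual ShareRobot templates.

```python
import random

# Invented examples in the spirit of the figure, not ShareRobot's templates.
TEMPLATES = {
    "next_step": [
        "While doing '{task}', the robot just finished '{done}'. What comes next?",
        "Given the task '{task}', what should follow the step '{done}'?",
    ],
    "full_plan": [
        "List every sub-task needed to complete '{task}'.",
        "How would you decompose '{task}' into atomic actions?",
    ],
}

def make_qa_pairs(task, steps, rng=random):
    """Instantiate templates over one annotated plan to get QA pairs."""
    pairs = []
    for prev, nxt in zip(steps, steps[1:]):
        q = rng.choice(TEMPLATES["next_step"]).format(task=task, done=prev)
        pairs.append((q, nxt))
    q = rng.choice(TEMPLATES["full_plan"]).format(task=task)
    pairs.append((q, "; ".join(steps)))
    return pairs

print(make_qa_pairs("water plants",
                    ["pick up the watering can",
                     "move to the plant",
                     "tilt the can to pour water"]))
```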

RoboBrain in the real world

In the future, we will further optimize RoboBrain to enhance its generalization and robustness as an embodied brain model, and apply it to a wider range of real-world scenarios to provide stronger support for the development of robotics technology. We will continue to update this page.