Embodied-R1-3B-v1
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (ICLR 2026)
[🌐 Project Website] [📄 Paper] [🏆 ICLR2026 Version] [🎯 Dataset] [📦 Code]
Model Details
Model Description
Embodied-R1 is a 3B vision-language model (VLM) for general robotic manipulation. It introduces a Pointing mechanism and uses Reinforced Fine-tuning (RFT) to bridge perception and action, with strong zero-shot generalization in embodied tasks.
Figure: Embodied-R1 framework, performance overview, and zero-shot manipulation demos.
Model Sources
- Repository: https://github.com/pickxiguapi/Embodied-R1
- Paper: http://arxiv.org/abs/2508.13998
- OpenReview: https://openreview.net/forum?id=i5wlozMFsQ
Updates
- [2026-03] VABench-P and VABench-V benchmarks released
- [2026-03-03] Embodied-R1 dataset released: https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset
- [2026-01-27] Accepted by ICLR 2026
- [2025-08-22] Embodied-R1-3B-v1 checkpoint released
Intended Uses
Direct Use
This model is intended for research and benchmarking in embodied reasoning and robotic manipulation tasks, including:
- Visual target grounding (VTG)
- Referring region grounding (RRG/REG-style tasks)
- Open-form grounding (OFG)
Out-of-Scope Use
- Safety-critical real-world deployment without additional safeguards and validation
- Decision-making in high-risk domains
- Any use requiring guaranteed robustness under distribution shift
How to Use
Setup
git clone https://github.com/pickxiguapi/Embodied-R1.git
cd Embodied-R1
conda create -n embodied_r1 python=3.11 -y
conda activate embodied_r1
pip install transformers==4.51.3 accelerate
pip install "qwen-vl-utils[decord]"
Inference
python inference_example.py
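The contents of `inference_example.py` are not reproduced here. As a minimal sketch, the code below shows one plausible way to query the checkpoint, assuming the released model follows the standard Qwen2.5-VL chat interface that the pinned `transformers==4.51.3` and `qwen-vl-utils` dependencies suggest. The Hub model ID (`IffYuan/Embodied-R1-3B-v1`) and the prompt wording are illustrative assumptions, not taken from the repo.

```python
# Sketch only: assumes a Qwen2.5-VL-style interface. Model ID and prompt
# are assumptions; see inference_example.py in the repo for the real script.

def build_messages(image_path: str, instruction: str) -> list:
    """Assemble a single-turn multimodal chat message in the Qwen-VL format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


def run_inference(image_path: str, instruction: str) -> str:
    # Heavy imports kept local so build_messages stays usable without the model.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model_id = "IffYuan/Embodied-R1-3B-v1"  # assumed Hub ID
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, instruction)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, videos = process_vision_info(messages)
    inputs = processor(
        text=[text], images=images, videos=videos, return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    trimmed = out[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]


# Example usage (requires a GPU and a model download):
#   print(run_inference("demo.jpg", "Point to the red block."))
```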
Example Tasks
- VTG: put the red block on top of the yellow block
- RRG: put pepper in pan
- REG: bring me the camel model
- OFG: loosening stuck bolts
(Visualization examples are available in the project repo: assets/)
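Downstream code typically needs to recover the predicted points from the model's text output before visualizing them. The helper below is a hedged sketch that extracts 2D points written as `(x, y)` pairs; the actual output format emitted by Embodied-R1 may differ, so check the repo's parsing utilities and adapt the pattern accordingly.

```python
import re

# Assumption: the model emits pixel coordinates as "(x, y)" pairs somewhere
# in its text response. Adjust the regex to your checkpoint's actual format.
_POINT_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")


def extract_points(text: str) -> list[tuple[float, float]]:
    """Return all (x, y) coordinate pairs found in a model response."""
    return [(float(x), float(y)) for x, y in _POINT_RE.findall(text)]
```

The extracted points can then be drawn onto the input image (e.g. with PIL) to reproduce visualizations like those in `assets/`.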
Evaluation
cd eval
python hf_inference_where2place.py
python hf_inference_vabench_point.py
...
Related benchmarks include Where2Place and VABench-Point (see the corresponding eval scripts above).
Training
Training scripts are available at: https://github.com/pickxiguapi/Embodied-R1/tree/main/scripts
# Stage 1 training
bash scripts/stage_1_embodied_r1.sh
# Stage 2 training
bash scripts/stage_2_embodied_r1.sh
Key files:
- scripts/config_stage1.yaml
- scripts/config_stage2.yaml
- scripts/stage_1_embodied_r1.sh
- scripts/stage_2_embodied_r1.sh
- scripts/model_merger.py (checkpoint merging + HF export)
Limitations
- Performance may vary across environments, camera viewpoints, and unseen object domains.
- Outputs are generated from visual-language reasoning and may include localization/action errors.
- Additional system-level constraints (calibration, motion planning, safety checks) are required for real robot deployment.
Citation
@inproceedings{yuan2026embodied,
title={{Embodied-R1}: Reinforced embodied reasoning for general robotic manipulation},
author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Tang, Hongyao and Hao, Jianye},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
@inproceedings{yuan2026seeing,
title={From seeing to doing: Bridging reasoning and decision for robotic manipulation},
author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
Acknowledgements
If this model or its accompanying resources are useful for your research, please consider citing our work and starring the repository.