Embodied-R1-3B-v1

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (ICLR 2026)

[🌐 Project Website] [📄 Paper] [🏆 ICLR2026 Version] [🎯 Dataset] [📦 Code]


Model Details

Model Description

Embodied-R1 is a 3B-parameter vision-language model (VLM) for general robotic manipulation. It introduces a Pointing mechanism and uses Reinforced Fine-tuning (RFT) to bridge perception and action, achieving strong zero-shot generalization on embodied tasks.

Figure: Embodied-R1 framework, performance overview, and zero-shot manipulation demos.


Intended Uses

Direct Use

This model is intended for research and benchmarking in embodied reasoning and robotic manipulation tasks, including:

  • Visual target grounding (VTG)
  • Referring region grounding (RRG/REG-style tasks)
  • Open-form grounding (OFG)

Out-of-Scope Use

  • Safety-critical real-world deployment without additional safeguards and validation
  • Decision-making in high-risk domains
  • Any use requiring guaranteed robustness under distribution shift

How to Use

Setup

git clone https://github.com/pickxiguapi/Embodied-R1.git
cd Embodied-R1

conda create -n embodied_r1 python=3.11 -y
conda activate embodied_r1

pip install transformers==4.51.3 accelerate
pip install "qwen-vl-utils[decord]"

Inference

python inference_example.py
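For reference, the sketch below shows how a single image-plus-instruction query might be assembled before generation. It assumes the model follows the Qwen2.5-VL chat-message format (consistent with the `transformers`/`qwen-vl-utils` dependencies above); the exact prompt template should be checked against `inference_example.py` in the repo.

```python
# Minimal sketch of preparing a pointing query for Embodied-R1.
# Assumption: the model uses the Qwen2.5-VL chat format; verify against
# inference_example.py in the repo before relying on this structure.

def build_pointing_messages(image_path: str, instruction: str) -> list:
    """Build a Qwen-VL-style chat message for one image and one instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]

messages = build_pointing_messages(
    "demo.jpg", "put the red block on top of the yellow block"
)
```

With the model loaded via `transformers`, this `messages` list would then be passed through the processor's chat template and on to `generate`, as in the repo's inference script.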

Example Tasks

  • VTG: put the red block on top of the yellow block
  • RRG: put pepper in pan
  • REG: bring me the camel model
  • OFG: loosening stuck bolts

(Visualization examples are available in the project repo: assets/)
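Because the model answers grounding queries with 2-D points, downstream code needs to recover pixel coordinates from the generated text. The helper below is a hedged sketch that assumes coordinates appear as `(x, y)` integer pairs in the answer; the actual output schema should be confirmed against the repo's examples.

```python
import re

def parse_points(answer: str) -> list:
    """Extract integer (x, y) pixel pairs written as "(x, y)" from model output.

    The "(x, y)" textual format is an assumption; check the repo's
    inference examples for the exact output schema.
    """
    pairs = re.findall(r"\((\d+),\s*(\d+)\)", answer)
    return [(int(x), int(y)) for x, y in pairs]

# Example: two grounded points extracted from a model answer.
print(parse_points("The target is at (132, 245) and (300, 41)."))
# -> [(132, 245), (300, 41)]
```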


Evaluation

cd eval
python hf_inference_where2place.py
python hf_inference_vabench_point.py
...
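Pointing benchmarks such as Where2Place typically score whether a predicted point lands inside a ground-truth region mask. The snippet below is a minimal sketch of that check under the assumption that masks are boolean H×W arrays and points are given in (x, y) pixel order; the authoritative metric is whatever the eval scripts implement.

```python
import numpy as np

def point_hit(mask: np.ndarray, point: tuple) -> bool:
    """True if an (x, y) point lies inside the ground-truth region mask.

    mask: boolean array of shape (H, W); point: (x, y) in pixels.
    The (x, y) ordering is an assumption; verify against the eval scripts.
    """
    x, y = point
    h, w = mask.shape
    return 0 <= x < w and 0 <= y < h and bool(mask[y, x])

# Toy example: a rectangular target region.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 20:30] = True  # region spans x in [20, 30), y in [40, 60)
print(point_hit(mask, (25, 50)))  # inside the region
print(point_hit(mask, (5, 5)))    # outside the region
```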



Training

Training scripts are available at: https://github.com/pickxiguapi/Embodied-R1/tree/main/scripts

# Stage 1 training
bash scripts/stage_1_embodied_r1.sh

# Stage 2 training
bash scripts/stage_2_embodied_r1.sh

Key files:

  • scripts/config_stage1.yaml
  • scripts/config_stage2.yaml
  • scripts/stage_1_embodied_r1.sh
  • scripts/stage_2_embodied_r1.sh
  • scripts/model_merger.py (checkpoint merging + HF export)

Limitations

  • Performance may vary across environments, camera viewpoints, and unseen object domains.
  • Outputs are generated from visual-language reasoning and may include localization/action errors.
  • Additional system-level constraints (calibration, motion planning, safety checks) are required for real robot deployment.

Citation

@inproceedings{yuan2026embodied,
  title={Embodied-{R1}: Reinforced Embodied Reasoning for General Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Tang, Hongyao and Hao, Jianye},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

@inproceedings{yuan2026seeing,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Acknowledgements

If this model or its resources are useful for your research, please consider citing our work and starring the repository.
