EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
Abstract
A data-free framework aligns video generative model outputs with vision-language model constraints for improved robotic manipulation, achieving higher success rates through constraint-guided selection and trajectory optimization.
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
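To make the two-stage pipeline concrete, here is a minimal Python sketch of how the inference-time procedure could look. It is not the authors' implementation: the interfaces (extract_constraints, retarget, the constraint callables) are hypothetical, and it assumes each VLM-generated constraint can be evaluated as a cost function over a 3D waypoint trajectory.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed interface: the VLM turns the instruction into a set of
# compositional constraints, each a callable mapping a trajectory
# (T x 3 waypoint array) to a scalar cost (lower = more plausible/safer).
def extract_constraints(vlm, instruction):
    """Hypothetical helper: ask the VLM for constraint cost functions."""
    return vlm.generate_constraint_functions(instruction)

# Stage 1: constraint-guided rollout selection.
# Retarget each VGM rollout video to a waypoint trajectory, score it
# under the constraint set, and keep the lowest-cost candidate.
def select_rollout(rollouts, constraints, retarget):
    trajs = [retarget(video) for video in rollouts]  # pixels -> 3D waypoints
    costs = [sum(c(traj) for c in constraints) for traj in trajs]
    return trajs[int(np.argmin(costs))]

# Stage 2: constraint-based trajectory optimization.
# Use the selected trajectory as initialization and refine the waypoints
# under the same constraints to correct retargeting errors; a smoothness
# term (an assumption here) keeps the refined path dynamically reasonable.
def refine_trajectory(init_traj, constraints, smooth_weight=0.1):
    shape = init_traj.shape

    def total_cost(flat):
        traj = flat.reshape(shape)
        cost = sum(c(traj) for c in constraints)
        cost += smooth_weight * np.sum(np.diff(traj, axis=0) ** 2)
        return cost

    result = minimize(total_cost, init_traj.ravel(), method="L-BFGS-B")
    return result.x.reshape(shape)
```

Under these assumptions, both stages reuse the same constraint set, so selection and refinement optimize a consistent notion of physical plausibility.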
Community
Excited to share our new paper: EmboAlign!
EmboAlign aligns video generation with compositional constraints for zero-shot robotic manipulation.
Our key idea:
VLMs provide structured spatial reasoning that complements VGMs. We use VLM-generated task constraints in two stages:
• Constraint-guided rollout selection
• Constraint-based trajectory optimization
On 6 real-robot manipulation tasks, EmboAlign improves success rate by 43.3 points over the strongest baseline, with no task-specific training data.
Paper: https://lnkd.in/gFvQg6He