AI & ML interests
Feeling and building multimodal intelligence.
Papers
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
- mvp-lab/LLaVA-OneVision-1.5-Instruct-Data
  Viewer • Updated • 21.9M • 92.9k • 75
- mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M
  Viewer • Updated • 91.5M • 147k • 71
- lmms-lab/LLaVA-OneVision-1.5-8B-Instruct
  Image-Text-to-Text • 9B • Updated • 74.9k • 64
- lmms-lab/LLaVA-OneVision-1.5-4B-Instruct
  Image-Text-to-Text • 5B • Updated • 3.07k • 18
MMSearch-R1 is a solution designed to train LMMs to perform on-demand multimodal search in real-world environments.
CVPR 2025 - EgoLife: Towards Egocentric Life Assistant. Homepage: https://egolife-ai.github.io/
A collection of sparse autoencoders (SAEs) hooked onto LLaVA.
Models focused on video understanding (previously known as LLaVA-NeXT-Video).
Dataset Collection of LMMs-Eval
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
  Paper • 2407.07895 • Published • 42
- lmms-lab/llava-next-interleave-qwen-7b
  Text Generation • 8B • Updated • 159 • 27
- lmms-lab/llava-next-interleave-qwen-7b-dpo
  Text Generation • 8B • Updated • 39 • 12
- lmms-lab/M4-Instruct-Data
  Updated • 1.28k • 79
A Lite version of the dataset to accelerate holistic evaluation during model development.
HEVC-Style Vision Transformer
- OpenMMReasoner/OpenMMReasoner-ColdStart
  Image-Text-to-Text • 8B • Updated • 15 • 3
- OpenMMReasoner/OpenMMReasoner-RL
  Image-Text-to-Text • 8B • Updated • 34 • 17
- OpenMMReasoner/OpenMMReasoner-SFT-874K
  Viewer • Updated • 874k • 462 • 8
- OpenMMReasoner/OpenMMReasoner-RL-74K
  Viewer • Updated • 74.7k • 321 • 10
A general evaluator for assessing model performance.
A model that handles arbitrary types of visual input.
Long Context Transfer From Text To Vision: https://lmms-lab.github.io/posts/longva/
Some powerful image models.