lmms-lab/LLaVA-OneVision-1.5-4B-Instruct
Image-Text-to-Text • 5B • Updated • 2.53k • 18
Feeling and building the multimodal intelligence.
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling