ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents Paper • 2507.22827 • Published Jul 30, 2025 • 100
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Paper • 2506.18898 • Published Jun 23, 2025 • 34
Multimodal Long Video Modeling Based on Temporal Dynamic Context Paper • 2504.10443 • Published Apr 14, 2025 • 3
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention Paper • 2303.16199 • Published Mar 28, 2023 • 4
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model Paper • 2304.15010 • Published Apr 28, 2023 • 4
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation Paper • 2502.16707 • Published Feb 23, 2025 • 13
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 25
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant Paper • 2410.13360 • Published Oct 17, 2024 • 9
OneLLM: One Framework to Align All Modalities with Language Paper • 2312.03700 • Published Dec 6, 2023 • 24
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models Paper • 2311.07575 • Published Nov 13, 2023 • 15
ImageBind-LLM: Multi-modality Instruction Tuning Paper • 2309.03905 • Published Sep 7, 2023 • 18