PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published
• 72
Vision language models are blind
Paper
• 2407.06581
• Published
• 85
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper
• 2404.16994
• Published
• 37
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
• 2403.05525
• Published
• 49
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper
• 2405.04434
• Published
• 25
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper
• 2404.19752
• Published
• 24
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published
• 30
Sigmoid Loss for Language Image Pre-Training
Paper
• 2303.15343
• Published
• 11
CogVLM: Visual Expert for Pretrained Language Models
Paper
• 2311.03079
• Published
• 27
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Paper
• 2401.16420
• Published
• 55
What matters when building vision-language models?
Paper
• 2405.02246
• Published
• 103
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published
• 47
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158