- Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
Paper • 2401.09048 • Published • 10
- Improving fine-grained understanding in image-text pre-training
Paper • 2401.09865 • Published • 18
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Paper • 2401.10891 • Published • 62
- Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Paper • 2401.13627 • Published • 78
Collections
Collections including paper arxiv:2504.05979
- Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper • 2504.01990 • Published • 305
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 308
- What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Paper • 2503.24235 • Published • 55
- Seedream 3.0 Technical Report
Paper • 2504.11346 • Published • 70
- An Empirical Study of GPT-4o Image Generation Capabilities
Paper • 2504.05979 • Published • 64
- Antidistillation Sampling
Paper • 2504.13146 • Published • 59
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Paper • 2504.13169 • Published • 39
- WORLDMEM: Long-term Consistent World Simulation with Memory
Paper • 2504.12369 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 30
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 31
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 30
- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 20
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
Paper • 2508.13948 • Published • 48
- ReZero: Enhancing LLM search ability by trying one-more-time
Paper • 2504.11001 • Published • 16
- FonTS: Text Rendering with Typography and Style Controls
Paper • 2412.00136 • Published • 1
- GenEx: Generating an Explorable World
Paper • 2412.09624 • Published • 98
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Paper • 2412.13663 • Published • 163
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121
- DreamO: A Unified Framework for Image Customization
Paper • 2504.16915 • Published • 24
- An Empirical Study of GPT-4o Image Generation Capabilities
Paper • 2504.05979 • Published • 64