Model Evaluation
• Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon (arXiv:2502.07445)
• ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (arXiv:2502.04689)
• Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (arXiv:2502.03032)
• Preference Leakage: A Contamination Problem in LLM-as-a-judge (arXiv:2502.01534)
• SliderSpace: Decomposing the Visual Capabilities of Diffusion Models (arXiv:2502.01639)
• MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (arXiv:2502.09621)
• Logical Reasoning in Large Language Models: A Survey (arXiv:2502.09100)
• IHEval: Evaluating Language Models on Following the Instruction Hierarchy (arXiv:2502.08745)
• InductionBench: LLMs Fail in the Simplest Complexity Class (arXiv:2502.15823)
• AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM (arXiv:2503.04504)
• Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders (arXiv:2503.03601)
• Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence (arXiv:2503.05037)
• SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? (arXiv:2503.12349)
• Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey (arXiv:2503.12605)
• VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity (arXiv:2503.11557)
• CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era (arXiv:2503.12329)
• Where do Large Vision-Language Models Look at when Answering Questions? (arXiv:2503.13891)
• I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (arXiv:2503.18878)
• MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning (arXiv:2506.05523)
• JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent (arXiv:2506.17612)
• Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test (arXiv:2506.21551)
• Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study (arXiv:2506.19794)
• Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (arXiv:2506.21876)
• Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models (arXiv:2507.07484)
• Hidden in plain sight: VLMs overlook their visual representations (arXiv:2506.08008)
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in
Text-to-Video Models
Paper
• 2507.13428
• Published
• 16
• MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models (arXiv:2507.12806)
• Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding (arXiv:2507.15028)
• Pixels, Patterns, but No Poetry: To See The World like Humans (arXiv:2507.16863)
• AgroBench: Vision-Language Model Benchmark in Agriculture (arXiv:2507.20519)
• Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? (arXiv:2508.03644)
• PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts (arXiv:2508.09848)
• A Survey on Large Language Model Benchmarks (arXiv:2508.15361)
• DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks (arXiv:2509.01396)
• On Robustness and Reliability of Benchmark-Based Evaluation of LLMs (arXiv:2509.04013)
• MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval (arXiv:2510.09510)
• MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models (arXiv:2510.16641)
• Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark (arXiv:2510.26802)
• Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks (arXiv:2510.25760)
• UniREditBench: A Unified Reasoning-based Image Editing Benchmark (arXiv:2511.01295)
• SO-Bench: A Structural Output Evaluation of Multimodal LLMs (arXiv:2511.21750)
• RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence (arXiv:2512.02622)
• DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle (arXiv:2512.04324)
• Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (arXiv:2602.02185)