Evaluation on the VLM4D benchmark across proprietary and open-source VLMs. All scores are accuracy (%).
Organization | Model | Release | Ego-centric | Exo-centric | Real Avg | Directional | FP | Synth Avg | Overall |
---|---|---|---|---|---|---|---|---|---|
User Study | Human Performance | - | 99.6 | 99.7 | 99.7 | 91.8 | 100.0 | 95.9 | 98.3 |
Random | Random Selection | - | 24.4 | 23.2 | 23.6 | 25.5 | 24.7 | 25.1 | 24.2 |
Google | Gemini-2.5-Pro | 2025-03 | 68.2 | 70.5 | 69.7 | 71.3 | 75.0 | 71.6 | 70.2 |
Anthropic | Claude-3.7-Sonnet | 2025-02 | 51.2 | 65.0 | 60.5 | 45.3 | 93.3 | 50.1 | 57.9 |
OpenAI | GPT-4o | 2024-11 | 54.3 | 61.2 | 58.9 | 47.8 | 47.5 | 47.7 | 56.2 |
Meta | Llama-4-Maverick-17B | 2025-04 | 48.8 | 52.9 | 51.6 | 56.0 | 50.0 | 55.5 | 52.5 |
Alibaba | Qwen2.5-VL-72B-AWQ | 2025-01 | 49.9 | 48.7 | 49.1 | 54.3 | 75.0 | 56.1 | 50.8 |
OpenGVLab | InternVideo2.5-8B | 2025-01 | 52.8 | 50.1 | 51.0 | 45.3 | 32.5 | 44.1 | 49.3 |
Meta | Llama-4-Scout-17B | 2025-04 | 46.6 | 51.3 | 49.7 | 46.8 | 45.0 | 46.6 | 49.0 |
xAI | Grok-2-Vision | 2024-12 | 44.1 | 48.8 | 47.3 | 49.0 | 75.0 | 51.4 | 48.3 |
Google | Gemini-2.0-Pro | 2025-02 | 44.8 | 50.5 | 48.7 | 42.8 | 52.5 | 43.6 | 47.4 |
Shanghai AI Lab | InternVL2.5-38B | 2024-11 | 42.8 | 53.2 | 49.7 | 37.5 | 62.5 | 39.8 | 47.3 |
Alibaba | Qwen2-VL-72B-AWQ | 2024-09 | 43.0 | 46.2 | 45.2 | 43.8 | 75.0 | 46.6 | 45.5 |
DAMO | VideoLLama3-7B | 2025-01 | 47.4 | 45.0 | 45.8 | 39.5 | 60.0 | 41.4 | 44.7 |
Alibaba | Qwen2.5-VL-7B | 2025-01 | 42.3 | 45.0 | 44.1 | 39.3 | 55.0 | 40.7 | 43.3 |
DAMO | VideoLLama3-2B | 2025-01 | 48.6 | 43.7 | 45.3 | 29.0 | 60.0 | 31.8 | 42.2 |
Rhymes | Aria | 2024-11 | 42.3 | 44.0 | 43.5 | 35.3 | 57.5 | 37.3 | 42.0 |
Shanghai AI Lab | InternVL2.5-8B | 2024-11 | 40.8 | 41.1 | 41.0 | 40.8 | 55.0 | 42.1 | 41.3 |
Meta | Llama-3.2-90B-Vision | 2024-09 | 37.4 | 42.4 | 40.8 | 28.0 | 85.0 | 33.2 | 38.9 |
Alibaba | Qwen2-VL-7B | 2024-08 | 36.1 | 38.2 | 37.5 | 38.5 | 37.5 | 38.4 | 37.7 |
OpenGVLab | InternVideo2-8B | 2024-08 | 37.2 | 37.9 | 37.6 | 40.5 | 0.0 | 36.8 | 37.4 |
Meta | Llama-3.2-11B-Vision | 2024-09 | 35.2 | 36.1 | 35.8 | 38.3 | 62.5 | 40.5 | 36.9 |
Shanghai AI Lab | InternVL2-8B | 2024-06 | 33.2 | 38.2 | 36.5 | 34.8 | 72.5 | 38.2 | 36.9 |
DAMO | VideoLLama2.1-7B | 2024-10 | 43.0 | 36.0 | 38.2 | 31.5 | 40.0 | 32.3 | 36.8 |
Microsoft | Phi-4-Multimodal | 2025-03 | 39.9 | 36.0 | 37.3 | 34.8 | 2.5 | 31.8 | 36.0 |
Microsoft | Phi-3.5-Vision | 2024-07 | 36.3 | 39.1 | 38.2 | 26.5 | 42.5 | 28.0 | 35.7 |
HuggingFaceM4 | Idefics3-8B | 2024-08 | 34.3 | 36.2 | 35.6 | 33.5 | 42.5 | 34.3 | 35.3 |
LLaVA | LLaVA-NeXT-Video-34B | 2024-06 | 37.2 | 34.9 | 35.7 | 31.5 | 60.0 | 34.1 | 35.3 |
Mistral AI | Pixtral-12B | 2024-09 | 36.3 | 32.9 | 34.0 | 41.0 | 17.5 | 38.9 | 35.2 |
DeepSeek | DeepSeek-VL2-Tiny | 2024-12 | 31.4 | 32.5 | 32.2 | 42.8 | 15.0 | 40.2 | 34.1 |
LLaVA | LLaVA-One-Vision-7B | 2024-09 | 32.5 | 33.1 | 32.9 | 32.8 | 45.0 | 33.9 | 33.1 |
H2O | H2OVL-Mississippi-2B | 2024-10 | 37.0 | 33.3 | 34.5 | 27.3 | 27.5 | 27.3 | 32.7 |
LLaVA | LLaVA-NeXT-Video-7B | 2024-06 | 30.3 | 30.9 | 30.7 | 24.5 | 25.0 | 24.6 | 29.2 |
DAMO | VideoLLama2-7B | 2024-06 | 36.3 | 16.5 | 23.0 | 25.8 | 37.5 | 26.8 | 23.9 |
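
The aggregate columns (Real Avg, Synth Avg, Overall) are consistent with question-count-weighted means of the per-category accuracies rather than simple means. Below is a minimal sketch of that computation, using the GPT-4o row as a worked example; the per-category question counts (`real_counts`, `synth_counts`) are hypothetical placeholders chosen to reproduce that row, since the actual counts are not listed in this table.

```python
# Sketch: reproducing the aggregate columns as question-count-weighted
# means of per-category accuracy. The counts below are HYPOTHETICAL
# (chosen so the GPT-4o row works out); the true counts are not listed
# in this table.

def weighted_avg(scores: dict[str, float], counts: dict[str, int]) -> float:
    """Count-weighted mean accuracy over a set of categories."""
    total = sum(counts.values())
    return sum(scores[k] * counts[k] for k in scores) / total

# GPT-4o row as the worked example.
real = {"ego": 54.3, "exo": 61.2}          # Ego-centric, Exo-centric
synth = {"directional": 47.8, "fp": 47.5}  # Directional, FP

real_counts = {"ego": 450, "exo": 900}     # assumption: exo ~2x ego questions
synth_counts = {"directional": 400, "fp": 40}

real_avg = weighted_avg(real, real_counts)      # 58.9, matches the table
synth_avg = weighted_avg(synth, synth_counts)   # ~47.8 (table: 47.7)

n_real, n_synth = sum(real_counts.values()), sum(synth_counts.values())
overall = (real_avg * n_real + synth_avg * n_synth) / (n_real + n_synth)
print(f"{real_avg:.1f} {synth_avg:.1f} {overall:.1f}")  # 58.9 47.8 56.2
```

Weighting by question counts is equivalent to computing accuracy over the pooled set of questions, which would explain why Synth Avg tracks the Directional column much more closely than the FP column.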
Results for more models are coming soon.