Evaluation on VLM4D Benchmark across various proprietary and open-source VLMs.
Organization | Model | Release | Ego-centric | Exo-centric | Real Avg | Directional | FP | Synth Avg | Overall |
---|---|---|---|---|---|---|---|---|---|
User Study | Human Performance | - | 99.6 | 99.7 | 99.7 | 91.8 | 100.0 | 95.9 | 98.3 |
Random | Random Selection | - | 24.4 | 23.2 | 23.6 | 25.5 | 24.7 | 25.1 | 24.2 |
OpenAI | GPT-4o | 2024-11 | 54.3 | 61.2 | 58.9 | 47.8 | 43.0 | 45.4 | 53.9 |
Google | Gemini 2.0 Pro | 2025-2 | 44.8 | 50.5 | 48.7 | 42.8 | 41.8 | 42.3 | 46.3 |
xAI | Grok-2-Vision | 2024-12 | 44.1 | 48.8 | 47.3 | 49.0 | 60.5 | 54.8 | 50.0 |
Meta | Llama-3.2-11B-Vision | 2024-9 | 35.2 | 36.1 | 35.8 | 38.3 | 55.8 | 47.0 | 39.9 |
Microsoft | Phi-3.5-Vision | 2024-7 | 36.3 | 39.1 | 38.2 | 26.5 | 37.5 | 32.0 | 35.9 |
DeepSeek | DeepSeek-VL2-Tiny | 2024-12 | 31.4 | 32.5 | 32.2 | 42.8 | 25.5 | 34.1 | 32.9 |
Shanghai AI Lab | InternVL2.5-38B | 2024-11 | 42.8 | 53.2 | 49.7 | 37.5 | 55.5 | 46.5 | 48.6 |
Shanghai AI Lab | InternVL2.5-8B | 2024-11 | 40.8 | 41.1 | 41.0 | 40.8 | 47.0 | 43.9 | 42.1 |
Shanghai AI Lab | InternVL2-8B | 2024-6 | 33.2 | 38.2 | 36.5 | 34.8 | 58.0 | 46.4 | 40.2 |
Mistral AI | Pixtral-12B | 2024-9 | 36.3 | 32.9 | 34.0 | 41.0 | 17.3 | 29.1 | 32.2 |
Rhymes | Aria | 2024-11 | 42.3 | 44.0 | 43.5 | 35.3 | 56.3 | 45.8 | 44.3 |
HuggingFaceM4 | Idefics3-8B | 2024-8 | 34.3 | 36.2 | 35.6 | 33.5 | 47.3 | 40.4 | 37.4 |
H2O | H2OVL-Mississippi-2B | 2024-10 | 37.0 | 33.3 | 34.5 | 27.3 | 41.0 | 34.1 | 34.4 |
Alibaba | Qwen2.5-VL-7B | 2025-1 | 42.3 | 45.0 | 44.1 | 39.3 | 48.5 | 43.9 | 44.0 |
Alibaba | Qwen2.5-VL-72B-AWQ | 2025-1 | 49.9 | 48.7 | 49.1 | 54.3 | 72.8 | 63.5 | 54.4 |
Alibaba | Qwen2-VL-7B | 2024-8 | 36.1 | 38.2 | 37.5 | 38.5 | 40.3 | 39.4 | 38.2 |
Alibaba | Qwen2-VL-72B-AWQ | 2024-9 | 43.0 | 46.2 | 45.2 | 43.8 | 71.0 | 57.4 | 49.7 |
DAMO | VideoLLaMA3-2B | 2025-1 | 48.6 | 43.7 | 45.3 | 29.0 | 69.8 | 49.4 | 46.8 |
More models are coming.
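As a quick way to read the table, the sketch below ranks a handful of the strongest entries by their Overall score. The row tuples are copied by hand from the table above; this is not part of the benchmark tooling, just an illustrative sort.

```python
# Rank a few models from the table by Overall score (values copied
# from the Overall column above; tuples are (organization, model, overall)).
rows = [
    ("OpenAI", "GPT-4o", 53.9),
    ("Alibaba", "Qwen2.5-VL-72B-AWQ", 54.4),
    ("xAI", "Grok-2-Vision", 50.0),
    ("Alibaba", "Qwen2-VL-72B-AWQ", 49.7),
]

# Sort descending by the Overall score (third tuple field).
ranked = sorted(rows, key=lambda r: r[2], reverse=True)
for org, model, overall in ranked:
    print(f"{model} ({org}): {overall}")
```

On these numbers, Qwen2.5-VL-72B-AWQ (54.4) narrowly edges out GPT-4o (53.9) overall, while both remain far below human performance (98.3).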