VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

UCLA · Microsoft · UCSC · USC
*Denotes Equal Contribution

Abstract

Vision-language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts—abilities essential for robust real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.


Figure 1. Spatiotemporal (4D) Awareness. Humans intuitively reason in 4D (3D space + time), effortlessly reconstructing the dynamic spatial trajectory of moving objects from any perspective. In contrast, current Vision Language Models (VLMs) typically rely on aggregating 2D visual features across time, leading to incorrect predictions when motion understanding and interpretation require deeper spatiotemporal reasoning. In this example, humans correctly perceive the car moving to the right, while the VLM (GPT-4o) inaccurately predicts leftward movement, suggesting that VLMs struggle to perform spatiotemporal reasoning.


Figure 2. Distribution of Dataset Sources and Annotations. Breakdown of our dataset illustrating the proportions of data sourced from third-person (Davis, YouTube), first-person (Ego4D), and synthetic data, categorized by annotation types: translational, rotational, action, counting, and false positives.

Dataset Generation and Annotation

Figure 3. Dataset Generation and Annotation Pipeline. Our dataset was constructed by collecting real videos and generating synthetic data, followed by human-in-the-loop quality reviews to address ambiguous videos and annotations. After temporal alignment and quality assurance, human-annotated questions and answers were created, complemented by multiple-choice questions generated by large language models (LLMs). The final dataset includes real-world and synthetic video data with comprehensive VLM scoring metrics.
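As a concrete illustration of the last pipeline step, the sketch below turns a human-annotated question-answer pair into a shuffled multiple-choice item. This is not the paper's actual pipeline code: the item format, option letters, and helper name are assumptions, and in the real pipeline the distractors are generated by an LLM rather than supplied by hand.

```python
import random

def make_mcq(question, answer, distractors, seed=0):
    """Build a multiple-choice item (hypothetical format, for illustration).

    Returns the formatted question text and the letter of the correct option.
    """
    options = [answer] + list(distractors)
    rng = random.Random(seed)      # fixed seed so the option order is reproducible
    rng.shuffle(options)
    letters = "ABCD"
    correct = letters[options.index(answer)]
    lines = [question] + [f"{l}. {o}" for l, o in zip(letters, options)]
    return "\n".join(lines), correct

# Example inspired by Figure 1's car-direction question.
item, key = make_mcq(
    "Which direction is the car moving?",
    "Right",
    ["Left", "Toward the camera", "Away from the camera"],
)
print(item)
print("Correct:", key)
```

Shuffling with a seeded RNG keeps the answer position unpredictable across items while making the dataset build reproducible.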

Figure 4. Qualitative Examples of Dataset Annotations. (Top) A third-person video with a translational annotation ("camel turning left from its perspective"). (Middle) A first-person video with a rotational question ("clockwise rotation of ladle"). (Bottom) A synthetic scene with an action-recognition annotation ("robotic dog moving left").

Model Performance

Figure 5. (Left) Accuracy of top-performing VLMs across real-scene question categories. (Right) Comparison of CoT and DO accuracy across models.

Leaderboard

Evaluation of various proprietary and open-source VLMs on the VLM4D benchmark (accuracy, %).

| Organization | Model | Release | Ego-centric | Exo-centric | Real Avg | Directional | FP | Synth Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|
| User Study | Human Performance | – | 99.6 | 99.7 | 99.7 | 91.8 | 100.0 | 95.9 | 98.3 |
| Random | Random Selection | – | 24.4 | 23.2 | 23.6 | 25.5 | 24.7 | 25.1 | 24.2 |
| OpenAI | GPT-4o | 2024-11 | 54.3 | 61.2 | 58.9 | 47.8 | 43.0 | 45.4 | 53.9 |
| Google | Gemini 2.0 Pro | 2025-2 | 44.8 | 50.5 | 48.7 | 42.8 | 41.8 | 42.3 | 46.3 |
| xAI | Grok-2-Vision | 2024-12 | 44.1 | 48.8 | 47.3 | 49.0 | 60.5 | 54.8 | 50.0 |
| Meta | Llama-3.2-11B-Vision | 2024-9 | 35.2 | 36.1 | 35.8 | 38.3 | 55.8 | 47.0 | 39.9 |
| Microsoft | Phi-3.5-Vision | 2024-7 | 36.3 | 39.1 | 38.2 | 26.5 | 37.5 | 32.0 | 35.9 |
| DeepSeek | DeepSeek-VL2-Tiny | 2024-12 | 31.4 | 32.5 | 32.2 | 42.8 | 25.5 | 34.1 | 32.9 |
| Shanghai AI Lab | InternVL2.5-38B | 2024-11 | 42.8 | 53.2 | 49.7 | 37.5 | 55.5 | 46.5 | 48.6 |
| Shanghai AI Lab | InternVL2.5-8B | 2024-11 | 40.8 | 41.1 | 41.0 | 40.8 | 47.0 | 43.9 | 42.1 |
| Shanghai AI Lab | InternVL2-8B | 2024-6 | 33.2 | 38.2 | 36.5 | 34.8 | 58.0 | 46.4 | 40.2 |
| Mistral AI | Pixtral-12B | 2024-9 | 36.3 | 32.9 | 34.0 | 41.0 | 17.3 | 29.1 | 32.2 |
| Rhymes | Aria | 2024-11 | 42.3 | 44.0 | 43.5 | 35.3 | 56.3 | 45.8 | 44.3 |
| HuggingFaceM4 | Idefics3-8B | 2024-8 | 34.3 | 36.2 | 35.6 | 33.5 | 47.3 | 40.4 | 37.4 |
| H2O | H2OVL-Mississippi-2B | 2024-10 | 37.0 | 33.3 | 34.5 | 27.3 | 41.0 | 34.1 | 34.4 |
| Alibaba | Qwen2.5-VL-7B | 2025-1 | 42.3 | 45.0 | 44.1 | 39.3 | 48.5 | 43.9 | 44.0 |
| Alibaba | Qwen2.5-VL-72B-AWQ | 2025-1 | 49.9 | 48.7 | 49.1 | 54.3 | 72.8 | 63.5 | 54.4 |
| Alibaba | Qwen2-VL-7B | 2024-8 | 36.1 | 38.2 | 37.5 | 38.5 | 40.3 | 39.4 | 38.2 |
| Alibaba | Qwen2-VL-72B-AWQ | 2024-9 | 43.0 | 46.2 | 45.2 | 43.8 | 71.0 | 57.4 | 49.7 |
| DAMO | VideoLLama3-2B | 2025-1 | 48.6 | 43.7 | 45.3 | 29.0 | 69.8 | 49.4 | 46.8 |

More models are coming.
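For readers who want to reproduce per-category numbers like those in the leaderboard, here is a minimal sketch of multiple-choice accuracy scoring. This is not the benchmark's official evaluation code: the example record format and the category labels used below are illustrative assumptions.

```python
from collections import defaultdict

def score_predictions(examples):
    """Per-category accuracy (%) for multiple-choice answers.

    examples: list of dicts with 'category', 'answer', and 'prediction'
    (assumed format; letter choices are compared case-insensitively).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if ex["prediction"].strip().upper() == ex["answer"].strip().upper():
            correct[ex["category"]] += 1
    return {cat: 100.0 * correct[cat] / total[cat] for cat in total}

# Toy run with three hand-made records (categories mirror the leaderboard).
examples = [
    {"category": "ego-centric", "answer": "B", "prediction": "b"},
    {"category": "ego-centric", "answer": "C", "prediction": "A"},
    {"category": "directional", "answer": "D", "prediction": "D"},
]
print(score_predictions(examples))
# {'ego-centric': 50.0, 'directional': 100.0}
```

Averages such as "Real Avg" or "Overall" would then be computed over the pooled examples of the relevant categories rather than by averaging the per-category percentages, since categories can differ in size.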

BibTeX

@article{zhou2025vlm4d,
    title={VLM4D: Towards Spatiotemporal Awareness in Vision Language Models},
    author={Zhou, Shijie and Vilesov, Alexander and He, Xuehai and Wan, Ziyu and Zhang, Shuwang and Nagachandra, Aditya and Chang, Di and Chen, Dongdong and Wang, Eric Xin and Kadambi, Achuta},
    year={2025}
}