
👋 Welcome to the Robust and Safe Embodied Intelligence in Challenging Scenarios Workshop, organized at WACV 2026!

Workshop / Challenge Info:


📄 Paper

Reference Paper (arXiv): Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models


📌 Overview

The Reality Gap: Autonomous driving systems have achieved remarkable proficiency in structured environments such as urban centers and highways. However, the “last mile” of truly ubiquitous autonomy lies in the ability to navigate unstructured scenarios—the chaotic, unpredictable, and “corner case” environments where current models frequently falter.

The Challenge: To bridge this gap, this challenge leverages the Impromptu VLA Dataset, a large-scale collection of roughly 80,000 clips curated specifically to capture diverse and challenging unstructured scenarios. Participants are invited to develop Vision-Language-Action (VLA) models that can not only perceive and reason about complex environments but also generate safe and accurate planning trajectories.

Challenge Overview Figure

🎯 Task

The challenge is structured around a Planning-Oriented Question-Answering (Q&A) format. Models must process multi-view images to generate text-based reasoning and vector-based trajectory outputs.

1. Scene Understanding & Perception (Q&A)

2. Reasoning & Prediction (Q&A)

3. End-to-End Trajectory Planning
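
The three sub-tasks share a common I/O pattern: multi-view images in, text answers and/or waypoints out. As a rough illustration only (the concrete submission schema is defined by the challenge organizers, so all field names below are hypothetical), a model's per-sample output might be structured as:

```python
from dataclasses import dataclass, field

@dataclass
class SampleOutput:
    # Hypothetical structure; the official submission format may differ.
    perception_answer: str = ""   # free-text answer for the scene-understanding Q&A
    reasoning_answer: str = ""    # free-text answer for the reasoning/prediction Q&A
    trajectory: list = field(default_factory=list)  # list of (x, y) waypoints in meters

out = SampleOutput(
    perception_answer="A pedestrian is crossing from the right.",
    reasoning_answer="The ego vehicle should slow down and yield.",
    trajectory=[(0.0, 0.0), (1.2, 0.1), (2.3, 0.2)],
)
```

Keeping the text and trajectory outputs in one record per sample makes it easy to score the Q&A and planning metrics from the same file.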


⚙️ Evaluation

Evaluation is conducted on the Impromptu VLA Validation Set. Performance is assessed using two primary categories of metrics:

1. Perception & Reasoning Metrics (Accuracy): For the text-generation tasks, we compute exact-match accuracy against the ground truth (higher is better).

2. Action Metric (Trajectory L2 Error): For the end-to-end planning task, we measure the Euclidean distance between predicted waypoints and the ground truth (lower is better).
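
As a sketch of how these two metrics can be computed (the function names and preprocessing here are our own illustration, not the official evaluation script):

```python
import numpy as np

def exact_match_accuracy(predictions, references):
    """Fraction of text answers that match the ground truth exactly (after stripping whitespace)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def trajectory_l2_error(pred_waypoints, gt_waypoints):
    """Mean Euclidean (L2) distance between predicted and ground-truth waypoints.

    Both inputs are arrays of shape (T, 2) holding (x, y) positions.
    """
    pred = np.asarray(pred_waypoints, dtype=float)
    gt = np.asarray(gt_waypoints, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

acc = exact_match_accuracy(["slow down", "turn left"], ["slow down", "go straight"])  # 0.5
err = trajectory_l2_error([[0, 0], [1, 0]], [[0, 0], [1, 1]])                         # 0.5
```

The official evaluation may normalize text or average the L2 error over specific time horizons, so treat this as a local sanity check rather than the leaderboard scorer.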


📚 Recommended Readings & Citations

Participants are encouraged to read and cite the following work:

```bibtex
@article{chi2025impromptu,
  title={Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models},
  author={Chi, Haohan and Gao, Huan-ang and Liu, Ziming and Liu, Jianing and Liu, Chenyu and Li, Jinwei and Yang, Kaisen and Yu, Yangcheng and Wang, Zeda and Li, Wenyi and others},
  journal={arXiv preprint arXiv:2505.23757},
  year={2025}
}
```