OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence

Yu Zheng1 Jie Hu1 Kailun Yang1 Jiaming Zhang1,2,*
1Hunan University 2ETH Zurich *Corresponding Author
4D Occupancy World Model Tokenizer-free Reactive / Proactive Incremental / Online Persistence Benchmark

Overview

OccSTeP Overview

Figure 1. Left: Overview of the 4D Occupancy Spatio-Temporal Persistence (OccSTeP) pipeline. For the first time, four challenging driving scenarios {Reverse, Discontinuous, Fragmentary, Reductive} are introduced for benchmarking two tasks: (1) reactive forecasting, "what will happen next"; (2) proactive forecasting, "what would happen given a specific future action (e.g., turn left)". Right: Comparison results show that OccSTeP-WM achieves more robust performance.

Key Contributions

  • 4D Occupancy Spatio-Temporal Persistence (OccSTeP) Benchmark: A new task and benchmark with challenging adverse scenarios including reverse driving, discontinuous sequences, fragmentary observations, and semantic noise.
  • OccSTeP-WM: An efficient tokenizer-free world model with a spatio-temporal priors fusion module that addresses both reactive and proactive forecasting via recurrent state-space fusion.
  • State-of-the-Art Performance: OccSTeP-WM achieves 23.70% semantic mIoU (+6.56%) and 35.89% occupancy IoU (+9.26%), demonstrating the effectiveness of persistent occupancy forecasting.
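The reactive/proactive split can be illustrated with a minimal rollout loop. This is a hedged sketch, not the paper's implementation: `toy_transition` is a hypothetical stand-in for the learned world-model step, and the `"turn_left"` action string is an illustrative placeholder.

```python
import numpy as np

def rollout(state, transition, steps, action=None):
    """Reactive forecasting: action is None and the model extrapolates.
    Proactive forecasting: a given future ego-action conditions each step."""
    preds = []
    for _ in range(steps):
        state = transition(state, action)
        preds.append(state)
    return preds

def toy_transition(state, action):
    # Placeholder dynamics: shift the occupancy grid forward each step,
    # and sideways when the (hypothetical) action is "turn_left".
    state = np.roll(state, 1, axis=0)
    if action == "turn_left":
        state = np.roll(state, -1, axis=1)
    return state
```

In the real model the transition would be the recurrent state-space fusion module; the interface above only shows how one loop serves both tasks.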

Main Results

23.70% Semantic mIoU +6.56%
35.89% Occupancy IoU +9.26%
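The two reported metrics follow the standard voxel-level definitions: occupancy IoU treats the grid as binary (occupied vs. free), while semantic mIoU averages per-class IoU over semantic categories. A minimal NumPy sketch of these definitions (the `ignore_index` convention for empty space is an assumption, not taken from the paper):

```python
import numpy as np

def occupancy_iou(pred, gt):
    """Binary occupancy IoU between two boolean voxel grids."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def semantic_miou(pred, gt, num_classes, ignore_index=0):
    """Mean IoU over semantic classes, skipping the empty-space class
    and classes absent from both prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class appears in neither grid
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```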


Downloads

  • Code: coming soon
  • Dataset: coming soon
  • Pretrained Models: coming soon

Abstract

Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce 4D Occupancy Spatio-Temporal Persistence (OccSTeP), addressing two tasks: (1) reactive forecasting ("what will happen next") and (2) proactive forecasting ("what would happen given a specific future action"). We create a new OccSTeP benchmark with challenging scenarios including erroneous semantic labels and dropped frames. We further propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state with linear-complexity attention and recurrent state-space modules for long-range spatial dependencies and continual scene-memory updates. This design enables online inference and robustness to missing or noisy sensor input, achieving a semantic mIoU of 23.70% (+6.56%) and an occupancy IoU of 35.89% (+9.26%). Data and code will be open-sourced.
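The "continual scene memory" described above can be pictured as a per-voxel recurrent update: the stored state is blended with each new observation through a gate, and a dropped frame simply leaves the state untouched. The sketch below is a hypothetical toy illustration of that idea, not the paper's module; the gate parameters here are random stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SceneMemory:
    """Toy per-voxel recurrent state update: h <- (1-g)*h + g*x,
    where the gate g is computed from the current state and the
    incoming observation. Missing frames keep h unchanged."""

    def __init__(self, grid_shape, dim):
        self.h = np.zeros(grid_shape + (dim,))
        # Stand-in gate parameters (learned in a real world model).
        self.w = rng.normal(scale=0.1, size=(2 * dim,))
        self.b = 0.0

    def step(self, x=None):
        if x is None:  # dropped frame: persist the previous state
            return self.h
        g = sigmoid(np.concatenate([self.h, x], axis=-1) @ self.w + self.b)
        self.h = (1 - g[..., None]) * self.h + g[..., None] * x
        return self.h
```

Because the update is elementwise over voxels, its cost is linear in the grid size, matching the linear-complexity goal stated above.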

Method

OccSTeP-WM Framework

Figure 2. Architecture of OccSTeP-WM, our tokenizer-free world model with spatio-temporal priors fusion module for 4D occupancy prediction.

Citation

If you find our work useful, please cite:

@article{zheng2025occstep,
  title={OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence},
  author={Zheng, Yu and Hu, Jie and Yang, Kailun and Zhang, Jiaming},
  journal={arXiv preprint arXiv:2512.15621},
  year={2025}
}

Contact

For questions, please email yzheng@hnu.edu.cn or open an issue on GitHub.

Maintained by the OccSTeP authors.