Vec-QMDP: Vectorized POMDP Planning on CPUs for Real-Time Autonomous Driving

Abstract

Planning under uncertainty for real-world robotics tasks, such as autonomous driving, requires reasoning in enormous high-dimensional belief spaces, rendering the problem computationally intensive. While parallelization offers scalability, existing hybrid CPU-GPU solvers face critical bottlenecks due to host-device synchronization latency and branch divergence on SIMT architectures, limiting their utility for real-time planning and hindering real-robot deployment. We present Vec-QMDP, a CPU-native parallel planner that aligns POMDP search with the SIMD architecture of modern CPUs, achieving a 227× to 1073× speedup over state-of-the-art serial planners. Vec-QMDP adopts a Data-Oriented Design (DOD), refactoring scattered, pointer-based data structures into contiguous, cache-efficient memory layouts. We further introduce a hierarchical parallelism scheme that distributes sub-trees across independent CPU cores and SIMD lanes, enabling fully vectorized tree expansion and collision checking. Efficiency is further improved by UCB load balancing across trees and a vectorized STR-tree for coarse-level collision checking. Evaluated on large-scale autonomous driving benchmarks, Vec-QMDP achieves state-of-the-art planning performance with millisecond-level latency, establishing CPUs as a high-performance computing platform for large-scale planning under uncertainty.

Highlights

First CPU-Native Vectorized POMDP Planner that fully exploits CPU parallelism, combining multi-core execution with SIMD vectorization and eliminating GPU synchronization overhead.
Global & Local Vectorization batches transition dynamics across scenarios (global) and parallelizes multi-agent collision checking (local), enabling efficient forward simulation in complex environments.
Data-Oriented Design (DOD) transforms pointer-based structures into contiguous memory layouts (SoA), unlocking high SIMD utilization and cache efficiency.
Load-Balancing UCB aligns expansion depths across scenario trees to minimize SIMD divergence and maximize hardware utilization.
Vectorized Trajectory Optimization refines robust driving trajectories under scenario uncertainty by employing importance sampling and block-diagonal cross-scenario evaluation.
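
The load-balancing UCB highlight above can be illustrated with a small sketch. The page does not give the scoring rule, so everything here is an assumption: the function names, the `lam` weight, and the choice of a linear depth penalty are hypothetical and only show the general idea of steering selection toward a depth shared by the trees in a SIMD batch.

```python
import math


def ucb1(mean_value, visits, parent_visits, c=1.0):
    """Standard UCB1 score; unvisited children are always tried first."""
    if visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def depth_balanced_ucb(mean_value, visits, parent_visits,
                       depth, target_depth, c=1.0, lam=0.1):
    """UCB1 minus a penalty for straying from the batch's target depth.

    ASSUMPTION: the linear penalty lam * |depth - target_depth| is a
    hypothetical stand-in, not the rule used by Vec-QMDP.
    """
    base = ucb1(mean_value, visits, parent_visits, c)
    return base - lam * abs(depth - target_depth)


def select_child(children, parent_visits, target_depth):
    """Return the index of the child with the highest balanced score.

    children: list of dicts with keys 'value', 'visits', 'depth'.
    """
    scores = [
        depth_balanced_ucb(ch["value"], ch["visits"], parent_visits,
                           ch["depth"], target_depth)
        for ch in children
    ]
    return max(range(len(children)), key=scores.__getitem__)


# Two children with identical statistics: the one whose subtree sits at
# the target depth wins, which keeps lanes in a batch in lockstep.
children = [
    {"value": 1.0, "visits": 10, "depth": 8},
    {"value": 1.0, "visits": 10, "depth": 4},
]
chosen = select_child(children, parent_visits=20, target_depth=4)
```

With equal value estimates and visit counts, the penalty alone decides the choice, so `chosen` is the second child. Under this assumed rule, a larger `lam` trades search quality for more uniform depths.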

⚡ Massive Speedup
vs. state-of-the-art POMDP planner
227× - 1073×
tree construction throughput, scaling with traffic density

⏱️ Millisecond Planning
real-time decision making
9ms / 14ms
match / peak performance under tight latency budgets

🛠️ CPU Parallelism
no GPU required
Multi-Core × SIMD
hierarchical parallelization with global & local vectorization

🏆 nuPlan SOTA
closed-loop driving performance
94.36 / 93.22
Val14 NR / R, outperforming learning and hybrid methods

Overview

Vec-QMDP scales up a state-of-the-art POMDP planner, Hi-Drive, for autonomous driving by leveraging SIMD parallelism, demonstrating how belief tree search and belief-space trajectory optimization can be extensively vectorized for robotics tasks in complex dynamic environments. (a) Sample the belief into M × N scenarios in a Structure-of-Arrays (SoA) layout. (b) Vectorized QMDP search: after the first action, scenario trees run in parallel on M CPU threads; within each thread, SIMD global vectorization batches transition dynamics across scenarios and SIMD local vectorization accelerates within-node collision checks. (c) Vectorized trajectory optimization: generate candidates and use block-diagonal cross-scenario evaluation within minibatches to select the optimal trajectory.
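
Steps (a) and (b) can be pictured with a minimal sketch of an SoA scenario batch. It uses the standard-library `array` module so that each field is one contiguous column, and it loops over columns where a real implementation would use SIMD instructions. The class and function names, the constant-velocity transition, and the axis-aligned box test are illustrative assumptions; the box test is only a stand-in for the vectorized STR-tree mentioned in the abstract.

```python
from array import array


class ScenarioBatch:
    """Structure-of-Arrays: one contiguous column per field,
    one slot per sampled scenario (hypothetical layout)."""

    def __init__(self, xs, ys, vxs, vys):
        self.x = array("d", xs)
        self.y = array("d", ys)
        self.vx = array("d", vxs)
        self.vy = array("d", vys)

    def __len__(self):
        return len(self.x)


def step(batch, dt):
    """Advance every scenario by dt, one pass per column.

    ASSUMPTION: constant-velocity dynamics, chosen only for brevity.
    """
    batch.x = array("d", (x + vx * dt for x, vx in zip(batch.x, batch.vx)))
    batch.y = array("d", (y + vy * dt for y, vy in zip(batch.y, batch.vy)))


def coarse_collisions(batch, ego_x, ego_y, half_extent):
    """Axis-aligned box overlap test per scenario; returns a 0/1 mask."""
    return array("b", (
        1 if abs(x - ego_x) <= half_extent and abs(y - ego_y) <= half_extent
        else 0
        for x, y in zip(batch.x, batch.y)
    ))


# Two scenarios for the same agent: one approaching from the left,
# one approaching from the right.
batch = ScenarioBatch(xs=[0.0, 10.0], ys=[0.0, 0.0],
                      vxs=[1.0, -1.0], vys=[0.0, 0.0])
step(batch, dt=2.0)
mask = coarse_collisions(batch, ego_x=2.0, ego_y=0.0, half_extent=1.0)
```

Because each field is stored as its own column, the same arithmetic is applied to adjacent memory for all scenarios. This is the access pattern that lets a compiler or hand-written intrinsics process several scenarios per instruction.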

Comparison Results

Driving Performance Comparison on nuPlan

Type	Planner	Val14 R	Val14 NR	Test14-random R	Test14-random NR	Test14-hard R	Test14-hard NR	Inference / Planning Time (ms) ↓
Expert	Log-replay	80.32	93.53	75.86	94.03	68.80	85.96	-
Learning-based	PLUTO	78.11	88.89	78.62	89.90	59.74	70.03	-
Learning-based	Diffusion Planner	82.80	89.87	82.93	89.19	69.22	75.99	80
Hybrid	PDM-Hybrid	92.11	92.77	91.28	90.10	76.07	65.99	171
Hybrid	PLUTO w/ refine.	76.88	92.88	90.29	92.23	76.88	80.08	-
Hybrid	Diff. Planner w/ refine.	92.90	94.26	91.75	94.80	82.00	78.87	>80
Model-based	HiDrive	93.15	93.62	92.31	93.71	83.18	81.41	92
Model-based	VecQMDP (match, Ours)	93.15 ± 0.11	94.16 ± 0.03	92.51 ± 0.00	95.21 ± 0.00	84.23 ± 0.35	82.30 ± 0.48	9
Model-based	VecQMDP (best, Ours)	93.22 ± 0.06	94.36 ± 0.02	93.04 ± 0.05	95.21 ± 0.00	84.23 ± 0.35	82.84 ± 0.11	14

Comparison of VecQMDP with state-of-the-art methods on the nuPlan dataset.
Bold indicates best; underscored indicates second-best. Values show mean ± standard error. NR: non-reactive mode. R: reactive mode.
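
As a rough reading of the timing column, the sketch below computes end-to-end latency ratios and the replanning rates they would allow. It assumes the reported times are directly comparable across planners, which the table implies but does not state. These ratios describe whole planning cycles and are a different quantity from the 227× to 1073× tree-construction throughput speedup.

```python
# Planning times in milliseconds, copied from the table above.
HIDRIVE_MS = 92.0
VEC_QMDP_MATCH_MS = 9.0
VEC_QMDP_BEST_MS = 14.0


def latency_ratio(baseline_ms, ours_ms):
    """How many times shorter one planning cycle is than the baseline's."""
    return baseline_ms / ours_ms


def max_replan_rate_hz(planning_ms):
    """Upper bound on replanning frequency if planning is the only cost."""
    return 1000.0 / planning_ms


print(f"{latency_ratio(HIDRIVE_MS, VEC_QMDP_MATCH_MS):.1f}x")  # 10.2x
print(f"{latency_ratio(HIDRIVE_MS, VEC_QMDP_BEST_MS):.1f}x")   # 6.6x
print(f"{max_replan_rate_hz(VEC_QMDP_MATCH_MS):.1f} Hz")       # 111.1 Hz
print(f"{max_replan_rate_hz(VEC_QMDP_BEST_MS):.1f} Hz")        # 71.4 Hz
```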

Computational Throughput Comparison

[Figure: (a) Throughput (edges/ms) vs. traffic density; (b) Speedup over HiDrive]

Tree construction throughput. (Left) Edges/ms vs. traffic density. (Right) Speedup over serial HiDrive (227×-1073×), increasing with density.

Qualitative Results

BibTeX


@article{jin2026vec,
  title={Vec-QMDP: Vectorized POMDP Planning on CPUs for Real-Time Autonomous Driving},
  author={Jin, Xuanjin and Dong, Yanxin and Sun, Bin and Xu, Huan and Hao, Zhihui and Lang, XianPeng and Cai, Panpan},
  journal={arXiv preprint arXiv:2602.08334},
  year={2026},
  eprint={2602.08334},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.08334}
}