DiffPlace: Place-Controllable Street View Generation

Abstract

Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks.

DiffPlace introduces a place-ID controller that maps place embeddings into a fixed CLIP space through linear projection, a perceiver transformer, and contrastive learning. This lets the model synthesize scenes with consistent background buildings while flexibly changing foreground objects and weather.

Experiments show that DiffPlace improves both generation quality and augmented training support for visual place recognition, making scene-level controllable synthesis more useful for autonomous driving.

Method

Place-ID controller inside a multi-view diffusion pipeline

Overview of the DiffPlace pipeline — The generator combines map, box, text, camera, and place-ID features to produce multi-view images with controllable background and foreground elements.

What changes relative to prior work

Uses a dedicated place-ID encoder instead of overloading text prompts.
Aligns place embeddings with CLIP space for stable conditioning.
Preserves background identity while still supporting object and weather edits.
Targets generation that is useful for downstream place recognition training.

Results

Better place controllability, enhanced training for place recognition

57.6 AR@1

On generated validation scenes, beating MagicDrive by 21.7 points.

75.4 AR@5

Best place recognition consistency among compared generation methods.

13.4 FID

Competitive realism while improving background-level controllability.

+5.5 MixVPR AR@1 gain

Augmented training improvement on Pitts30k-test over no synthetic data.

Qualitative realism and controllability comparison — DiffPlace generates backgrounds that are more recognizable and place-consistent than BEVGen, MagicDrive, and DualDiff.

Validation performance on generated images

Method	FID	AR@1	AR@5
BEVGen	25.6	31.2	60.8
BEVControl	24.8	-	-
MagicDrive	16.2	35.9	64.1
DualDiff	11.0	48.7	68.9
DiffPlace	13.4	57.6	75.4

Augmented training support on Pitts30k-test

Training data	MixVPR AR@1	MixVPR AR@5	CricaVPR AR@1	CricaVPR AR@5
No synthetic data	83.5	90.3	90.9	96.0
MagicDrive	84.2	91.1	90.3	95.7
DiffPlace	89.7	95.2	92.9	96.8

Place controllability under weather and object edits — DiffPlace keeps place identity stable even when weather and objects change.

Attention map extracted by CricaVPR — After augmented training, attention shifts toward buildings and stable place cues.

Citation

BibTeX

Reference

Li, J., Li, Z., Li, S., Yu, Z., Wang, B., and Liu, H. DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition. arXiv:2602.11875, 2026.

Open ArXiv Open Code

BibTeX

@article{li2026diffplace,
  title={DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition},
  author={Li, Ji and Li, Zhiwei and Li, Shihao and Yu, Zhenjiang and Wang, Boyang and Liu, Haiou},
  journal={arXiv preprint arXiv:2602.11875},
  year={2026},
  url={https://arxiv.org/abs/2602.11875}
}