Binyuan Huang1*†, Yuqing Wen2*†, Yucheng Zhao3*, Yaosi Hu4*, Yingfei Liu3 , Fan Jia3, Weixin Mao3, Tiancai Wang3‡, Chi Zhang5, Chang Wen Chen4, Zhenzhong Chen1, Xiangyu Zhang3
1Wuhan University 2University of Science and Technology of China 3MEGVII Technology 4The Hong Kong Polytechnic University 5Mach Drive
*Equal Contribution †This work was done during an internship at MEGVII ‡Corresponding author
Overview of the proposed SubjectDrive framework and its effectiveness in enhancing BEV perception tasks. (a) The traditional data generation framework, which uses a control sequence and sampling noise to generate synthetic data. (b) In contrast, SubjectDrive introduces additional synthesis diversity by incorporating extra subject control. (c)-(d) Evaluation of detection and tracking performance under data scaling. (e) Illustration of using the SubjectDrive framework to produce perception training data for autonomous driving.
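The difference between panels (a) and (b) amounts to one extra argument at the generator's conditioning interface. The sketch below is a minimal, hypothetical illustration of that interface; `generate_clip`, `model.sample`, and the tensor shapes are illustrative assumptions, not part of any released SubjectDrive code.

```python
# Hypothetical sketch of the conditioning interfaces in panels (a) and (b).
# Function and attribute names are illustrative; they do not come from released code.
import torch


def generate_clip(model, bev_layouts, noise, subject_image=None):
    """Sample a synthetic driving clip with layout control and optional subject control.

    bev_layouts   : (T, C_bev, H, W) BEV layout sequence (boxes, depths, road map, poses).
    noise         : (T, C_lat, h, w) sampling noise for the diffusion process.
    subject_image : optional (3, H, W) reference image of the subject to feature.
    """
    conditions = {"bev": bev_layouts}
    if subject_image is not None:
        # Panel (b): the extra subject condition injects external diversity.
        conditions["subject"] = subject_image
    return model.sample(noise, conditions)  # -> (T, 3, H, W) video frames


# Panel (a): clip = generate_clip(model, layouts, noise)
# Panel (b): clip = generate_clip(model, layouts, noise, subject_image=reference)
```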
Detailed design of SubjectDrive. (a) The diffusion training process of SubjectDrive, built on a diffusion encoder and decoder equipped with the decomposed 4D attention module. (b) The decomposed 4D attention module comprises three components: intra-view attention for spatial processing within individual views, cross-view attention for interaction with adjacent views, and cross-frame attention for temporal processing. (c) The controllable module for integrating diverse control signals. Image conditions are derived from a frozen VAE encoder and combined with the diffused noise; text prompts are processed by a frozen CLIP encoder, while BEV sequences are handled via ControlNet. (d) Details of the BEV layout sequences, including projected bounding boxes, object depths, road maps, and camera poses.
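The decomposed 4D attention in panel (b) factorizes full spatio-temporal attention into three cheaper passes over a (batch, frames, views, tokens, channels) tensor. The PyTorch sketch below is our own minimal interpretation of that decomposition; module names, residual connections, and the cross-view grouping are assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class Decomposed4DAttention(nn.Module):
    """Minimal sketch of the decomposed 4D attention in panel (b).

    x has shape (B, T, V, N, C): batch, frames, camera views, spatial tokens, channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, V, N, C = x.shape

        # 1) Intra-view attention: spatial tokens within each single view and frame.
        s = x.reshape(B * T * V, N, C)
        x = x + self.intra_view(s, s, s, need_weights=False)[0].reshape(B, T, V, N, C)

        # 2) Cross-view attention: each token attends across the V camera views at the
        #    same frame and spatial position (a simple stand-in for engaging adjacent views).
        v = x.permute(0, 1, 3, 2, 4).reshape(B * T * N, V, C)
        v = self.cross_view(v, v, v, need_weights=False)[0]
        x = x + v.reshape(B, T, N, V, C).permute(0, 1, 3, 2, 4)

        # 3) Cross-frame attention: temporal processing across the T frames.
        t = x.permute(0, 2, 3, 1, 4).reshape(B * V * N, T, C)
        t = self.cross_frame(t, t, t, need_weights=False)[0]
        x = x + t.reshape(B, V, N, T, C).permute(0, 3, 1, 2, 4)
        return x


# Toy usage: 1 clip, 8 frames, 6 views, 16 tokens per view, 64 channels.
if __name__ == "__main__":
    attn = Decomposed4DAttention(dim=64)
    out = attn(torch.randn(1, 8, 6, 16, 64))
    print(out.shape)  # torch.Size([1, 8, 6, 16, 64])
```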
Subject-controlled videos generated by SubjectDrive. Given an image of a reference subject, SubjectDrive generates layout-aligned driving videos featuring the desired subject. By using reference subjects as control signals, SubjectDrive provides a mechanism for injecting external diversity into the generated data.
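How a reference subject enters the pipeline as a control signal can be pictured as one more entry in the conditioning set of the controllable module in panel (c) above. The sketch below assembles the three conditions (subject image via a frozen VAE encoder, text prompt via a frozen CLIP encoder, BEV layout sequence via a ControlNet-style branch); the encoder objects, function names, and the channel-wise concatenation of the image condition with the diffused noise are assumptions for illustration only.

```python
# Hypothetical assembly of conditioning signals for one subject-controlled call.
# The encoder callables stand in for the frozen VAE / frozen CLIP / ControlNet branches
# of the controllable module; names and the exact fusion are assumptions.
import torch


@torch.no_grad()
def build_conditions(vae_encode, clip_text_encode, bev_controlnet,
                     subject_image, prompt_tokens, bev_sequence, noisy_latents):
    """Return the conditioning dict consumed by the diffusion decoder.

    subject_image : (1, 3, H, W) image of the reference subject.
    prompt_tokens : tokenized scene description.
    bev_sequence  : (T, C_bev, H, W) projected boxes, depths, road map, camera poses.
    noisy_latents : (T, C_lat, h, w) diffused latents at the current denoising step.
    """
    # Frozen VAE encoder: subject image -> latent, combined here with the diffused
    # noise by channel-wise concatenation (the exact combination is an assumption).
    image_latent = vae_encode(subject_image)                       # (1, C_lat, h, w)
    image_cond = torch.cat(
        [noisy_latents, image_latent.expand(noisy_latents.shape[0], -1, -1, -1)], dim=1
    )

    # Frozen CLIP text encoder: prompt -> cross-attention context.
    text_cond = clip_text_encode(prompt_tokens)

    # ControlNet-style branch: BEV layout sequence -> residual control features.
    bev_cond = bev_controlnet(bev_sequence)

    return {"image": image_cond, "text": text_cond, "bev": bev_cond}
```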
Controllable multi-view videos generated by SubjectDrive. The generated synthetic data closely aligns with the specified BEV conditions, demonstrating strong layout control and alignment capabilities.
Multi-view videos generated by SubjectDrive. For six-view, eight-frame generation on the nuScenes validation set, SubjectDrive produces videos that are consistent across both time and camera views.
Feel free to contact us at huangbinyuan AT megvii.com or wangtiancai AT megvii.com