Binyuan Huang1*†, Yuqing Wen2*†, Yucheng Zhao3*, Yaosi Hu4*, Yingfei Liu3 , Fan Jia3, Weixin Mao3, Tiancai Wang3‡, Chi Zhang5, Chang Wen Chen4, Zhenzhong Chen1, Xiangyu Zhang3
1Wuhan University 2University of Science and Technology of China 3MEGVII Technology 4The Hong Kong Polytechnic University 5Mach Drive
*Equal Contribution †This work was done during an internship at MEGVII ‡Corresponding author
Overview of the proposed SubjectDrive framework and its effectiveness in enhancing BEV perception tasks. (a) The traditional data generation framework, which uses a control sequence and sampling noise to generate synthetic data. (b) In contrast, SubjectDrive introduces additional synthesis diversity by incorporating extra subject control. (c)-(d) Evaluation of detection and tracking performance under data scaling. (e) Illustration of using the SubjectDrive framework to produce perception training data for autonomous driving.
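The difference between panels (a) and (b) amounts to one extra argument at the generator's conditioning interface. The sketch below is a minimal, hypothetical illustration of that interface; `generate_clip`, `model.sample`, and the tensor shapes are illustrative assumptions, not part of any released SubjectDrive code.

```python
# Hypothetical sketch of the conditioning interfaces in panels (a) and (b).
# Function and attribute names are illustrative; they do not come from released code.
import torch


def generate_clip(model, bev_layouts, noise, subject_image=None):
    """Sample a synthetic driving clip with layout control and optional subject control.

    bev_layouts   : (T, C_bev, H, W) BEV layout sequence (boxes, depths, road map, poses).
    noise         : (T, C_lat, h, w) sampling noise for the diffusion process.
    subject_image : optional (3, H, W) reference image of the subject to feature.
    """
    conditions = {"bev": bev_layouts}
    if subject_image is not None:
        # Panel (b): the extra subject condition injects external diversity.
        conditions["subject"] = subject_image
    return model.sample(noise, conditions)  # -> (T, 3, H, W) video frames


# Panel (a): clip = generate_clip(model, layouts, noise)
# Panel (b): clip = generate_clip(model, layouts, noise, subject_image=reference)
```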
Detailed design of SubjectDrive. (a) The diffusion training process of SubjectDrive, built on a diffusion encoder and decoder equipped with the decomposed 4D attention module. (b) The decomposed 4D attention module comprises three components: intra-view attention for spatial processing within individual views, cross-view attention for interaction with adjacent views, and cross-frame attention for temporal processing. (c) The controllable module for integrating diverse control signals. Image conditions are derived from a frozen VAE encoder and combined with the diffused noise; text prompts are processed by a frozen CLIP encoder, while BEV sequences are handled via ControlNet. (d) Details of the BEV layout sequences, including projected bounding boxes, object depths, road maps, and camera poses.
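The decomposed 4D attention in panel (b) factorizes full spatio-temporal attention into three cheaper passes over a (batch, frames, views, tokens, channels) tensor. The PyTorch sketch below is our own minimal interpretation of that decomposition; module names, residual connections, and the cross-view grouping are assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class Decomposed4DAttention(nn.Module):
    """Minimal sketch of the decomposed 4D attention in panel (b).

    x has shape (B, T, V, N, C): batch, frames, camera views, spatial tokens, channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, V, N, C = x.shape

        # 1) Intra-view attention: spatial tokens within each single view and frame.
        s = x.reshape(B * T * V, N, C)
        x = x + self.intra_view(s, s, s, need_weights=False)[0].reshape(B, T, V, N, C)

        # 2) Cross-view attention: each token attends across the V camera views at the
        #    same frame and spatial position (a simple stand-in for engaging adjacent views).
        v = x.permute(0, 1, 3, 2, 4).reshape(B * T * N, V, C)
        v = self.cross_view(v, v, v, need_weights=False)[0]
        x = x + v.reshape(B, T, N, V, C).permute(0, 1, 3, 2, 4)

        # 3) Cross-frame attention: temporal processing across the T frames.
        t = x.permute(0, 2, 3, 1, 4).reshape(B * V * N, T, C)
        t = self.cross_frame(t, t, t, need_weights=False)[0]
        x = x + t.reshape(B, V, N, T, C).permute(0, 3, 1, 2, 4)
        return x


# Toy usage: 1 clip, 8 frames, 6 views, 16 tokens per view, 64 channels.
if __name__ == "__main__":
    attn = Decomposed4DAttention(dim=64)
    out = attn(torch.randn(1, 8, 6, 16, 64))
    print(out.shape)  # torch.Size([1, 8, 6, 16, 64])
```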
Subject-controlled videos generated by SubjectDrive. Given an image of a reference subject, SubjectDrive generates layout-aligned driving videos featuring the desired subject. By using reference subjects as control signals, SubjectDrive provides a mechanism for injecting external diversity into the generated data.
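How a reference subject enters the pipeline as a control signal can be pictured as one more entry in the conditioning set of the controllable module in panel (c) above. The sketch below assembles the three conditions (subject image via a frozen VAE encoder, text prompt via a frozen CLIP encoder, BEV layout sequence via a ControlNet-style branch); the encoder objects, function names, and the channel-wise concatenation of the image condition with the diffused noise are assumptions for illustration only.

```python
# Hypothetical assembly of conditioning signals for one subject-controlled call.
# The encoder callables stand in for the frozen VAE / frozen CLIP / ControlNet branches
# of the controllable module; names and the exact fusion are assumptions.
import torch


@torch.no_grad()
def build_conditions(vae_encode, clip_text_encode, bev_controlnet,
                     subject_image, prompt_tokens, bev_sequence, noisy_latents):
    """Return the conditioning dict consumed by the diffusion decoder.

    subject_image : (1, 3, H, W) image of the reference subject.
    prompt_tokens : tokenized scene description.
    bev_sequence  : (T, C_bev, H, W) projected boxes, depths, road map, camera poses.
    noisy_latents : (T, C_lat, h, w) diffused latents at the current denoising step.
    """
    # Frozen VAE encoder: subject image -> latent, combined here with the diffused
    # noise by channel-wise concatenation (the exact combination is an assumption).
    image_latent = vae_encode(subject_image)                       # (1, C_lat, h, w)
    image_cond = torch.cat(
        [noisy_latents, image_latent.expand(noisy_latents.shape[0], -1, -1, -1)], dim=1
    )

    # Frozen CLIP text encoder: prompt -> cross-attention context.
    text_cond = clip_text_encode(prompt_tokens)

    # ControlNet-style branch: BEV layout sequence -> residual control features.
    bev_cond = bev_controlnet(bev_sequence)

    return {"image": image_cond, "text": text_cond, "bev": bev_cond}
```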
Controllable multi-view videos generated by SubjectDrive. The generated synthetic data closely aligns with the specified BEV conditions, demonstrating strong layout control and alignment capabilities.
Multi-view videos generated by SubjectDrive. For six-view, eight-frame generation on the nuScenes validation set, SubjectDrive produces videos that are consistent across both time and camera views.
Feel free to contact us at huangbinyuan AT megvii.com or wangtiancai AT megvii.com