Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
A method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes.
I am currently a Marie Skłodowska-Curie Actions (MSCA) Fellow at VGG, University of Oxford, working with Andrea Vedaldi on feed-forward photorealistic 3D and 4D reconstruction. I was very fortunate to also collaborate with Andrew Zisserman, Christian Rupprecht, and Iro Laina at VGG. I received my Ph.D. in Computer Science from Nanyang Technological University (NTU), advised by Prof. Tat-Jen Cham and Prof. Jianfei Cai.
I am an incoming Nanyang Assistant Professor (starting Fall 2025) at the College of Computing and Data Science, Nanyang Technological University, where I lead the Physical Visual Group. My research focuses on Creative AI, aiming to develop systems that perceive, reconstruct, and interact with the physical world. The broader goal is to create realistic digital twins of the natural world with rich physical properties, capturing not only appearance, content, and geometry, but also occlusion, dynamics, gravity, interaction, sound, and more.
More Publications and Google Scholar
A feed-forward approach that increases the likelihood that a 3D generator directly outputs stable 3D objects.
A model for complete 3D reconstruction from partially visible inputs.
An interactive video generative model that can serve as a motion prior for part-level dynamics.
A fast and highly efficient model for 3D scene reconstruction from a single image, trainable on a single GPU in one day.
A study that probes large vision models to determine the extent to which they 'understand' different physical properties in an image.
A feed-forward approach for 360-degree scene-level novel view synthesis using only sparse observations.
A feed-forward approach for efficiently predicting 3D Gaussians from sparse multi-view images in a single forward pass.
A model that enables physical interaction with objects in images through part-level dragging.
A method to synthesize consistent novel views from a single image of open-set categories, without the need for explicit 3D representations.
A view-consistent indoor panorama outpainting model based on latent diffusion models.
A simple approach to avoid codebook collapse and achieve 100% codebook utilisation.
A unified discrete diffusion model for simultaneous vision-language generation.
A spatially conditional normalization introduced to address repeated artifacts in vector-quantized methods.
A model that lifts a 2D semantic map into a 3D NeRF and lets users edit the 3D model through 2D semantic input.
TFill fills in plausible content for both foreground object removal and content completion.
A high-level scene understanding system that simultaneously models the completed shape and appearance of all instances.
A novel spatially-correlative loss that is simple and efficient, yet effective at preserving scene structure consistency while supporting large appearance changes during unpaired I2I translation.
Given a masked image, the proposed PIC model is able to generate multiple diverse and plausible results.
Without using any real depth maps, the proposed model estimates depth for real scenes after training only on synthetic datasets.