Wednesday, 15/04/2026, 10:00
Xingjian Bai, PhD Student (supervised by Prof. Kaiming He), MIT
End-to-End Training for Unified Tokenization and Latent Denoising
Abstract
Training state-of-the-art latent diffusion models requires a complex multi-stage pipeline: a tokenizer must first be trained before a diffusion model can be trained in its frozen latent space. We propose UNITE, an architecture for unified tokenization and latent diffusion. A single Generative Encoder serves as both the image tokenizer and the latent generator via weight sharing, and is trained in a single stage that jointly optimizes both tasks. In this way, UNITE learns a common latent language for tokenization and generation.
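To make the weight-sharing idea in the abstract concrete, here is a minimal numpy sketch. Everything in it is an illustrative assumption, not the talk's actual method: the class name `GenerativeEncoder`, the use of a single matrix `W` for both the tokenizer and denoiser roles, and the simple reconstruction-plus-denoising objective are placeholders that only illustrate "one set of weights, two tasks, one joint loss".

```python
import numpy as np

rng = np.random.default_rng(0)

class GenerativeEncoder:
    """Toy weight-shared module (hypothetical sketch, not UNITE's actual code).

    The same weight matrix W is reused in two roles: tokenizing an image
    into a latent, and denoising a perturbed latent.
    """

    def __init__(self, dim_in=8, dim_latent=4):
        self.W = rng.normal(scale=0.1, size=(dim_in, dim_latent))

    def tokenize(self, x):
        # Tokenizer role: project image features into the latent space.
        return x @ self.W

    def denoise(self, z_noisy):
        # Generator role: map a noisy latent back toward a clean latent,
        # reusing the very same W (the weight-sharing idea).
        return (z_noisy @ self.W.T) @ self.W

def joint_loss(model, x, noise_scale=0.1):
    """Single-stage objective: tokenizer reconstruction + latent denoising,
    optimized jointly rather than in two separate training stages."""
    z = model.tokenize(x)
    # Reconstruction term, with W^T standing in for a decoder.
    x_rec = z @ model.W.T
    rec = np.mean((x - x_rec) ** 2)
    # Denoising term on a Gaussian-perturbed latent.
    z_noisy = z + noise_scale * rng.normal(size=z.shape)
    den = np.mean((z - model.denoise(z_noisy)) ** 2)
    return rec + den

model = GenerativeEncoder()
x = rng.normal(size=(16, 8))
loss = joint_loss(model, x)
```

In a real system both roles would be deep networks and the denoising term would follow a diffusion objective; the point of the sketch is only that one parameter set receives gradients from both tasks in a single training stage.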

