Before my PhD, I received my M.Sc. and B.Sc. degrees in Physics from the University of Tübingen. For my master's, I researched disentangled representation learning with Wieland Brendel and Matthias Bethge at the Tübingen AI Center. During my bachelor's, I worked on 3D face reconstruction with Timo Bolkart and Michael J. Black at MPI-IS.
I was born and raised in Beijing.
I'm interested in machine learning, computer vision, and computer graphics—particularly in how we can understand and recreate the physical world in a principled and generalizable way. My research explores the underlying structures in visual data—from 3D/4D representations and group symmetries to symbolic abstractions, physical priors, and beyond—to enable more controllable and interpretable AI systems. I focus on inverse graphics and controllable generation, often investigating and leveraging foundation models.
Selected publications are listed below.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng*,
Junyi Zhang*,
Qianqian Wang,
Yufei Ye,
Pengcheng Yu,
Michael J. Black,
Trevor Darrell,
Angjoo Kanazawa
(*Equal contribution, listed alphabetically)
Preprint, 2025
project page /
arXiv
What is the minimal representation required to model both geometry and correspondence of the 4D world? Could it be done in a unified framework? It turns out to be remarkably simple! We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from monocular video.
GenLit: Reformulating Single-Image Relighting as Video Generation
Shrisha Bharadwaj*,
Haiwen Feng*,
Victoria Fernandez Abrevaya,
Michael J. Black
(*Equal contribution, listed alphabetically)
Preprint, 2025
project page /
arXiv
Could we reformulate single-image relighting as a video generation task? By leveraging video diffusion models and a small synthetic dataset, we achieve realistic relighting effects with cast shadows on real images, without any explicit 3D representation. Our work reveals the potential of foundation models in understanding physical properties and performing graphics tasks.
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman*,
Haiwen Feng*ᐩ,
Michael J. Black,
Dimitrios Tzionas,
Victoria Fernandez Abrevaya
(*Equal contribution, ᐩProject lead)
CVPR, 2025
project page /
arXiv
Can we generate physical interactions without physics simulation? We leverage video foundation models as implicit physics simulators. Given an initial frame and a control signal for a driving object, InterDyn generates plausible, temporally consistent videos of complex object interactions. Our work demonstrates the potential of using video generative models to understand and predict real-world physical dynamics.
We asked whether LLMs can "imagine" how the corresponding graphics content would look without visually seeing it!
This task requires both low-level skills (e.g., counting objects, identifying colors) and high-level reasoning (e.g., interpreting affordances, understanding semantics).
Our benchmark effectively differentiates models by their reasoning abilities, with performance consistently aligning with scaling laws!
We explored whether inverse graphics could be approached as a code generation task and found that it generalizes surprisingly well to OOD cases!
However, is it optimal for graphics? Our research identifies a fundamental limitation of LLMs for parameter estimation and offers a simple but effective solution.
We proposed "Time-Reversal Fusion" to enable the image-to-video model to generate towards a given end frame without any tuning. It not only provides a unified solution for three visual tasks but also probes the dynamic generation capability of the video diffusion model.
We proposed a principled PEFT method by orthogonally fine-tuning the pretrained model, resulting in superior alignment and faster convergence for controllable synthesis.
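To give a flavor of the idea, here is a minimal PyTorch-style sketch of orthogonally fine-tuning a single pretrained linear layer; the Cayley parameterization and the full-rank rotation below are one simple way to keep the transform orthogonal, not necessarily the exact design used in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OFTLinear(nn.Module):
        # Sketch of orthogonal fine-tuning for one linear layer: the pretrained
        # weight W is frozen; only an orthogonal transform R is learned, and the
        # layer uses R @ W at inference time.
        def __init__(self, pretrained: nn.Linear):
            super().__init__()
            self.register_buffer("weight", pretrained.weight.detach().clone())  # frozen (out, in)
            self.register_buffer("bias", pretrained.bias.detach().clone()
                                 if pretrained.bias is not None else None)
            d = self.weight.shape[0]
            self.skew = nn.Parameter(torch.zeros(d, d))  # trainable parameters of the transform

        def forward(self, x):
            S = self.skew - self.skew.T                      # skew-symmetric by construction
            I = torch.eye(S.shape[0], device=S.device)
            R = (I - S) @ torch.linalg.inv(I + S)            # Cayley transform -> orthogonal R
            return F.linear(x, R @ self.weight, self.bias)   # rotate the pretrained weights only

Only the skew-symmetric parameters are trained; the pretrained weights are rotated, never rescaled.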
We extended SE(3) equivariance to articulated scenarios, achieving principled generalization to OOD body poses with 60% less error, using a network that is 1000 times faster and only 2.7% of the size of the previous state-of-the-art model.
We conducted a systematic analysis of skin tone bias in 3D face albedo reconstruction and proposed the first unbiased albedo estimation evaluation suite (benchmark + metric). Additionally, we developed a principled method that reduces this bias by 80%.
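As a hedged illustration of how such a skin-tone-aware evaluation can be set up, the sketch below uses the Individual Typology Angle (ITA) computed over CIE-Lab albedo values; the function names and the simple error aggregation are illustrative, not the benchmark's actual API.

    import numpy as np

    def individual_typology_angle(lab):
        # lab: (..., 3) array of CIE-Lab albedo values (L*, a*, b*).
        # ITA = arctan((L* - 50) / b*) in degrees; lighter skin tones give larger ITA.
        L, b = lab[..., 0], lab[..., 2]
        return np.degrees(np.arctan2(L - 50.0, b))

    def albedo_bias_gap(pred_lab, gt_lab, skin_mask):
        # Illustrative bias measure: mean ITA error inside the skin region.
        # An unbiased estimator should keep this error flat across ground-truth skin tones.
        err = np.abs(individual_typology_angle(pred_lab) - individual_typology_angle(gt_lab))
        return err[skin_mask].mean()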