Before my PhD, I received M.Sc. and B.Sc. degrees in Physics from the University of Tübingen. For my master's, I researched disentangled representation learning with Wieland Brendel and Matthias Bethge at the Tübingen AI Center. During my bachelor's, I worked on 3D face reconstruction with Timo Bolkart and Michael J. Black at MPI-IS.
I was born and raised in Beijing.
I'm interested in machine learning, computer vision, and computer graphics, and particularly in how to understand and recreate our physical world in a principled and generalizable way. My research centers on the structural understanding and generation of visual data, with a focus on inverse graphics and controllable generation with foundation models.
Selected publications are listed below.
GenLit: Reformulating Single-Image Relighting as Video Generation
Shrisha Bharadwaj*,
Haiwen Feng*,
Victoria Fernandez Abrevaya,
Michael J. Black
(*Equal contribution, listed alphabetically)
Preprint, 2025
project page /
arXiv
Could we reformulate single-image relighting as a video generation task? By leveraging video diffusion models and a small synthetic dataset, we achieve realistic relighting effects with cast shadows on real images, without any explicit 3D representation. Our work reveals the potential of foundation models for understanding physical properties and performing graphics tasks.
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman*,
Haiwen Feng*ᐩ,
Michael J. Black,
Dimitrios Tzionas,
Victoria Fernandez Abrevaya
(*Equal contribution, ᐩProject lead)
Preprint, 2025
project page /
arXiv
Can we generate physical interactions without physics simulation? We leverage video foundation models as implicit physics simulators. Given an initial frame and a control signal encoding the motion of a driving object, InterDyn generates plausible, temporally consistent videos of complex object interactions. Our work demonstrates the potential of using video generative models to understand and predict real-world physical dynamics.
We asked whether LLMs can "imagine" how the corresponding graphics content would look without visually seeing it!
This task requires both low-level skills (e.g., counting objects, identifying colors) and high-level reasoning (e.g., interpreting affordances, understanding semantics).
Our benchmark effectively differentiates models by their reasoning abilities, with performance consistently following scaling laws!
We explored whether inverse graphics could be approached as a code generation task and found that it generalizes surprisingly well to OOD cases!
However, is it optimal for graphics? Our research identifies a fundamental limitation of LLMs for parameter estimation and offers a simple but effective solution.
We proposed "Time-Reversal Fusion" to enable the image-to-video model to generate towards a given end frame without any tuning. It not only provides a unified solution for three visual tasks but also probes the dynamic generation capability of the video diffusion model.
We proposed a principled PEFT method that orthogonally fine-tunes the pretrained model, resulting in superior alignment and faster convergence for controllable synthesis.
We extended SE(3) equivariance to articulated scenarios, achieving principled generalization to OOD body poses with 60% lower error, using a network that is 1,000× faster and only 2.7% of the size of the previous state-of-the-art model.
We conducted a systematic analysis of skin tone bias in 3D face albedo reconstruction and proposed the first unbiased albedo estimation evaluation suite (benchmark + metric). Additionally, we developed a principled method that reduces this bias by 80%.