Before my PhD, I received both M.Sc. and B.Sc. degrees in Physics at the University of Tübingen, where I researched disentangled representation learning with Wieland Brendel and Matthias Bethge, and 3D morphable models with Timo Bolkart and Michael J. Black at MPI-IS.
My research focuses on building structural world models that are controllable by human and verifiable by design. I study the fundamental structures underlying visual data, including 3D geometry, group symmetries, symbolic abstractions, and physical priors, with the goal of developing learning systems that are both steerable and scalable.
More recently, I have been working on post-training of multimodal foundation models.
VIGA reframes inverse graphics as an agentic reasoning process: an MLLM iteratively writes, renders, compares, and revises graphics programs through a closed loop with evolving context memory, without any task-specific modules or fine-tuning.
Maximum Likelihood Reinforcement Learning
Fahim Tajwar*,
Guanning Zeng*,
Yueer Zhou,
Yuda Song,
Daman Arora,
Yiding Jiang,
Jeff Schneider,
Ruslan Salakhutdinov,
Haiwen Feng,
Andrea Zanette
arXiv, 2026
project page /
arXiv
Standard RL objectives drift away from maximum likelihood as compute grows. MaxRL closes this gap with a compute-indexed family of objectives that smoothly interpolates between REINFORCE and maximum likelihood, equipped with an unbiased policy-gradient estimator. It Pareto-dominates existing methods across every tested model and task, yielding up to 20x test-time scaling efficiency over GRPO.
Flow Matching Policy Gradients
David McAllister*,
Songwei Ge*,
Brent Yi*,
Chung Min Kim,
Ethan Weber,
Hongsuk Choi,
Haiwen Feng,
Angjoo Kanazawa
ICLR, 2026
project page /
arXiv
FPO brings flow matching into on-policy RL by casting policy optimization as maximizing an advantage-weighted ratio derived from the conditional flow matching loss, fully compatible with PPO-clip. Unlike prior diffusion-based RL methods, FPO is agnostic to the choice of integrator at training and inference time, and shows that flow-based policies learn multimodal action distributions that outperform Gaussian policies.
Visually prompted VLM benchmarks are far more fragile than they look: seemingly irrelevant details — marker color, size, even JPEG compression — can completely flip model rankings on leaderboards. We introduce VPBench, a larger benchmark with 16 visual marker variants designed to expose this instability and enable robust evaluation.
GenLit: Reformulating Single-Image Relighting as Video Generation
Shrisha Bharadwaj*,
Haiwen Feng*,
Giorgio Becherini,
Victoria Fernandez Abrevaya,
Michael J. Black
(*Equal contribution, listed alphabetically)
SIGGRAPH Asia, 2025
project page /
arXiv
Could we reformulate single image near-field relighting as a video generation task? By leveraging video diffusion models and a small synthetic dataset, we achieve realistic relighting effects with cast shadows on real images without explicit 3D. Our work reveals the potential of foundation models in understanding physical properties and performing graphics tasks.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng*,
Junyi Zhang*,
Qianqian Wang,
Yufei Ye,
Pengcheng Yu,
Michael J. Black,
Trevor Darrell,
Angjoo Kanazawa
(*Equal contribution, listed alphabetically)
ICCV, 2025
project page /
arXiv
What is the minimal representation required to model both geometry and correspondence of the 4D world? Could it be done in a unified framework? It turns out to be remarkably simple! We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate from monocular video, embedding the 4D understanding into the transformer-based backbone.
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Boqian Li,
Haiwen Fengᐩ,
Zeyu Cai,
Michael J. Black,
Yuliang Xiu
(ᐩProject lead)
ICCV, 2025   (Highlight Presentation) project page /
arXiv
ArtEq was great, but how do we fit a body to a 3D clothed human point cloud? We propose ETCH, a novel pipeline leveraging local SE(3) equivariance to map cloth-to-body surfaces, encoding tightness as displacement vectors. ETCH simplifies fitting into a marker fitting task, outperforming state-of-the-art methods in accuracy and generalization across challenging scenarios.
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman*,
Haiwen Feng*ᐩ,
Michael J. Black,
Dimitrios Tzionas,
Victoria Fernandez Abrevaya
(*Equal contribution ᐩProject lead)
CVPR, 2025
project page /
arXiv
Can we generate physical interactions without physics simulation? We leverage video foundation models as implicit physics simulators. Given an initial frame and a control signal of a driving object, InterDyn generates plausible, temporally consistent videos of complex object interactions. Our work demonstrates the potential of using video generative models to understand and predict real-world physical dynamics.
We questioned whether LLMs can "imagine" how the corresponding graphics content would look without visually seeing it!
This task requires both low-level skills (e.g., counting objects, identifying colors) and high-level reasoning (e.g., interpreting affordances, understanding semantics).
Our benchmark effectively differentiates models by their reasoning abilities, with performance consistently aligning with the scaling law!
We explored if inverse graphics could be approached as a code generation task and found it generalize surprisingly well to OOD cases!
However, is it optimal for graphics? Our research identifies a fundamental limitation of LLMs for parameter estimation and offers a simple but effective solution.
We proposed "Time-Reversal Fusion" to enable the image-to-video model to generate towards a given end frame without any tuning. It not only provides a unified solution for three visual tasks but also probes the dynamic generation capability of the video diffusion model.
We proposed a principled PEFT method by orthogonally fine-tuning the pretrained model, resulting in superior alignment and faster convergence for controllable synthesis.
We extended SE(3) Equivariance to articulated scenarios, achieving principled generalization for OOD body poses with 60% less error, and a network 1000 times faster and only 2.7% the size of the previous state-of-the-art model.
We conducted a systematic analysis of skin tone bias in 3D face albedo reconstruction and proposed the first unbiased albedo estimation evaluation suite (benchmark + metric). Additionally, we developed a principled method that reduces this bias by 80%.