VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

¹GenAI, Meta   ²Tel Aviv University
*Work was done while the first author was an intern at GenAI, Meta

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.

Qualitative Results by VideoJAM

We present qualitative results generated by VideoJAM-30B on challenging prompts that require the generation of complex motion types.




Qualitative Comparison: VideoJAM vs. the Base Model (DiT-30B)

We provide an apples-to-apples qualitative comparison between VideoJAM and the base model it was fine-tuned from, DiT-30B.


How does it work?

We present VideoJAM, a framework that explicitly instills a strong motion prior into any video generation model.

VideoJAM consists of two units. (a) Training. Given an input video x1 and its motion representation d1, both signals are noised and embedded into a single joint latent representation using a linear layer, W_in+. The diffusion model processes this input, and two linear projection layers (W_out+) predict both the appearance and the motion from the joint representation.
(b) Inference. We propose Inner-Guidance, where the model's own noisy motion prediction is used to guide the video prediction at each step (see the sketch below).
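
To make the two units concrete, below is a minimal, PyTorch-style sketch of one training step and one Inner-Guidance denoising step. All names (`dit_backbone`, `w_in_plus`, `w_out_video`, `w_out_motion`, `noise_fn`), tensor layouts, and guidance weights are illustrative assumptions, and the guidance combination is a classifier-free-guidance-style approximation of the mechanism rather than the exact formulation from the paper.

```python
# Illustrative sketch of VideoJAM's two units (training and Inner-Guidance).
# Module names, shapes, and guidance weights are assumptions, not the released code.
import torch
import torch.nn.functional as F


def training_step(dit_backbone, w_in_plus, w_out_video, w_out_motion,
                  video_latent, motion_latent, cond, t, noise_fn):
    """One training step: predict appearance AND motion from a single joint latent."""
    # Noise the video and its motion representation with the same schedule.
    noisy_video, video_target = noise_fn(video_latent, t)
    noisy_motion, motion_target = noise_fn(motion_latent, t)

    # W_in+ : embed the concatenated appearance+motion signals into one joint latent.
    joint_in = w_in_plus(torch.cat([noisy_video, noisy_motion], dim=-1))
    joint_out = dit_backbone(joint_in, cond, t)

    # W_out+ : two linear heads read appearance and motion back out of the
    # shared representation.
    video_pred = w_out_video(joint_out)
    motion_pred = w_out_motion(joint_out)

    # Extended objective: reconstruct both the pixels and their motion.
    return F.mse_loss(video_pred, video_target) + F.mse_loss(motion_pred, motion_target)


def inner_guidance_step(dit_backbone, w_in_plus, w_out_video, w_out_motion,
                        noisy_video, noisy_motion, cond, null_cond, t,
                        w_text=7.5, w_motion=3.0):
    """One denoising step where the model's own (noisy) motion prediction
    serves as an additional, dynamic guidance signal."""
    def predict(c, motion):
        joint = dit_backbone(w_in_plus(torch.cat([noisy_video, motion], dim=-1)), c, t)
        return w_out_video(joint), w_out_motion(joint)

    # Full pass: text condition + the current motion estimate.
    v_full, motion_pred = predict(cond, noisy_motion)
    # Drop the text condition (classifier-free-guidance-style term).
    v_no_text, _ = predict(null_cond, noisy_motion)
    # Drop the motion signal (approximated here by zeroing it) to isolate its effect.
    v_no_motion, _ = predict(cond, torch.zeros_like(noisy_motion))

    # Combine the terms so the generation is steered toward coherent motion.
    v_guided = (v_full
                + w_text * (v_full - v_no_text)
                + w_motion * (v_full - v_no_motion))
    return v_guided, motion_pred
```

The point the sketch tries to convey is structural: a single joint representation feeds two prediction heads during training, and at inference the motion head's own output is fed back as a guidance signal rather than an external control.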


Qualitative Comparison: VideoJAM-bench

We present qualitative comparisons to the leading baselines (the proprietary models Sora, Kling, and Runway Gen3) and to the base model from which VideoJAM was fine-tuned (DiT-30B) on representative prompts from our motion benchmark, VideoJAM-bench.


Compared models: Runway Gen3, Sora, Kling 1.5, DiT (base model), and DiT+VideoJAM.

Prompts:
A woman doing a headstand on the beach.
A woman engaging in a challenging workout routine, performing pull-ups on green bars.
A woman enjoying the fun of hula hooping.
A roulette wheel in a dimly lit room or casino floor. In the center of the wheel, there's a small white ball that appears to be spinning rapidly.
A hand spinning a yellow fidget spinner.
A man exercising with battle ropes at a gym.
A giraffe running through an open field. The background is a bright blue sky with fluffy white clouds.
Modern urban street ballet dancer performing acrobatics and jumps.

BibTeX

If you find this project useful for your research, please cite the following:

@article{chefer2025VideoJAM,
        title={VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models},
        author={Chefer, Hila and Singer, Uriel and Zohar, Amit and Kirstain, Yuval and Polyak, Adam and Taigman, Yaniv and Wolf, Lior and Sheynin, Shelly},
        journal={TBD},
        year={2025}
}