VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

¹GenAI, Meta   ²Tel Aviv University
*Work was done while the first author was an intern at GenAI, Meta

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.

Qualitative Results by VideoJAM

We present qualitative results generated by VideoJAM-30B on challenging prompts that require the generation of complex motion types.




Qualitative Comparison: VideoJAM vs. the Base Model (DiT-30B)

We provide an apples-to-apples qualitative comparison between VideoJAM and the base model it was fine-tuned from, DiT-30B.


How does it work?

We present VideoJAM, a framework that explicitly instills a strong motion prior into any video generation model.

VideoJAM consists of two units. (a) Training. Given an input video x1 and its motion representation d1, both signals are noised and embedded into a single joint latent representation using a linear layer, W_in+. The diffusion model processes this input, and two linear projection layers (W_out+) predict both the appearance and the motion from the joint representation.
(b) Inference. We propose Inner-Guidance, where the model's own noisy motion prediction is used to guide the video prediction at each step (see the sketch below).
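
To make the two units concrete, below is a minimal, PyTorch-style sketch of one training step and one Inner-Guidance denoising step. All names (`dit_backbone`, `w_in_plus`, `w_out_video`, `w_out_motion`, `noise_fn`), tensor layouts, and guidance weights are illustrative assumptions, and the guidance combination is a classifier-free-guidance-style approximation of the mechanism rather than the exact formulation from the paper.

```python
# Illustrative sketch of VideoJAM's two units (training and Inner-Guidance).
# Module names, shapes, and guidance weights are assumptions, not the released code.
import torch
import torch.nn.functional as F


def training_step(dit_backbone, w_in_plus, w_out_video, w_out_motion,
                  video_latent, motion_latent, cond, t, noise_fn):
    """One training step: predict appearance AND motion from a single joint latent."""
    # Noise the video and its motion representation with the same schedule.
    noisy_video, video_target = noise_fn(video_latent, t)
    noisy_motion, motion_target = noise_fn(motion_latent, t)

    # W_in+ : embed the concatenated appearance+motion signals into one joint latent.
    joint_in = w_in_plus(torch.cat([noisy_video, noisy_motion], dim=-1))
    joint_out = dit_backbone(joint_in, cond, t)

    # W_out+ : two linear heads read appearance and motion back out of the
    # shared representation.
    video_pred = w_out_video(joint_out)
    motion_pred = w_out_motion(joint_out)

    # Extended objective: reconstruct both the pixels and their motion.
    return F.mse_loss(video_pred, video_target) + F.mse_loss(motion_pred, motion_target)


def inner_guidance_step(dit_backbone, w_in_plus, w_out_video, w_out_motion,
                        noisy_video, noisy_motion, cond, null_cond, t,
                        w_text=7.5, w_motion=3.0):
    """One denoising step where the model's own (noisy) motion prediction
    serves as an additional, dynamic guidance signal."""
    def predict(c, motion):
        joint = dit_backbone(w_in_plus(torch.cat([noisy_video, motion], dim=-1)), c, t)
        return w_out_video(joint), w_out_motion(joint)

    # Full pass: text condition + the current motion estimate.
    v_full, motion_pred = predict(cond, noisy_motion)
    # Drop the text condition (classifier-free-guidance-style term).
    v_no_text, _ = predict(null_cond, noisy_motion)
    # Drop the motion signal (approximated here by zeroing it) to isolate its effect.
    v_no_motion, _ = predict(cond, torch.zeros_like(noisy_motion))

    # Combine the terms so the generation is steered toward coherent motion.
    v_guided = (v_full
                + w_text * (v_full - v_no_text)
                + w_motion * (v_full - v_no_motion))
    return v_guided, motion_pred
```

The point the sketch tries to convey is structural: a single joint representation feeds two prediction heads during training, and at inference the motion head's own output is fed back as a guidance signal rather than an external control.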


Qualitative Comparison: VideoJAM-bench

We present qualitative comparisons to the leading baselines (the proprietary models Sora, Kling, and Runway Gen3) and to the base model from which VideoJAM was fine-tuned (DiT-30B) on representative prompts from our motion benchmark, VideoJAM-bench.


Compared models: Runway Gen3, Sora, Kling 1.5, DiT (base model), and DiT+VideoJAM.

Prompts:
A woman doing a headstand on the beach.
A woman engaging in a challenging workout routine, performing pull-ups on green bars.
A woman enjoying the fun of hula hooping.
A roulette wheel in a dimly lit room or casino floor. In the center of the wheel, there's a small white ball that appears to be spinning rapidly.
A hand spinning a yellow fidget spinner.
A man exercising with battle ropes at a gym.
A giraffe running through an open field. The background is a bright blue sky with fluffy white clouds.
Modern urban street ballet dancer performing acrobatics and jumps.

BibTeX

If you find this project useful for your research, please cite the following:

@article{chefer2025VideoJAM,
        title={VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models},
        author={Chefer, Hila and Singer, Uriel and Zohar, Amit and Kirstain, Yuval and Polyak, Adam and Taigman, Yaniv and Wolf, Lior and Sheynin, Shelly},
        journal={TBD},
        year={2025}
}