AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

¹Beihang University, ²Shanghai AI Laboratory, ³The University of Hong Kong
Corresponding author

TL;DR: We present a novel, efficient distillation method that accelerates video diffusion models using a synthetic dataset. Our method is 8.5x faster than HunyuanVideo.

Abstract

Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel and efficient method, AccVideo, that reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which avoids the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-second, 720x1280, 24fps videos.

Method

(a) Our method first designs a trajectory-based few-step guidance, which utilizes the key data points from the denoising trajectory to enable the student model to mimic the denoising process of the pretrained video diffusion model with fewer steps. (b) To fully exploit the data distribution at each diffusion timestep captured by our synthetic dataset, we propose an adversarial training strategy to align the output distribution of the student model with that captured by our synthetic dataset.
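The two components above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the linear `teacher_trajectory`, the evenly spaced key points in `select_key_points`, the per-step mean-squared-error guidance loss, and the non-saturating GAN losses are all assumptions chosen for illustration, and every function name is hypothetical. Part (a) keeps only a few key data points from a long teacher trajectory and scores a student on jumping from one key point to the next; part (b) shows one common form of adversarial loss that could be used to match the student's outputs to the synthetic data.

```python
import math

# --- (a) Trajectory-based few-step guidance (toy version) ---

def teacher_trajectory(noise, target, num_steps):
    """Hypothetical stand-in for a pretrained teacher: moves linearly
    from `noise` to `target` in `num_steps` steps and records every state."""
    trajectory = [list(noise)]
    for k in range(1, num_steps + 1):
        a = k / num_steps
        trajectory.append([(1 - a) * n + a * t for n, t in zip(noise, target)])
    return trajectory  # num_steps + 1 states, noise first, sample last

def select_key_points(trajectory, num_student_steps):
    """Keep num_student_steps + 1 evenly spaced states (an assumed spacing)."""
    last = len(trajectory) - 1
    indices = [(i * last) // num_student_steps
               for i in range(num_student_steps + 1)]
    return [trajectory[i] for i in indices]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def guidance_loss(student_step, key_points):
    """Average error of the student when it jumps from each key point
    to the next one in a single step."""
    pairs = list(zip(key_points[:-1], key_points[1:]))
    total = sum(mse(student_step(x, i), y) for i, (x, y) in enumerate(pairs))
    return total / len(pairs)

# --- (b) Adversarial alignment (assumed non-saturating GAN losses) ---

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def discriminator_loss(real_logits, fake_logits):
    """Real = synthetic-dataset points, fake = student outputs."""
    real = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real + fake

def generator_loss(fake_logits):
    """The student is rewarded when the discriminator rates its outputs as real."""
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)
```

For example, with a 50-step toy teacher and a 5-step student, `select_key_points` keeps 6 of the 51 recorded states, so the student is only ever trained on states that lie on a valid denoising trajectory.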

Qualitative Results

Text-to-Video Generation (5s, 720x1280, 24fps)

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
An extreme close-up of an gray-haired man with a beard in his 60s. He is dressed in a wool coat suit coat with a button-down shirt. He wears a brown beret and glasses and has a very professorial appearance, depth of field, cinematic 35mm film.
A middle-aged sad bald man becomes happy as a wig of curly hair and sunglasses fall suddenly on his head.
A wide-angle view of a dramatic cliffside overlooking the ocean, waves crashing against the rocks far below.
A dog wearing virtual reality goggles in sunset.
Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
A girl raises her left hand to cover her smiling mouth.
A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.
This close-up shot of a chameleon showcases its striking color changing capabilities. The background is blurred, drawing attention to the animal’s striking appearance.
A snowboarder carves down a steep slope, their board cutting swiftly through the snow. Powder sprays in all directions as they zigzag.
The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope.
A capybara relaxes in a wooden barrel filled with steaming hot spring water, its serene gaze adding tranquility to the scene. Perched atop its head is a vibrant orange, adding a playful contrast to its soft brown fur.
A retro, 70s Urban Grit style scene shows a lone astronaut wandering through a desolate Martian landscape. The colors are muted and dusty, with a worn-down, rugged space suit adding to the gritty, survivalist tone as they search for signs of life against the backdrop of a blood-red sky.
Medium shot, An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style.
An astronaut riding a horse, high definition, 4k.
An imposing, atomic-powered, retro-futuristic robot strides down the red carpet at a glamorous movie premiere. Its bulky, gleaming exosuit shines under the bright lights of camera flashes, reflecting the glitz of the event. The robot’s large, round helmet, with its glowing visor, gives it an air of mysterious authority, while the articulated joints in its thick, metallic arms and legs move with precision. Its jetpack, attached to its back, hums softly as it powers the machine forward, and the crowd marvels at the fusion of vintage design and futuristic technology.
A man is holding a bass and is positioned in front of a microphone in a room. He appears to be speaking or singing into the microphone.
A western princess, with sunlight shining through the leaves on her face, facial close-up.
An older man playing piano, lit from the side, advertising style.
A girl raises her left hand to cover her smiling mouth.
A birthday cake in the plate.

Text-to-Video Generation (4s, 544x960, 24fps)

A honeybee drifting between lavender blossoms. Each wingbeat slowed to a gentle wave, pollen particles floating in still air. In super slow motion, even the bee's compound eyes shimmer, revealing details normally invisible to the human eye.
A man wearing a white protective suit, blue gloves, and a mask is holding a water gun and spraying water on the plants in a greenhouse.
A personified cat wearing suits walking on the street.
A very high waterfall pouring down.
A western princess, with sunlight shining through the leaves on her face, facial close-up.
A female knight holding a heavy sword stands in front of a Gothic castle in medieval style.
A beautiful woman walking on the school playground. The sun shining on her face.
The camera rotates around a large stack of vintage televisions all showing different programs.
The monster stared at the food with wide eyes and open mouth. Its posture and expression convey a sense of innocence and playfulness.
A panda in a scientist's lab coat, conducting experiments with beakers and test tubes.
A toy robot wearing purple overalls and cowboy boots taking a pleasant stroll in Johannesburg South Africa during a beautiful sunset.
In a modern, upscale hotel suite, the camera starts from the center of the living room. The room features light-colored sofas and large floor-to-ceiling windows overlooking the city's skyscrapers and night view.
A snowboarder carves down a steep slope, their board cutting swiftly through the snow. Powder sprays in all directions as they zigzag.
A beautiful woman and a handsome man kissing in the rain.
The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope.
A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere.
A superintelligent humanoid robot waking up. The robot has a sleek metallic body with futuristic design features. Its glowing red eyes are the focal point, emanating a sharp, intense light as it powers on. The scene is set in a dimly lit, high-tech laboratory filled with glowing control panels, robotic arms, and holographic screens. The setting emphasizes advanced technology and an atmosphere of mystery. The ambiance is eerie and dramatic, highlighting the moment of awakening and the robots immense intelligence. Photorealistic style with a cinematic, dark sci-fi aesthetic. Aspect ratio: 16:9 --v 6.1
A majestic lion strides across the golden savanna, its powerful frame glistening under the warm afternoon sun. The tall grass ripples gently in the breeze, enhancing the lion's commanding presence. The tone is vibrant, embodying the raw energy of the wild. Low angle, steady tracking shot, cinematic.
a cute raccoon playing guitar in the park at sunrise, oil painting style.
A side profile shot of a woman with fireworks exploding in the distance beyond her.
Stars, water, brilliantly, gorgeous large scale scene.
Man eating a burger and leave bite marks.
An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea.
A fluffy orange cat sits comfortably on a soft, patterned rug, carefully chewing on a tender piece of chicken. The camera begins directly in front, capturing the cat’s bright eyes, which flicker between its meal and the vibrant TV screen in the background. Slowly, the camera starts to rotate, revealing the side profile of the cat, its sharp whiskers twitching as it savors each bite. As the camera continues its smooth 180-degree journey, the back of the cat comes into view. Its striped tail is curled neatly beside it, and the gentle glow from the TV reflects softly on its fur, creating a serene and intimate moment of quiet contentment.
 

BibTeX

   
@article{zhang2025accvideo,
    title={AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset},
    author={Zhang, Haiyu and Chen, Xinyuan and Wang, Yaohui and Liu, Xihui and Wang, Yunhong and Qiao, Yu},
    journal={arXiv preprint arXiv:2503.19462},
    year={2025}
}