Abstract
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlights the value of our dataset for coherent animated video generation.
Dataset Structure
The dataset can be downloaded from Hugging Face.
We currently provide:
- video_ids.txt: A list of video IDs for downloading source videos with yt-dlp.
- dataset_anime_shooter.zip: The complete annotations in JSON format, including reference image masks.
- dataset_anime_shooter_audio.zip: The subset with audio annotations.
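As a rough illustration of how the ID list might be used, the sketch below reads video_ids.txt and builds one yt-dlp command per ID. It assumes the IDs are YouTube video IDs and that yt-dlp is installed and on the PATH; the helper names and the output directory are hypothetical and not part of the dataset release.

```python
import subprocess
from pathlib import Path


def read_video_ids(path):
    """Return the non-empty, whitespace-stripped lines of an ID list file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_download_command(video_id, out_dir="videos"):
    """Build a yt-dlp command line for one video ID.

    Assumption: the ID refers to a YouTube video, so it is expanded
    into a standard watch URL.
    """
    url = f"https://www.youtube.com/watch?v={video_id}"
    # -o sets yt-dlp's output filename template.
    return ["yt-dlp", "-o", f"{out_dir}/%(id)s.%(ext)s", url]


def download_all(ids_path="video_ids.txt", out_dir="videos"):
    """Download every listed video, continuing past individual failures."""
    for video_id in read_video_ids(ids_path):
        subprocess.run(build_download_command(video_id, out_dir), check=False)
```

Calling `download_all()` from the directory that contains video_ids.txt would attempt each download in turn. Some source videos may no longer be available, which is why failures are not treated as fatal here.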
Dataset Fields
The dataset_anime_shooter.zip archive is structured as a collection of JSON files, where each file corresponds to a single video. Each JSON file contains the following fields:
Video Level
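Since the archive holds one JSON file per video, the annotations can be read without unpacking it. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and it makes no assumption about the field names inside each file.

```python
import json
import zipfile


def load_annotations(zip_path):
    """Map each JSON member name in the archive to its parsed content."""
    annotations = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directories and any non-JSON members (e.g. image masks).
            if not name.endswith(".json"):
                continue
            with archive.open(name) as handle:
                annotations[name] = json.load(handle)
    return annotations
```

For example, `load_annotations("dataset_anime_shooter.zip")` would return a dictionary with one entry per video, keyed by the file's path inside the archive.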
Dataset Sample
We show an example of AnimeShooter telling the story of Luna and her father. The annotation script contains the story-level annotation for the story overview and detailed shot-level annotations for each shot.
Story-level Annotation
Storyline: A young girl named Luna dreams of becoming an astronaut. She celebrates her birthday with her father, a shoemaker, who makes her a pair of rocket boots.
Main Characters:


Main Scenes:
Shot-level Annotation for Shot 1
Scene:
Main Characters:
Narrative Caption:
Descriptive Caption:
Architecture of AnimeShooterGen
Overview of the model architecture of AnimeShooterGen. The two core components are an autoregressive backbone derived from a pretrained MLLM and a video generator initialized from a pretrained DiT. A Q-Former serves as the adapter that connects the two. This framework generates multi-shot video in an autoregressive manner.
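The autoregressive flow described above can be sketched schematically as a loop over shots. This is only an illustration of the control flow, not the actual implementation: `mllm`, `adapter`, and `generator` are hypothetical callables standing in for the MLLM backbone, the Q-Former, and the DiT video generator, and passing the shot caption to the MLLM is an assumption.

```python
def generate_story(reference, captions, mllm, adapter, generator):
    """Generate one shot per caption, conditioning on all earlier shots.

    mllm(reference, previous_shots, caption) -> context-aware representation
    adapter(representation)                  -> condition for the generator
    generator(condition)                     -> the next shot
    All three callables are placeholders supplied by the caller.
    """
    shots = []
    for caption in captions:
        # The backbone sees the reference image and every shot so far.
        representation = mllm(reference, list(shots), caption)
        # The adapter maps the representation into the generator's condition space.
        condition = adapter(representation)
        # The decoded shot becomes context for the following iterations.
        shots.append(generator(condition))
    return shots
```

The key property is that each iteration receives the shots produced before it, which is what allows later shots to stay consistent with both the reference and the preceding context.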
Investigating the Impact of Reference Images
Visualization of using different reference images in the MLLM (before LoRA enhancement). In the first row, an empty image serves as the reference, representing the case without reference control. Shared captions:
- Shot-1: The man walks down a cobblestone street lined with blooming cherry trees, holding a vintage leather journal under his arm.
- Shot-2: He pauses at a flower shop, steps inside, and begins carefully selecting flowers.
- Shot-3: At the counter, he wraps the bouquet in paper.
- Shot-4: He tucks the flowers into his bicycle basket and pedals away past pastel-colored storefronts.
AnimeShooterGen in Reference-Guided Multi-Shot Storytelling
This section presents visual results from AnimeShooterGen, demonstrating its capabilities in reference-guided multi-shot storytelling. To enhance the immersive quality, AnimeShooterGen is currently integrated with a zero-shot Text-to-Audio model, TangoFlux, to generate accompanying audio tracks. Future work will explore audio-visual synchronization and co-generation.
We show three different stories of a young girl with the same reference image:
Shot Captions:
Shot-1: A young girl wears mechanic's goggles, winding a broken clock tower gear in an abandoned train station.
Shot-2: She stands on a suspended railway track, using hairpins to repair the gears stained with coal dust.
Shot-3: She sits in a rooftop garden overgrown with ivy.
Shot-4: She finds a broken car and tries to fix it.

Shot Captions:
Shot-1: A young girl stands at the edge of a frozen lake.
Shot-2: She walks across the icy surface, slipping slightly.
Shot-3: She kneels on the ice, looking down.
Shot-4: She finds a cozy little cabin glowing warmly in the snow.
Shot Captions:
Shot-1: A young girl kneels by a dried-up village well, carefully folding origami boats from old paper.
Shot-2: She walks along a forest creek, stepping over roots that twist like fossilized snakes.
Shot-3: In the forest, she stands in front of a stone door covered in lichen, looking at it curiously. An involuntary 'Oh!' escaped her.
Shot-4: She steps into a cave, revealing a secret place full of gold and treasures.
More demos with different characters (a yellow dog, a wolf, and a witch):
Shot Captions:
Shot-1: A yellow dog lies down under a large tree at the edge of the field.
Shot-2: It walks along a narrow dirt path surrounded by wildflowers.
Shot-3: It naps in a patch of sunlight near a flowerbed.
Shot-4: It wanders by a small river with lots of trees around it.

Shot Captions:
Shot-1: A wolf stands in a cluttered workshop, organizing tools on a table.
Shot-2: He sits in an office, staring at a flickering computer screen.
Shot-3: He walks through a quiet park, nodding at empty benches.
Shot-4: He enters a small café, operating a café machine.

Shot Captions:
Shot-1: A witch stands in a moonlit forest clearing, holding a staff.
Shot-2: She stops where the forest ended. Before her rose a massive stone castle.
Shot-3: She sits by a stone fireplace. Flames burst to life.
Shot-4: She stands on a tower's roof, with snowy mountains behind her.

Citation
@misc{qiu2025animeshooter,
title = {AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation},
author = {Qiu, Lu and Li, Yizhuo and Ge, Yuying and Ge, Yixiao and Shan, Ying and Liu, Xihui},
year = {2025},
url = {https://arxiv.org/abs/2506.03126}
}