AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Lu Qiu1, Yizhuo Li1, Yuying Ge†,2, Yixiao Ge2, Ying Shan2, Xihui Liu†,1
1The University of Hong Kong 2ARC Lab, Tencent PCG

Abstract

AnimeShooter Teaser

Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by an MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlights the value of our dataset for coherent animated video generation.

Dataset Structure

The dataset can be downloaded from Hugging Face.

We currently provide:

  • video_ids.txt: A list of video IDs for downloading source videos with yt-dlp.
  • dataset_anime_shooter.zip: The complete annotations in JSON format, including reference image masks.
  • dataset_anime_shooter_audio.zip: The subset with audio annotations.
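
As a rough sketch of how the ID list might be used, the snippet below turns video_ids.txt into yt-dlp command lines. It assumes the file holds one YouTube ID per line and that the standard watch URL form applies; the helper names (read_video_ids, build_download_command, download_all) and the output directory are illustrative and not part of the dataset release.

```python
import subprocess
from pathlib import Path

# Assumption: every ID in video_ids.txt resolves through the standard watch URL.
YOUTUBE_WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


def read_video_ids(path):
    """Return the non-empty, stripped lines of an ID list (assumed: one ID per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_download_command(video_id, out_dir="videos"):
    """Build a yt-dlp command that saves the video as <out_dir>/<id>.<ext>."""
    url = YOUTUBE_WATCH_URL.format(video_id=video_id)
    # -o sets yt-dlp's output template; %(id)s and %(ext)s are template fields.
    return ["yt-dlp", "-o", f"{out_dir}/%(id)s.%(ext)s", url]


def download_all(ids_path="video_ids.txt", out_dir="videos"):
    """Run yt-dlp once per ID; requires yt-dlp to be installed and on PATH."""
    for video_id in read_video_ids(ids_path):
        # check=False: keep going if a single video is no longer available.
        subprocess.run(build_download_command(video_id, out_dir), check=False)
```

Calling download_all() from the directory that contains video_ids.txt would then fetch the source videos one by one. Naming files by video ID is a design choice that keeps each download easy to match to its annotation file.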

Dataset Fields

The dataset_anime_shooter.zip archive is structured as a collection of JSON files, where each file corresponds to a single video. Each JSON file contains the following fields:

Video Level

video ID: string - Unique YouTube identifier.
url: string - Direct YouTube link.
fps: float - Frame rate of the original video, used for temporal alignment.
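
To illustrate why fps is stored, here is a minimal sketch that maps a timestamp in seconds to a frame index and back. The record literal and its key names mirror the field labels above but are assumptions about the actual JSON layout, and the ID and fps value are placeholders.

```python
import math


def time_to_frame(t_seconds, fps):
    """Index of the frame on screen at time t_seconds (frame 0 starts at t = 0)."""
    return math.floor(t_seconds * fps)


def frame_to_time(frame_index, fps):
    """Start time, in seconds, of the given frame."""
    return frame_index / fps


# Hypothetical video-level record; keys and values are placeholders.
record = {
    "video ID": "EXAMPLE_ID",
    "url": "https://www.youtube.com/watch?v=EXAMPLE_ID",
    "fps": 24.0,
}

frame = time_to_frame(2.5, record["fps"])
print(frame, frame_to_time(frame, record["fps"]))
```

Using the original frame rate, rather than a fixed nominal one, keeps time-based annotations aligned with the frames of the downloaded video.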

Dataset Sample

We show an example of AnimeShooter telling the story of Luna and her father. The annotation script contains the story-level annotation for the story overview and detailed shot-level annotations for each shot.
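
The two annotation levels can be pictured as nested records. The sketch below is one possible in-memory representation, assuming the fields named in the abstract; the class names, attribute names, and all string values are illustrative placeholders, not the dataset's actual schema or content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharacterProfile:
    name: str
    reference_image: str  # assumed to be a path or ID pointing at the reference image


@dataclass
class ShotAnnotation:
    scene: str
    characters: List[str]
    narrative_caption: str
    descriptive_caption: str


@dataclass
class StoryAnnotation:
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)

    def shots_with(self, character_name):
        """Indices of the shots in which the named character appears."""
        return [i for i, shot in enumerate(self.shots) if character_name in shot.characters]


# Placeholder content only; the real annotation text is not reproduced here.
story = StoryAnnotation(
    storyline="<storyline text>",
    key_scenes=["<scene A>", "<scene B>"],
    main_characters=[
        CharacterProfile("Luna", "<luna_reference_image>"),
        CharacterProfile("Father", "<father_reference_image>"),
    ],
    shots=[
        ShotAnnotation("<scene A>", ["Luna"], "<narrative>", "<description>"),
        ShotAnnotation("<scene A>", ["Luna", "Father"], "<narrative>", "<description>"),
        ShotAnnotation("<scene B>", ["Father"], "<narrative>", "<description>"),
    ],
)
print(story.shots_with("Luna"))
```

Keeping character profiles at the story level and only character names at the shot level means each reference image is stored once and shared by every shot that features the character.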

Shot-level Annotation for Shot 1

Fields: Scene, Main Characters, Narrative Caption, Descriptive Caption.

Architecture of AnimeShooterGen

Overview of the model architecture of AnimeShooterGen. The two core components are an autoregressive backbone derived from a pretrained MLLM and a video generator initialized from a pretrained DiT. To connect these two components, we add a Q-Former as the adapter. This framework can generate multi-shot videos in an autoregressive manner.
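
The control flow described above can be sketched schematically as follows. This is not the authors' implementation: the three callables merely stand in for the MLLM backbone, the Q-Former adapter, and the DiT video generator, their signatures are invented for illustration, and passing the shot caption to the MLLM is an assumption.

```python
def generate_multi_shot(reference_image, captions, mllm, adapter, video_generator):
    """Generate shots one at a time, conditioning each on the reference and earlier shots."""
    shots = []
    for caption in captions:
        # The backbone sees the reference image plus all previously generated shots.
        context_repr = mllm(reference_image, list(shots), caption)
        # The adapter maps the backbone output into the generator's conditioning space.
        condition = adapter(context_repr)
        # The generator decodes the next shot from that condition.
        shots.append(video_generator(condition))
    return shots


# Toy stand-ins so the loop runs end to end; they only pass strings around.
def toy_mllm(reference_image, previous_shots, caption):
    return {"ref": reference_image, "n_context": len(previous_shots), "caption": caption}


def toy_adapter(context_repr):
    return f"{context_repr['ref']}|ctx={context_repr['n_context']}|{context_repr['caption']}"


def toy_video_generator(condition):
    return f"video<{condition}>"


shots = generate_multi_shot(
    "ref.png", ["caption 1", "caption 2"], toy_mllm, toy_adapter, toy_video_generator
)
print(shots)
```

The point of the sketch is the data dependency: shot k cannot be generated until shots 1 to k-1 exist, because they form part of the context for the backbone.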

Investigating the Impact of Reference Images

Visualization of the effect of feeding different reference images to the MLLM (before LoRA enhancement). In the first row, we use an empty image as the reference, representing the case without reference control. Shared captions:

  • Shot-1: The man walks down a cobblestone street lined with blooming cherry trees, holding a vintage leather journal under his arm.
  • Shot-2: He pauses at a flower shop, steps inside, and begins carefully selecting flowers.
  • Shot-3: At the counter, he wraps the bouquet in paper.
  • Shot-4: He tucks the flowers into his bicycle basket and pedals away past pastel-colored storefronts.
Reference Effect

AnimeShooterGen in Reference-Guided Multi-Shot Storytelling

This section presents visual results from AnimeShooterGen, demonstrating its capabilities in reference-guided multi-shot storytelling. To enhance the immersive quality, AnimeShooterGen is currently integrated with a zero-shot Text-to-Audio model, TangoFlux, to generate accompanying audio tracks. Future work will explore audio-visual synchronization and co-generation.

We show two different stories of a young girl with the same reference image:

Shot Captions:

Shot-1: A young girl stands at the edge of a frozen lake.
Shot-2: She walks across the icy surface, slipping slightly.
Shot-3: She kneels on the ice, looking down.
Shot-4: She finds a cozy little cabin glowing warmly in the snow.

Girl Winter Reference

Shot Captions:

Shot-1: A young girl kneels by a dried-up village well, carefully folding origami boats from old paper.
Shot-2: She walks along a forest creek, stepping over roots that twist like fossilized snakes.
Shot-3: In the forest, she stands in front of a stone door covered in lichen, looking at it curiously. An involuntary 'Oh!' escaped her.
Shot-4: She steps into a cave, revealing a secret place full of gold and treasures.

Further demos feature two other characters, a yellow dog and a wolf:

Shot Captions:

Shot-1: A yellow dog lies down under a large tree at the edge of the field.
Shot-2: It walks along a narrow dirt path surrounded by wildflowers.
Shot-3: It naps in a patch of sunlight near a flowerbed.
Shot-4: It wanders by a small river with lots of trees around it.

Yellow Dog Reference

Shot Captions:

Shot-1: A wolf stands in a cluttered workshop, organizing tools on a table.
Shot-2: He sits in an office, staring at a flickering computer screen.
Shot-3: He walks through a quiet park, nodding at empty benches.
Shot-4: He enters a small café, operating a café machine.

Wolf Reference

Citation

@misc{qiu2025animeshooter,
    title  =  {AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation},
    author =  {Qiu, Lu and Li, Yizhuo and Ge, Yuying and Ge, Yixiao and Shan, Ying and Liu, Xihui},
    year   =  {2025},
    url    =  {https://arxiv.org/abs/2506.03126}
}