ShareCaptioner-Video Model Card
Model details
Model type: ShareCaptioner-Video is an open-source captioner fine-tuned on GPT4V-assisted ShareGPT4Video detailed caption data. It supports videos of various durations, aspect ratios, and resolutions. ShareCaptioner-Video is based on the InternLM-Xcomposer2-4KHD model.
ShareCaptioner-Video features 4 roles:
- Fast Captioning: The model employs an image-grid format for direct video captioning, providing rapid generation speeds that are ideal for short videos. In practice, we concatenate all the keyframes of a video into a vertically elongated image and train the model on a caption task.
- Sliding Captioning: The model supports streaming captioning in a differential sliding-window format, yielding high-quality captions that are suitable for long videos. We take the two adjacent keyframes alongside the previous differential caption as input, and train the model to describe the events occurring between them.
- Clip Summarizing: The model can swiftly summarize any clip from ShareGPT4Video or videos that have undergone the differential sliding-window captioning process, eliminating the need to re-process frames. We use all the differential descriptions as input, and the output is the video caption.
- Prompt Re-Captioning: The model can rephrase prompts written by users who target specific video generation areas, so that text-to-video models (T2VMs) trained on high-quality video-caption data receive prompts at inference time in the same format they saw during training. In practice, we use GPT-4 to generate Sora-style prompts for our dense captions, and we train the re-captioning task in reverse, i.e., by using the generated prompt as input and the dense caption as the training target.
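The data flow of the first three roles can be sketched in plain Python. This is only an illustration under stated assumptions, not the released inference code: frames are modeled as nested lists of pixel values, and `describe_pair` and `summarize` are hypothetical callables standing in for calls to the model. The empty string passed as the previous caption for the first frame pair is also an assumption, since the card does not say how the first window is handled.

```python
from typing import Callable, List, Sequence

# A frame is modeled as a list of pixel rows; a real pipeline would use image arrays.
Frame = List[List[int]]


def stack_frames_vertically(frames: Sequence[Frame]) -> Frame:
    """Fast Captioning input: concatenate keyframes top to bottom into one tall image."""
    if not frames:
        raise ValueError("need at least one keyframe")
    width = len(frames[0][0])
    grid: Frame = []
    for frame in frames:
        if any(len(row) != width for row in frame):
            raise ValueError("all keyframes must share the same width")
        grid.extend(row[:] for row in frame)  # copy rows so the inputs stay untouched
    return grid


def sliding_captions(
    frames: Sequence[Frame],
    describe_pair: Callable[[Frame, Frame, str], str],  # hypothetical model call
) -> List[str]:
    """Sliding Captioning: describe what happens between each adjacent keyframe pair,
    feeding the previous differential caption back in as context."""
    captions: List[str] = []
    previous = ""  # assumption: no prior caption exists for the first pair
    for first, second in zip(frames, frames[1:]):
        previous = describe_pair(first, second, previous)
        captions.append(previous)
    return captions


def summarize_clip(
    differential_captions: Sequence[str],
    summarize: Callable[[List[str]], str],  # hypothetical model call
) -> str:
    """Clip Summarizing: only the differential captions go in; no frames are re-processed."""
    return summarize(list(differential_captions))
```

A video with N keyframes yields N - 1 differential captions in the sliding mode, and any contiguous slice of those captions can be passed to `summarize_clip` to caption a sub-clip without touching the frames again.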
Model date: ShareCaptioner-Video was trained in May 2024.
Paper or resources for more information: [Project] [Paper] [Code]
Intended use
Primary intended uses: The primary use of ShareCaptioner-Video is producing high-quality video captions.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
Finetuning dataset
- 40K GPT4V-generated video-caption pairs
- 40K differential sliding-window captioning conversations
- 40K prompt-to-caption text samples
Paper
arxiv.org/abs/2406.04325