	---
	license: apache-2.0
	language:
	- en
	base_model:
	- mistralai/Mistral-7B-Instruct-v0.2
	tags:
	- video temporal grounding
	- dense video caption
	- video highlight detection
	---

	<h2 align="center"> <a href="https://arxiv.org/abs/2410.05643">TRACE: Temporal Grounding Video LLM via Causal Event Modeling</a></h2>
	<h5 align="center"> If our project helps you, please give us a star ⭐ on <a href="https://github.com/gyxxyg/TRACE">GitHub</a> and cite our paper!</h5>

	## 📰 News

	- [2024.11.01] 🔥 We are excited to announce the release of [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni), which has been enhanced by incorporating additional general video understanding data from a subset of [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). Our results indicate that trace-uni outperforms trace in both VTG tasks and general video understanding tasks.
	- [2024.10.19] 🔥 We release [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval), which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
	- [2024.10.10] 🔥 Our [code](https://github.com/gyxxyg/TRACE) and [paper](https://arxiv.org/abs/2410.05643) are released!
	- [2024.10.10] 🔥 Our checkpoints are available now!
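The trace-retrieval entry above describes aligning predicted timestamps with the input frame timestamps. One possible reading of that idea, shown purely as an illustration, is snapping each predicted time to the nearest sampled frame time. The function name `snap_to_frame_timestamps` and the nearest-neighbour rule are assumptions made for this sketch; they are not taken from the TRACE code, which may enforce the alignment differently (for example, during training or decoding).

```python
from bisect import bisect_left


def snap_to_frame_timestamps(predicted, frame_times):
    """Snap each predicted time (seconds) to the nearest sampled frame time.

    Hypothetical helper for illustration only; not the TRACE implementation.
    Ties are resolved in favour of the earlier frame.
    """
    if not frame_times:
        raise ValueError("frame_times must be non-empty")
    frames = sorted(frame_times)
    snapped = []
    for t in predicted:
        i = bisect_left(frames, t)
        # Only the neighbours on either side of t can be the nearest frame.
        candidates = frames[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda f: abs(f - t)))
    return snapped


frame_times = [0.0, 2.0, 4.0, 6.0]   # assumed sampling: one frame every 2 s
predicted = [0.7, 3.2, 9.0]
print(snap_to_frame_timestamps(predicted, frame_times))
```

With this rule, a prediction can never fall between two sampled frames, which is the property the release note refers to.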

	## Overview

	In this work:
	- We model videos as a series of events and propose a causal event modeling framework to capture their inherent structure.
	- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
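To make the event-based view concrete, the sketch below represents a video as a list of events and flattens them into an interleaved sequence of timestamps, salient scores, and captions. The `Event` class, its field names, and the `interleave` function are hypothetical and exist only for illustration. The per-event ordering (time, then score, then caption) simply follows the order listed above and is an assumption about the layout, not a description of the actual TRACE tokenization.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One video event (hypothetical structure, for illustration only)."""
    start: float      # start time in seconds
    end: float        # end time in seconds
    salience: float   # salient score of the event
    caption: str      # textual description of the event


def interleave(events):
    """Flatten events into (task, value) pairs, ordered by start time.

    Each event contributes time -> score -> caption, so later events are
    emitted after (and can be conditioned on) earlier ones.
    """
    sequence = []
    for ev in sorted(events, key=lambda e: e.start):
        sequence.append(("time", (ev.start, ev.end)))
        sequence.append(("score", ev.salience))
        sequence.append(("caption", ev.caption))
    return sequence


events = [
    Event(12.0, 20.5, 3.5, "add the chopped onions to the pan"),
    Event(0.0, 11.0, 2.0, "chop the onions"),
]
for task, value in interleave(events):
    print(task, value)
```

Ordering by start time reflects the causal aspect: each event is emitted only after the events that precede it.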

	## Model Zoo

	| Checkpoints | Description | URL |
	| ----------- | ----------- | ----------- |
	| Initialization | Weights initialized from VideoLLaMA2 | [trace-init](https://huggingface.co/Yongxin-Guo/trace-init) |
	| Stage-1 | Model checkpoints trained after stage-1 | [trace-stage1](https://huggingface.co/Yongxin-Guo/trace-stage1) |
	| Stage-2 | Model checkpoints trained after stage-2 | [trace](https://huggingface.co/Yongxin-Guo/trace) |
	| FT-Charades | Fine-tuned on Charades-STA dataset | [trace-ft-charades](https://huggingface.co/Yongxin-Guo/trace-ft-charades) |
	| FT-Youcook2 | Fine-tuned on Youcook2 dataset | [trace-ft-youcook2](https://huggingface.co/Yongxin-Guo/trace-ft-youcook2) |
	| FT-QVHighlights | Fine-tuned on QVHighlights dataset | [trace-ft-qvhighlights](https://huggingface.co/Yongxin-Guo/trace-ft-qvhighlights) |
	| TRACE-retrieval | Forces the predicted timestamps to align with the input frame timestamps | [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval) |
	| TRACE-uni | Incorporates additional general video understanding data from a subset of [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni) |
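As a convenience, the checkpoints listed above can be fetched with the `huggingface_hub` library. The snippet below is a minimal sketch: the `CHECKPOINTS` dictionary and the `repo_id` and `download` helpers are names introduced here for illustration, and the repository IDs are read off the URLs in the table. It only downloads files; loading the weights is assumed to require the model code from the GitHub repository.

```python
# Repository IDs taken from the Model Zoo URLs above.
CHECKPOINTS = {
    "Initialization": "Yongxin-Guo/trace-init",
    "Stage-1": "Yongxin-Guo/trace-stage1",
    "Stage-2": "Yongxin-Guo/trace",
    "FT-Charades": "Yongxin-Guo/trace-ft-charades",
    "FT-Youcook2": "Yongxin-Guo/trace-ft-youcook2",
    "FT-QVHighlights": "Yongxin-Guo/trace-ft-qvhighlights",
    "TRACE-retrieval": "Yongxin-Guo/trace-retrieval",
    "TRACE-uni": "Yongxin-Guo/trace-uni",
}


def repo_id(name):
    """Return the Hugging Face repo ID for a Model Zoo checkpoint name."""
    try:
        return CHECKPOINTS[name]
    except KeyError:
        known = ", ".join(CHECKPOINTS)
        raise ValueError(f"unknown checkpoint {name!r}; expected one of: {known}")


def download(name, local_dir=None):
    """Download all files of a checkpoint and return the local path.

    Requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import snapshot_download  # imported lazily
    return snapshot_download(repo_id=repo_id(name), local_dir=local_dir)


if __name__ == "__main__":
    print(repo_id("TRACE-uni"))
    # path = download("TRACE-uni")  # uncomment to actually fetch the files
```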

	#### Results

	| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
	| --- | --- | --- | --- | --- |
	| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
	| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
	| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |

	| Charades-STA (Zero-Shot) | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
	| --- | --- | --- | --- | --- |
	| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
	| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
	| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |
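For readers unfamiliar with the moment retrieval metrics, the sketch below shows how temporal IoU, recall at IoU thresholds, and mIoU are commonly computed. It assumes the 0.3, 0.5, and 0.7 columns report the percentage of queries whose top-1 predicted segment reaches that IoU with the ground truth. The function names are hypothetical, and this is not the evaluation script used to produce the numbers in the tables.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_and_miou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return ({threshold: recall %}, mIoU %) over paired segments.

    preds[i] is the top-1 predicted segment for query i and gts[i] is its
    ground-truth segment.
    """
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equally long")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    recall = {t: 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds}
    miou = 100.0 * sum(ious) / n
    return recall, miou


preds = [(0.0, 10.0), (5.0, 15.0), (0.0, 4.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (2.0, 6.0)]
recall, miou = recall_and_miou(preds, gts)
print(recall, round(miou, 1))
```

In this toy example the three IoUs are 1, 1/3, and 1/3, so every query clears the 0.3 threshold but only one of three clears 0.5 and 0.7.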

	| QVHighlights (Zero-Shot) | mAP | Hit@1 |
	| --- | --- | --- |
	| TRACE | 26.8 | 42.7 |
	| TRACE-retrieval | 27.9 | 44.3 |
	| TRACE-uni | 27.5 | 43.9 |


	| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
	| --- | --- | --- | --- | --- |
	| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
	| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
	| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |

	| ActivityNet-MR | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
	| --- | --- | --- | --- | --- |
	| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
	| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
	| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |

	| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
	| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |


	| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
	| --- | --- | --- | --- | --- |
	| TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
	| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |

	#### Bibliography
	If you find this repository helpful for your project, please consider citing:
	```
	@misc{guo2024tracetemporalgroundingvideo,
	title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling},
	author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
	year={2024},
	eprint={2410.05643},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2410.05643},
	}
	```
