File size: 11,275 Bytes
ac550ec
cd32d7b
 
 
51af3eb
 
 
ac550ec
cd32d7b
 
51af3eb
ac550ec
cd32d7b
51af3eb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cd32d7b
 
6be402c
cd32d7b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4f1a196
cd32d7b
 
 
da66119
cd32d7b
 
 
7448d65
cd32d7b
 
 
 
 
7448d65
cd32d7b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7448d65
cd32d7b
 
 
 
 
 
 
 
 
 
 
 
 
7448d65
da66119
cd32d7b
 
 
 
 
 
 
 
 
da66119
7448d65
da66119
 
 
 
 
 
 
7448d65
da66119
 
 
 
 
 
 
 
 
 
 
cd32d7b
7448d65
da66119
311229b
cd32d7b
 
 
7448d65
cd32d7b
 
 
 
7448d65
cd32d7b
7448d65
cd32d7b
da66119
7448d65
da66119
cd32d7b
311229b
7448d65
 
 
ac4aaf5
4f1a196
 
 
 
 
da66119
4f1a196
 
 
 
 
 
 
da66119
cd32d7b
 
963e121
 
d503134
 
 
 
 
 
963e121
 
51af3eb
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
language: en
tags:
- tvp
- intel
- cvpr
- charades
license: other
datasets:
- charades
library_name: transformers
---

# TVP base model

| Model Detail | Description |
| ----------- | ----------- | 
| Model Authors | Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding | 
| Date | 2023 | 
| Version | Base | 
| Type | Text-Visual Prompting for Temporal Video Grounding | 
| Paper or Other Resources | Base model: [mosaicml/mpt-7b](https://huggingface.co/mosaicml/mpt-7b); Dataset: [Charades](https://prior.allenai.org/projects/charades) | 
| License | Other |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/tvp-base/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ)|

| Intended Use | Description |
| ----------- | ----------- | 
| Primary intended uses | The TVP model is designed for temporal video grounding (TVG), specifically to predict the start and end times of moments described by a text sentence within a long, untrimmed video. | 
| Primary intended users | Researchers and developers working in the field of computer vision, particularly those focused on video understanding and cross-modal (text and video) tasks. | 
| Out-of-scope uses | The model is not intended for real-time video processing or applications requiring 3D visual features extraction due to its design for efficiency with 2D features.  |


# Factors
Relevant factors: The model's performance may vary across different video content, such as variations in video quality, lighting conditions, or genres (e.g., action vs. dialogue-heavy scenes).
Evaluation factors: Performance has been evaluated on benchmark datasets like Charades-STA and ActivityNet Captions, focusing on metrics relevant to temporal video grounding accuracy. 

# Metrics

Model performance measures: The model employs metrics such as the Temporal-Distance IoU (TDIoU) loss for efficient learning and performance evaluation in TVG tasks.

Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.

# Training Data

The TVP model was pretrained on public datasets such as Charades.

Charades is dataset composed of 9848 videos of daily indoors activities collected through Amazon Mechanical Turk. 267 different users were presented with a sentence, that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (like in a game of Charades). The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. This work was presented at ECCV2016.

Each video has been exhaustively annotated using consensus from 4 workers on the training set, and from 8 workers on the test set. Please refer to the updated accompanying publication for details. Please contact [email protected] for questions about the dataset.

# Quantitative Analyses

Unitary results: Refer to Table 2 in the provided paper for TVP's performance on the Temporal Video Grounding task.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e1cfa7f9927d9455acdc72/WOeve3VDZU2WvoXfvoK5X.png)


# TVP base model

The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems. It was introduced in
[this paper](https://arxiv.org/pdf/2303.04995.pdf).

TVP got accepted to [CVPR'23](https://cvpr2023.thecvf.com/) conference.

## Model description

The abstract from the paper is the following:
In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.

## Intended uses & limitations(TODO)

You can use the raw model for temporal video grounding.

### How to use

Here is how to use this model to get the logits of a given video and text in PyTorch:
```python
import av
import cv2
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, TvpForVideoGrounding


def pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    '''
    Convert the video from its original fps to the target_fps and decode the video with PyAV decoder.
    Returns:
        frames (tensor): decoded frames from the video. Return None if the no
            video stream was found.
        fps (float): the number of frames per second of the video.
    '''
    fps = float(container.streams.video[0].average_rate)
    clip_size = sampling_rate * num_frames / target_fps * fps
    delta = max(container.streams.video[0].frames - clip_size, 0)
    start_idx = delta * clip_idx / num_clips
    end_idx = start_idx + clip_size - 1
    timebase = container.streams.video[0].duration / container.streams.video[0].frames
    video_start_pts = int(start_idx * timebase)
    video_end_pts = int(end_idx * timebase)
    stream_name = {"video": 0}
    seek_offset = max(video_start_pts - 1024, 0)
    container.seek(seek_offset, any_frame=False, backward=True, stream=container.streams.video[0])
    frames = {}
    for frame in container.decode(**stream_name):
        if frame.pts < video_start_pts:
            continue
        if frame.pts <= video_end_pts:
            frames[frame.pts] = frame
        else:
            frames[frame.pts] = frame
            break
    frames = [frames[pts] for pts in sorted(frames)]
    return frames, fps


def decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    '''
    Decode the video and perform temporal sampling.
    Args:
        container (container): pyav container.
        sampling_rate (int): frame sampling rate (interval between two sampled frames).
        num_frames (int): number of frames to sample.
        clip_idx (int): if clip_idx is -1, perform random temporal sampling.
            If clip_idx is larger than -1, uniformly split the video to num_clips
            clips, and select the clip_idx-th video clip.
        num_clips (int): overall number of clips to uniformly sample from the given video.
        target_fps (int): the input video may have different fps, convert it to
            the target video fps before frame sampling.
    Returns:
        frames (tensor): decoded frames from the video.
    '''
    assert clip_idx >= -2, "Not a valied clip_idx {}".format(clip_idx)
    frames, fps = pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps)
    clip_size = sampling_rate * num_frames / target_fps * fps
    index = torch.linspace(0, clip_size - 1, num_frames)
    index = torch.clamp(index, 0, len(frames) - 1).long().tolist()
    frames = [frames[idx] for idx in index]
    frames = [frame.to_rgb().to_ndarray() for frame in frames]
    frames = torch.from_numpy(np.stack(frames))
    return frames

def get_resize_size(image, max_size):
    '''
    Args:
        image: np.ndarray
        max_size: The max size of height and width
    Returns:
        (height, width)
    Note the height/width order difference >>> pil_img = Image.open("raw_img_tensor.jpg") >>> pil_img.size (640,
    480) # (width, height) >>> np_img = np.array(pil_img) >>> np_img.shape (480, 640, 3) # (height, width, 3)
    '''
    height, width = image.shape[-2:]
    if height >= width:
        ratio = width * 1.0 / height
        new_height = max_size
        new_width = new_height * ratio
    else:
        ratio = height * 1.0 / width
        new_width = max_size
        new_height = new_width * ratio
    size = {"height": int(new_height), "width": int(new_width)}
    return size

file = hf_hub_download(repo_id="Intel/tvp_demo", filename="AK2KG.mp4", repo_type="dataset")
model = TvpForVideoGrounding.from_pretrained("Intel/tvp-base")

decoder_kwargs = dict(
    container=av.open(file, metadata_errors="ignore"),
    sampling_rate=1,
    num_frames=model.config.num_frames,
    clip_idx=0,
    num_clips=1,
    target_fps=3,
)
raw_sampled_frms = decode(**decoder_kwargs).permute(0, 3, 1, 2)

text = "a person is sitting on a bed."
processor = AutoProcessor.from_pretrained("Intel/tvp-base")
size = get_resize_size(raw_sampled_frms, model.config.max_img_size)
model_inputs = processor(
    text=[text], videos=list(raw_sampled_frms.numpy()), return_tensors="pt", max_text_length=100, size=size
)

model_inputs["pixel_values"] = model_inputs["pixel_values"].to(model.dtype)
model_inputs["labels"] = torch.tensor([18.1, 0.0, 6.8])
output = model(**model_inputs)
print(f"The model's output is {output}")

def get_video_duration(filename):
    cap = cv2.VideoCapture(filename)
    if cap.isOpened():
        rate = cap.get(5)
        frame_num = cap.get(7)
        duration = frame_num/rate
        return duration
    return -1

duration = get_video_duration(file)
timestamp = output['logits'].tolist()
start, end = round(timestamp[0][0]*duration, 1), round(timestamp[0][1]*duration, 1)
print(f"The time slot of the video corresponding to the text \"{text}\" is from {start}s to {end}s")
```

### BibTeX entry and citation info
```bibtex
@inproceedings{zhang2023text,
  title={Text-visual prompting for efficient 2d temporal video grounding},
  author={Zhang, Yimeng and Chen, Xin and Jia, Jinghan and Liu, Sijia and Ding, Ke},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14794--14804},
  year={2023}
}
```

Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please cosult an attorney before using this model for commercial purposes.