How to process video in a data loader
We assume the video has been preprocessed into image files in advance. Usually we do not use all frames in a clip but sample a certain duration (e.g. 16 frames). The pipeline we assume for each clip is the following.
- Get a list of image paths for a clip, e.g. ["./video/clip1/frame0.jpg", ..., "./video/clip1/frame101.jpg"]
- Sample the duration we want to use, e.g. ["./video/clip1/frame11.jpg", ..., "./video/clip1/frame26.jpg"]
- Load each frame into a tensor shaped (T, H, W, C). H and W can be changed later.
- Use torchvision's built-in video transforms to crop, flip, etc. For example,
- ToTensorVideo(): from (T, H, W, C) to (C, T, H, W), from 0-255 to 0-1 (divide by 255), and from uint8 to float.
- CenterCropVideo
- RandomHorizontalFlipVideo
- NormalizeVideo with Kinetics mean and std. See more: https://github.com/pytorch/vision/blob/f0d3daa7f65bcde560e242d9bccc284721368f02/torchvision/transforms/transforms_video.py
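The last two steps can be sketched with plain tensor ops. This is a minimal illustration, not torchvision's actual implementation: the function names are made up here, and the Kinetics mean/std values are the ones commonly used in torchvision's video references.

```python
import torch

# Kinetics-400 per-channel statistics (values assumed from the
# torchvision video references; verify against your torchvision version).
KINETICS_MEAN = (0.43216, 0.394666, 0.37645)
KINETICS_STD = (0.22803, 0.22145, 0.216989)

def to_video_tensor(frames: torch.Tensor) -> torch.Tensor:
    """Mimic ToTensorVideo: (T, H, W, C) uint8 in [0, 255]
    -> (C, T, H, W) float in [0, 1]."""
    return frames.permute(3, 0, 1, 2).float() / 255.0

def normalize_video(video: torch.Tensor) -> torch.Tensor:
    """Mimic NormalizeVideo: per-channel (x - mean) / std on (C, T, H, W)."""
    mean = torch.tensor(KINETICS_MEAN).view(-1, 1, 1, 1)
    std = torch.tensor(KINETICS_STD).view(-1, 1, 1, 1)
    return (video - mean) / std

# Example: a fake 16-frame clip of 112x112 RGB frames.
clip = torch.randint(0, 256, (16, 112, 112, 3), dtype=torch.uint8)
video = normalize_video(to_video_tensor(clip))
print(video.shape)  # torch.Size([3, 16, 112, 112])
```

In practice you would use the torchvision transforms linked below instead; the sketch only shows what they do to the tensor.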
Note that the first part differs from what the official PyTorch repository ( https://github.com/pytorch/vision/tree/master/references/video_classification ) does; we do not use the VideoClips class.
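The whole pipeline above can be assembled into a Dataset along these lines. This is a hedged sketch: the directory layout (one folder of .jpg frames per clip), the class and function names, and the centered-window eval sampling are all assumptions, not part of any torchvision API.

```python
import os
import random
from typing import List

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

def sample_window(paths: List[str], num_frames: int, train: bool = True) -> List[str]:
    """Pick a contiguous run of num_frames paths: a random start when
    training, a centered window otherwise."""
    if len(paths) < num_frames:
        raise ValueError("clip is shorter than the requested duration")
    max_start = len(paths) - num_frames
    start = random.randint(0, max_start) if train else max_start // 2
    return paths[start:start + num_frames]

class FrameFolderDataset(Dataset):
    """Each clip is a directory of frame images (layout is an assumption)."""

    def __init__(self, clip_dirs: List[str], num_frames: int = 16, transform=None):
        self.clip_dirs = clip_dirs
        self.num_frames = num_frames
        self.transform = transform

    def __len__(self):
        return len(self.clip_dirs)

    def __getitem__(self, idx):
        clip_dir = self.clip_dirs[idx]
        # Note: with names like frame0.jpg ... frame101.jpg, plain sorted()
        # is lexicographic; use zero-padded names or a numeric sort key.
        paths = sorted(
            os.path.join(clip_dir, f)
            for f in os.listdir(clip_dir)
            if f.endswith(".jpg")
        )
        paths = sample_window(paths, self.num_frames)
        # Stack frames into a (T, H, W, C) uint8 tensor.
        frames = torch.stack([
            torch.from_numpy(np.array(Image.open(p).convert("RGB")))
            for p in paths
        ])
        if self.transform is not None:
            frames = self.transform(frames)
        return frames
```

The transform argument is where the ToTensorVideo / crop / flip / NormalizeVideo chain from the list above would plug in, e.g. via torchvision.transforms.Compose.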