Workflow for fine-tuning ModelScope in anime style
Here is a brief description of my process for fine-tuning ModelScope in an anime style with Text-To-Video-Finetuning. Most of it may be basic, but I hope it is useful. There is no guarantee that everything written here is correct or will lead to good results!
Selection of training data
The goal of my training was to shift the model toward an overall anime style. Since only the art style needed to override ModelScope's existing content, I did not need a huge dataset: the total number of videos and images was only a few thousand. Most of the videos came from Tenor, where many clips are posted as GIFs and MP4s of a single short scene. It seems possible to automate collection through the Tenor API (a sketch is below).
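As a hedged sketch of that automation, the snippet below searches Tenor and downloads the MP4 renditions of the results. The v2 endpoint, the `media_filter` parameter, and the `results`/`media_formats` response layout are assumptions based on the public Tenor API, and `TENOR_API_KEY` is a placeholder; check the current Tenor documentation before relying on this.

```python
# Hypothetical sketch: bulk-download short clips from the Tenor v2 search API.
# The endpoint, parameters, and response layout are assumptions; verify against
# the current Tenor API documentation.
import os
import requests

TENOR_API_KEY = os.environ["TENOR_API_KEY"]  # assumed to be provided by the user

def download_clips(query: str, limit: int = 50, out_dir: str = "clips") -> None:
    os.makedirs(out_dir, exist_ok=True)
    resp = requests.get(
        "https://tenor.googleapis.com/v2/search",
        params={"q": query, "key": TENOR_API_KEY, "limit": limit, "media_filter": "mp4"},
        timeout=30,
    )
    resp.raise_for_status()
    for i, item in enumerate(resp.json().get("results", [])):
        mp4_url = item["media_formats"]["mp4"]["url"]
        clip = requests.get(mp4_url, timeout=60)
        with open(os.path.join(out_dir, f"{query}_{i:04d}.mp4"), "wb") as f:
            f.write(clip.content)

download_clips("anime hug", limit=20)
```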
I also used some videos with smooth, stable motion and videos of 3D models with toon shading. Short clips of a few seconds are sufficient, since we cannot train on long clips yet.
Notes on data collection
Blur and noise in the source material are learned along with everything else; this is especially noticeable when training at high resolution. Frame rate also has an effect: if you want to train smooth motion, you need smooth data. Scene switching matters as well; if it is not addressed, a character may suddenly transform mid-clip (one way to split clips at cuts is sketched below). When training on animation it is hard to capture fine detail from video sources alone, so I also trained on images, which can be generated with Stable Diffusion. The smaller the differences between frames, the less likely the training result is to be corrupted, so I avoided animations with very dynamic motion. It may also be better to avoid scenes with multiple contexts and choose scenes with a single simple action. I collected data while checking that common emotions and actions were covered.
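One possible way to handle scene switching, not part of the original workflow, is to split each source clip at detected cuts so that every training clip contains a single scene. The sketch below uses PySceneDetect; the detection threshold is an assumption to tune per source, and ffmpeg must be available for the splitting step.

```python
# Split source clips at detected hard cuts so each training clip is one scene.
# The threshold is an assumed starting point; tune it per source material.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def split_at_cuts(video_path: str) -> None:
    # Detect cuts by content change between consecutive frames.
    scene_list = detect(video_path, ContentDetector(threshold=27.0))
    # Write one clip per detected scene next to the input file (requires ffmpeg).
    split_video_ffmpeg(video_path, scene_list)

split_at_cuts("raw_clip.mp4")
```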
Correcting data before training
Fixing resolution, blurring, and noise
It is safest to use sources at a resolution equal to or higher than the training resolution, and the aspect ratio should also match the training settings. Trimming (cropping) can be done with ffmpeg. Incidentally, I tried padding to the target ratio with a single color instead of cropping, but it seemed to slow down training.
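As a minimal sketch of that ffmpeg step, the snippet below center-crops a clip to a square and rescales it to the training resolution by calling ffmpeg from Python. The 512x512 target matches animov512x and is otherwise an assumption; adjust it to your training settings.

```python
# Center-crop to a square and rescale to the training resolution with ffmpeg
# (assumes ffmpeg is on PATH; change `size` to match your config).
import subprocess

def crop_and_scale(src: str, dst: str, size: int = 512) -> None:
    # crop defaults to a centered crop when x/y are omitted; quotes protect the
    # commas inside min() from the filtergraph parser.
    vf = f"crop='min(iw,ih)':'min(iw,ih)',scale={size}:{size}"
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", vf, dst], check=True)

crop_and_scale("clip_raw.mp4", "clip_512.mp4")
```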
Converting small videos to larger sizes
I used this tool: https://github.com/k4yt3x/video2x. The recommended driver is Waifu2XCaffe, which is well suited to animation because it produces clear, sharp results and also reduces noise a little. If upscaling does not actually improve image quality along with resolution, it may be better not to force a higher resolution.
Number of frames
Since many animations are drawn with few frames, training results are prone to collapse: in addition to broken bodies, a character's appearance stops being consistent across frames. Less variation between frames seems to improve consistency, and if the variation between frames is too large you will not get a clean result (a rough check is sketched below). The following tool may be useful for frame interpolation: https://github.com/google-research/frame-interpolation.
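As a rough heuristic for spotting clips whose frame-to-frame variation is too large, the sketch below measures the mean absolute difference between consecutive grayscale frames with OpenCV. The cutoff value is an arbitrary assumption; calibrate it against clips you already know train well.

```python
# Flag clips whose average frame-to-frame change is too large to train cleanly.
# The cutoff of 20.0 is an assumed value; calibrate it on known-good clips.
import cv2
import numpy as np

def mean_frame_difference(video_path: str) -> float:
    cap = cv2.VideoCapture(video_path)
    prev, diffs = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diffs.append(float(np.mean(cv2.absdiff(gray, prev))))
        prev = gray
    cap.release()
    return float(np.mean(diffs)) if diffs else 0.0

if mean_frame_difference("clip_512.mp4") > 20.0:
    print("clip is probably too dynamic for clean training")
```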
Tagging
For anime, WaifuTagger extracts content with good accuracy, so I created a slightly modified version of it for video and used that for animov512x. That said, the BLIP2-Preprocessor can also extract general scene content well enough; it may be best to use the two together.
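The modified WaifuTagger script is not reproduced here. As a stand-in illustration of tagging a video by a single representative frame, the sketch below captions the middle frame of a clip with BLIP-2 via Hugging Face transformers; the model name, the middle-frame choice, and the GPU assumption are all my own and not part of the original workflow.

```python
# Caption one representative (middle) frame of a clip with BLIP-2 as a stand-in
# for a video-aware tagger. Model name and frame choice are assumptions.
import cv2
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_middle_frame(video_path: str) -> str:
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

print(caption_middle_frame("clip_512.mp4"))
```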
config.yaml settings
I am still not quite sure which settings are appropriate here.
config.yaml for animov512x
Evaluate training results
If any of the sample videos generated during training look bad, I search the caption JSON for that sample's prompt. With a training dataset of only a few thousand items you can usually find the source videos, which helps show where the problem lies. I deliberately tagged every training video with 'anime'. After training, comparing generations that use the anime tag in the positive prompt against ones that put it in the negative prompt (that is, comparing the fine-tuned style with output close to the original ModelScope) may help you improve the training.
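A small sketch of that search, assuming the caption JSON is a list of entries with `video` and `caption` keys; the schema is a guess, so adjust the keys to whatever your preprocessor actually writes.

```python
# Find the source clips behind a bad sample by searching the caption JSON for
# part of the prompt. The {"video": ..., "caption": ...} layout is assumed.
import json

def find_sources(json_path: str, prompt_fragment: str) -> list[str]:
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    needle = prompt_fragment.lower()
    return [e["video"] for e in entries if needle in e.get("caption", "").lower()]

for path in find_sources("train_data.json", "girl waving"):
    print(path)
```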
It is difficult to add further training for specific concepts afterwards, even if they are tagged, so I avoided relying on that. Note that anime has few frames to begin with, so overtraining tends to freeze the characters in place.
Perhaps because ModelScope itself was not trained at such a large resolution, training seems easier at lower resolutions. In fact, when training Animov-0.1 I did not need to pay much attention to what is written here to get good results. If you fine-tune ModelScope at larger resolutions, you may need to train incrementally with more data to avoid collapsed results.
That's all.