SAT CogVideoX-2B
This folder contains the inference code using SAT weights and the fine-tuning code for SAT weights.
This code is the framework used by the team to train the model. It has few comments and requires careful study.
Inference Model
- Ensure that you have correctly installed the dependencies required by this folder.
pip install -r requirements.txt
- Download the model weights
First, download the model weights from the SAT mirror.
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
After unzipping, the model directory structure should look like this:
.
├── transformer
│   ├── 1000
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── 3d-vae.pt
Next, clone the T5 model. It is not trained or fine-tuned, but it is required as the text encoder.
git lfs install
git clone https://huggingface.co/google/t5-v1_1-xxl.git
The tf_model.h5 file is not needed and can be deleted.
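Before editing the configs, it can help to verify that everything landed where the next steps expect. Below is a minimal Python sketch, assuming the directory layout shown above; the CogVideoX-2b-sat root path and the check itself are illustrative, not part of this repository:
import os

# Verify the layout shown above before pointing the configs at it.
root = "CogVideoX-2b-sat"  # adjust to your actual download location
expected = [
    "transformer/1000/mp_rank_00_model_states.pt",
    "transformer/latest",
    "vae/3d-vae.pt",
]
for rel in expected:
    path = os.path.join(root, rel)
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)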
- Modify the file configs/cogvideox_2b_infer.yaml:
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
conditioner_config:
target: sgm.modules.GeneralConditioner
params:
emb_models:
- is_trainable: false
input_key: txt
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl" ## T5 model path
max_length: 226
first_stage_config:
target: sgm.models.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
- If using a txt file to supply multiple prompts, refer to configs/test.txt and edit it, with one prompt per line (see the parsing sketch after this list). If you don't know how to write prompts, you can first use an LLM to refine them.
- If using the command line as input, change input_type: cli so that prompts can be entered from the command line.
- To change the output video directory, modify output_dir: outputs/. Videos are saved to the outputs/ folder by default.
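For reference, here is a minimal Python sketch of reading such a prompt file, assuming the inference script simply treats each non-empty line as one prompt (the load_prompts helper is illustrative and not part of this repository):
# Illustrative only: one prompt per line, blank lines ignored.
def load_prompts(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(load_prompts("configs/test.txt")):
    print(i, prompt)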
- Run the inference code to start inference
bash inference.sh
Fine-Tuning the Model
Preparing the Dataset
The dataset format should be as follows:
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   └── ...
└── videos
    ├── 1.mp4
    ├── 2.mp4
    └── ...
Each txt file should share its name with the corresponding video file and contain that video's label. Videos and labels must correspond one to one; typically, a video should not have multiple labels.
For style fine-tuning, please prepare at least 50 videos and labels with similar styles to facilitate fitting.
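Since videos and labels must pair one to one, a quick consistency check can catch mismatches before training starts. A small Python sketch, assuming the labels/ and videos/ layout above (the script and the your_dataset path are illustrative):
from pathlib import Path

# Hypothetical check: every video has a same-named label file, and vice versa.
dataset = Path("your_dataset")  # adjust to your dataset root
labels = {p.stem for p in (dataset / "labels").glob("*.txt")}
videos = {p.stem for p in (dataset / "videos").glob("*.mp4")}

print("videos missing labels:", sorted(videos - labels))
print("labels missing videos:", sorted(labels - videos))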
Modifying the Configuration File
We support both LoRA and full-parameter fine-tuning. Note that both methods fine-tune only the transformer part; the VAE is not modified, and T5 is used only as an encoder. Modify configs/cogvideox_2b_sft.yaml (for full-parameter fine-tuning) as follows:
# checkpoint_activations: True ## using gradient checkpointing (both checkpoint_activations in the configuration file need to be set to True)
model_parallel_size: 1 # Model parallel size
experiment_name: lora-disney # Experiment name (do not change)
mode: finetune # Mode (do not change)
load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer model path
no_load_rng: True # Whether to load the random seed
train_iters: 1000 # Number of training iterations
eval_iters: 1 # Number of evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Batch size for evaluation
save: ckpts # Model save path
save_interval: 100 # Model save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
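As a quick sanity check on these numbers: with train_iters: 1000, save_interval: 100, and eval_interval: 100, training evaluates and writes a checkpoint every 100 iterations, i.e. ten checkpoints under ckpts/. A trivial Python sketch of the implied schedule (assuming saves happen at every multiple of the interval, which is the usual behavior but not verified here):
# Illustrative schedule implied by the config above.
train_iters, save_interval = 1000, 100
save_points = list(range(save_interval, train_iters + 1, save_interval))
print(save_points)  # [100, 200, ..., 1000]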
If you wish to use LoRA fine-tuning, you also need to modify:
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ] ## Uncomment
  log_keys:
    - txt

  lora_config: ## Uncomment
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
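Here r: 256 is the LoRA rank. Conceptually, LoRA freezes a pretrained weight matrix W and learns a low-rank update B @ A, training only A and B; the not_trainable_prefixes: [ 'all' ] line above freezes the base model accordingly. A toy Python illustration of the idea (shapes and initialization follow the LoRA paper, not this repository's LoraMixin code):
import torch

# Toy LoRA forward pass: W is frozen; only A and B would be trained.
d_out, d_in, r = 32, 32, 4          # r plays the role of `r: 256` above
W = torch.randn(d_out, d_in)        # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01     # trainable, small random init
B = torch.zeros(d_out, r)           # trainable, zero init so B @ A starts at 0
x = torch.randn(d_in)
y = W @ x + B @ (A @ x)             # equals W @ x at initialization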
Fine-Tuning and Validation
- Run the fine-tuning code to start fine-tuning.
bash finetune.sh
Converting to Hugging Face Diffusers Supported Weights
The SAT weight format is different from Hugging Face's weight format and needs to be converted. Please run:
python ../tools/convert_weight_sat2hf.py
Note: This conversion has not yet been tested with LoRA fine-tuned models.
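Once converted, the weights can be loaded with the CogVideoXPipeline from diffusers. A minimal usage sketch, assuming the converted checkpoint follows the same layout as the published THUDM/CogVideoX-2b repository (the local path and generation parameters are illustrative):
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load a converted checkpoint directory (path is illustrative).
pipe = CogVideoXPipeline.from_pretrained(
    "path/to/converted-cogvideox-2b", torch_dtype=torch.float16
)
pipe.to("cuda")

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)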