How to fully fine-tune Mixtral 8x7B without using any adapters?
#52 by cuongk14
Hi,
I have tried different approaches (DeepSpeed and FSDP) on a cluster with 10 A100 80GB GPUs, but I always end up with out-of-memory issues. Has anyone here successfully fine-tuned the model without using any of the popular adapters like QLoRA? The following is my DeepSpeed config:
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu"
},
"offload_optimizer": {
"device": "cpu"
}
},
"gradient_accumulation_steps":"auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_clipping": 1.0,
"fp16": {
"enabled": true
}
}```
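For context, here is a rough back-of-the-envelope estimate of the model-state memory that full fine-tuning with Adam in mixed precision needs (the ~46.7B parameter count and the 16 bytes/parameter breakdown are assumptions based on the usual ZeRO accounting, not measured numbers):

```
# Rough model-state memory estimate for full fine-tuning with Adam in mixed precision.
# Assumes ~46.7B total parameters for Mixtral 8x7B and the usual breakdown of
# 2 (bf16 weights) + 2 (grads) + 4 (fp32 master weights) + 4 + 4 (Adam moments) bytes
# per parameter; activations and buffers come on top of this.
params = 46.7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
model_state_gb = params * bytes_per_param / 1e9
aggregate_gpu_gb = 10 * 80  # 10 x A100 80GB

print(f"model states: ~{model_state_gb:.0f} GB vs. {aggregate_gpu_gb} GB of GPU memory")
# -> ~747 GB of model states alone, which is why I offload params and optimizer to CPU.
```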
And the following is a simple train.py script (using batch size 1 and a very small context length for testing):
```
import deepspeed

# Pre-build the CPU Adam op used by the DeepSpeed optimizer offload
deepspeed.ops.op_builder.CPUAdamBuilder().load()

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)
from trl import SFTTrainer


def main():
    # Load the tokenizer (Mixtral's own tokenizer, matching the model below)
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

    # Load dataset from the Hugging Face datasets library;
    # SFTTrainer tokenizes the raw text itself via dataset_text_field
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

    # Data collator for causal (not masked) language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mixtral-8x7B-v0.1",
        torch_dtype=torch.bfloat16,
    )
    model.resize_token_embeddings(len(tokenizer))

    # Define the training arguments
    training_args = TrainingArguments(
        optim="adamw_torch",
        save_strategy="epoch",
        output_dir="./result",
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=1,
        save_steps=10_000,
        save_total_limit=2,
        fp16=True,
        deepspeed="zero3.json",  # Path to the DeepSpeed config file above
        gradient_checkpointing=True,
        report_to="wandb",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        data_collator=data_collator,
        dataset_text_field="text",
        max_seq_length=512,  # keep the context length small for this test
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()


if __name__ == "__main__":
    main()
```
Here is the command to execute the above train.py:
`deepspeed --include localhost:0,1,2,3,4,5,6,7,8,9 train.py`
Hi @cuongk14!
You might be interested in the GaLore algorithm (https://huggingface.co/docs/transformers/v4.40.0/en/trainer#galore), which enables memory-efficient full-parameter pre-training and fine-tuning. Note that it does not support DeepSpeed yet.
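For reference, here is a minimal sketch of how that could be wired up through TrainingArguments (assuming `pip install galore-torch` and a recent transformers release; the Mixtral module-name regexes and the hyperparameters are illustrative, not tuned):

```
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Sketch: full-parameter fine-tuning with the GaLore optimizer via the Trainer integration.
# Requires `pip install galore-torch`; run without DeepSpeed, since GaLore does not support it yet.
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mixtral's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

args = TrainingArguments(
    output_dir="./galore-result",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    optim="galore_adamw",  # or "galore_adamw_8bit" / "galore_adafactor"
    # Regexes assumed from Mixtral's module naming (self_attn / block_sparse_moe);
    # GaLore is applied to the linear layers whose names match these patterns.
    optim_target_modules=[r".*self_attn.*", r".*block_sparse_moe.*"],
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
    train_dataset=dataset["train"],
)
trainer.train()
```

The linked docs also describe the layer-wise GaLore variants, which come with some extra constraints but can save additional memory.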