Model Card for EnDe-chat-0.0.7
Preliminary LoRA finetune of Mistral-7B for German and English quality text.
This version uses an extended tokenizer so that the model can handle longer input.
This is an experiment to improve the German capabilities of Mistral with continued finetuning. The finetuning also includes English data, in order to retain the English capabilities, to allow the model to be used for translation and for answering German questions on English documents and vice versa.
Unfortunately, the compute available for this experiment (2xV100) was not at all sufficient for the amount of training data we would have liked to include.
After continued pretraining, this model has received instruction finetuning.
Model Details
Model Description
LoRA finetune of Mistral-7B for German and English quality text.
- Developed by: Erich Schubert
- Model type: Language model
- Language(s) (NLP): deu, eng
- License: apache-2.0
- Parent Model: mistralai/Mistral-7B-v0.1
- Resources for more information: n/a
Uses
Model finetuned for chat in German and English.
Out-of-Scope Use
The model has received only basic chat/instruction finetuning and no alignment training; it is intended as a chat foundation model.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Recommendations
Further finetuning necessary!
Training Details
Training Data
Pretrained on proprietary text collected from the internet, with a focus on quality German and English text.
Typical benchmarking data should not be present in this data set.
This is less certain for the instruction finetuning data sets, but instruction tuning used far less data and compute than the continued pretraining.
Training Procedure
Initial LoRA finetuning with LLaMA-Factory using a mixture of English and German data, with a focus on data quality.
Unfortunately, I could have used 100x as much GPU power as I had available for this experiment, so I had to heavily subsample the data. As is, this is largely a proof of concept to see whether model quality can be improved with better data.
This version then received basic chat/instruction training with the following LLaMA-Factory arguments:
--stage sft \
--model_name_or_path ende-0.0.7 \
--finetuning_type lora \
--template default \
--dataset_dir data \
--dataset sharegpt-deutsch,oasst_de,dolly_15k_de,openschnabeltier_de,ultrachat_de,evol_instruct,evol_instruct_de,alpaca-gpt4_de,dolphin_de,airoboros_de \
--cutoff_len 1024 \
--learning_rate 5e-05 \
--num_train_epochs 1.0 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--neftune_noise_alpha 0 \
--lora_target all \
--lora_rank 8 \
--lora_dropout 0 \
--fp16 True
Unfortunately, most of this fine-tuning data is just automatically translated from English. I do not think this leads to particularly high-quality data.
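To make the instruction-tuning hyperparameters listed above more concrete, here is a minimal Python sketch of three quantities they imply: the effective batch size, the number of trainable parameters a rank-8 LoRA adapter adds to one weight matrix, and the shape of a cosine learning-rate decay. Several values are assumptions rather than facts from this card: that both V100 GPUs were used data-parallel, that the adapted matrix is a square 4096 x 4096 projection, that no warmup was used, and the step count of 1000, which is purely hypothetical.

```python
import math

# Values taken from the training arguments listed in this card.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
LEARNING_RATE = 5e-5
LORA_RANK = 8

# Assumptions for illustration only (not stated in the arguments).
NUM_GPUS = 2        # assumes both V100s were used data-parallel
HIDDEN_SIZE = 4096  # assumed size of one square projection matrix
TOTAL_STEPS = 1000  # hypothetical number of optimizer steps


def effective_batch_size(per_device, accum_steps, gpus):
    """Sequences contributing to a single optimizer step."""
    return per_device * accum_steps * gpus


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr down to 0, without warmup."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_GPUS)
    print(f"effective batch size: {batch}")

    added = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
    full = HIDDEN_SIZE * HIDDEN_SIZE
    print(f"LoRA parameters per matrix: {added} ({added / full:.2%} of {full})")

    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        lr = cosine_lr(step, TOTAL_STEPS, LEARNING_RATE)
        print(f"step {step:4d}: lr = {lr:.2e}")
```

Under these assumptions, each optimizer step sees 64 sequences of at most 1024 tokens, and the adapter trains well under one percent of the weights in each matrix it is attached to. That is what makes this kind of finetuning feasible on two V100s at all.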
Evaluation
Not fully evaluated, as it has not been completely trained.
Also, I believe that our benchmarks tend to be misleading. In particular, the Hugging Face leaderboard is flooded with overfitted models of little to no value. Real-world performance may be task-specific and needs to be evaluated carefully on a case-by-case basis. I hope some will find this model useful!
You are welcome to contribute evaluation scores!