---
base_model: LemiSt/SmolLM-135M-de
library_name: transformers
license: apache-2.0
tags:
- axolotl
- generated_from_trainer
model-index:
- name: SmolLM-135M-instruct-de-merged
  results:
  - task:
      type: text-generation
    dataset:
      name: openai/MMMLU
      type: mmlu
    metrics:
    - name: MMMLU(DE_DE) (0-Shot)
      type: MMMLU(DE_DE) (0-Shot)
      value: 25.57
      verified: false
  - task:
      type: text-generation
    dataset:
      name: alexandrainst/m_arc
      type: arc
    metrics:
    - name: ARC Challenge (DE) (0-Shot)
      type: ARC Challenge (DE) (0-Shot)
      value: 24.29
      verified: false
  - task:
      type: text-generation
    dataset:
      name: deutsche-telekom/Ger-RAG-eval
      type: Ger-RAG-eval
    metrics:
    - name: Ger-RAG-eval Choose Context By Question
      type: Ger-RAG-eval Task 1
      value: 25.2
      verified: false
    - name: Ger-RAG-eval Choose Question By Context
      type: Ger-RAG-eval Task 2
      value: 27.1
      verified: false
    - name: Ger-RAG-eval Context Question Match
      type: Ger-RAG-eval Task 3
      value: 50.9
      verified: false
    - name: Ger-RAG-eval Question Answer Match
      type: Ger-RAG-eval Task 4
      value: 50.0
      verified: false
language:
- de
pipeline_tag: text-generation
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
See axolotl config

axolotl version: `0.4.1`
```yaml
base_model: LemiSt/SmolLM-135M-de
model_type: LlamaForCausalLM
tokenizer_type: GPT2Tokenizer
load_in_8bit: false
load_in_4bit: true
strict: false
push_dataset_to_hub:
datasets:
  - path: smollm_dataset.json
    type: sharegpt
    conversation: chatml
chat_template: chatml
default_system_prompt: "Du bist ein hilfreicher KI-Assistent."
dataset_prepared_path:
val_set_size: 0.05
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: smollm-135m-de-sft-qlora
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/smollm-135m-sft-qlora-out
hub_model_id: LemiSt/SmolLM-135M-instruct-de
hub_strategy: end
gradient_accumulation_steps: 16
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.003
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
gptq_groupsize:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 4
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
  unk_token: "<|endoftext|>"
```
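
As a quick sanity check on the config above, the effective batch size and the LoRA scaling factor follow from simple arithmetic. The sketch below is only illustrative: it assumes a single training device (the config does not state the GPU count, so `num_devices` is a hypothetical variable) and the standard LoRA scaling of `lora_alpha / lora_r`.

```python
# Values copied from the axolotl config above.
micro_batch_size = 2
gradient_accumulation_steps = 16
lora_r = 32
lora_alpha = 16

# Assumption: one training device; the config does not state the GPU count.
num_devices = 1

# Sequences contributing to each optimizer step.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

print(f"effective batch size: {effective_batch_size}")
print(f"LoRA scaling factor: {lora_scaling}")
```

Because `sample_packing` is enabled, each of those sequences can hold several chat samples packed up to `sequence_len` tokens, so the number of samples per optimizer step is likely higher than the number of sequences.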

# SmolLM-135M-instruct-de-merged

This model is a fine-tuned version of [LemiSt/SmolLM-135M-de](https://huggingface.co/LemiSt/SmolLM-135M-de) on an internal testing dataset with general chat examples.
It achieves the following results on the evaluation set:
- Loss: 0.7453

## Model description

For more information, see the model card of the [base model](https://huggingface.co/LemiSt/SmolLM-135M-de). This adapter was trained using QLoRA at rank 32 with alpha 16, on a dataset of around 200k German chat samples for two epochs.

## Intended uses & limitations

This model is mainly intended for experimenting with tiny chat models. While the output is generally intact German and the model somewhat follows instructions, it makes too many mistakes to be deployed in a real-world setting.

### Usage example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "LemiSt/SmolLM-135M-instruct-de-merged"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=device, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Wie viele Hände hat ein normaler Mensch?"}
]

# Render the conversation with the tokenizer's chat template and tokenize it.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_generation_prompt=True).to(device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.4, top_p=0.9, repetition_penalty=1.1, top_k=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

## Training and evaluation data

Internal dataset which was compiled for another experiment.

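
The axolotl config lists `chatml` as the conversation format for the training data. As a rough illustration of what that layout looks like, the sketch below renders a message list in the commonly used ChatML form. `render_chatml` is a hypothetical helper written for this example, not part of any library, and the template actually shipped with the tokenizer may differ in details, so prefer `tokenizer.apply_chat_template` for real use.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the common ChatML layout (illustrative only)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Wie viele Hände hat ein normaler Mensch?"},
]

prompt = render_chatml(messages)
print(prompt)
```

Comparing this string with the output of `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is a simple way to check whether the assumption holds for this checkpoint.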
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.003
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 20
- num_epochs: 2

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.6406        | 0.0005 | 1    | 1.6172          |
| 0.8219        | 0.2497 | 501  | 0.8901          |
| 0.8646        | 0.4995 | 1002 | 0.8370          |
| 0.8651        | 0.7492 | 1503 | 0.8052          |
| 0.7231        | 0.9989 | 2004 | 0.7827          |
| 0.7632        | 1.2468 | 2505 | 0.7673          |
| 0.7543        | 1.4967 | 3006 | 0.7536          |
| 0.7782        | 1.7466 | 3507 | 0.7469          |
| 0.6724        | 1.9966 | 4008 | 0.7453          |

### Framework versions

- PEFT 0.12.0
- Transformers 4.45.0.dev0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
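
The validation losses in the training results table can also be read as perplexities, which some readers find easier to compare. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is `exp(loss)`. That is an assumption about how the numbers were logged, not something this card states.

```python
import math

# Validation losses copied from the training results table (step -> loss).
validation_loss = {
    1: 1.6172,
    501: 0.8901,
    1002: 0.8370,
    1503: 0.8052,
    2004: 0.7827,
    2505: 0.7673,
    3006: 0.7536,
    3507: 0.7469,
    4008: 0.7453,
}

# Assumption: loss is mean per-token cross-entropy in nats, so exp(loss) is perplexity.
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>4}: validation perplexity ~ {ppl:.2f}")
```

Since `train_on_inputs` is `false` in the config, these values would reflect only the tokens the model was trained to predict, not the prompt tokens.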