metadata

license: apache-2.0
datasets:
  - Mielikki/Erebus-87k
  - allura-org/r_shortstories_24k
language:
  - en
base_model: allura-org/TQ2.5-14B-Sugarquill-v1
library_name: transformers
pipeline_tag: text-generation
tags:
  - llama-cpp
  - gguf-my-repo

Triangle104/TQ2.5-14B-Sugarquill-v1-Q4_K_M-GGUF

This model was converted to GGUF format from allura-org/TQ2.5-14B-Sugarquill-v1 using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Model details:

Qwen2.5-14B Sugarquill v1

A continued pretrain of SuperNova-Medius on assorted short story data from the web. SuperNova already had nice prose, but diversifying it a bit definitely doesn't hurt. Also, finally a storywriter model with enough context for something more than a short story, which is nice too.

It's a fair bit more temperamental than Gemma, but can be tamed with some sampling. Instruction following also stayed rather strong, so it works for both RP and storywriting, both in chat mode via back-and-forth co-writing and on raw completion.

Overall, I'd say it successfully transfers the essence of what I liked about Gemma Sugarquill. I will also make a Qwen version of Aletheia, but with a brand new LoRA, based on a brand new RP dataset that's in the making right now.

Model was trained by Auri.

Training notes

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. I've also normalized punctuation to ASCII on the train split, so mismatched quote marks should no longer be an issue. I also normalized whitespace, so double spaces after a period should be gone as well.

It was trained on a 5x3090Ti workstation for 7.5 hours with rsLoRA. I switched back to Axolotl for this run, as LF just plain refused to run at all on this workstation. Also, it's a bf16 LoRA this time. Overall, training went much smoother than last time. I've attempted to train Qwen Sugarquill several times before, but the loss jumped like crazy. An effective batch size of 40, rsLoRA, and the paged_ademamix_8bit optimizer seem to have completely solved this issue.
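As a quick sanity check on the numbers above, here is a small sketch that reproduces the effective batch size from the config values in this card (5 GPUs, micro_batch_size 1, gradient_accumulation_steps 8) and compares the adapter scaling with and without rsLoRA. The function names are illustrative only, and the rsLoRA factor uses the standard alpha / sqrt(r) definition rather than anything read from the training logs.

```python
import math

def effective_batch_size(num_gpus: int, micro_batch_size: int, grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step across all GPUs."""
    return num_gpus * micro_batch_size * grad_accum_steps

def lora_scaling(alpha: float, rank: int, rslora: bool) -> float:
    """Adapter scaling: alpha / r for classic LoRA, alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(rank) if rslora else alpha / rank

# Values taken from the training notes and the Axolotl config in this card.
print(effective_batch_size(num_gpus=5, micro_batch_size=1, grad_accum_steps=8))  # 40
print(lora_scaling(alpha=32, rank=64, rslora=True))   # 4.0
print(lora_scaling(alpha=32, rank=64, rslora=False))  # 0.5
```

With lora_r at 64 and lora_alpha at 32, rsLoRA scales the adapter update eight times more strongly than classic LoRA would.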

Thanks to Kearm for providing compute for this training run!

Format

Model responds to ChatML instruct formatting, exactly like its base model.

<|im_start|>system
{system message}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
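For raw completion backends that do not apply a chat template for you, the prompt can be assembled by hand. This is a minimal sketch, assuming the usual ChatML layout with a newline after each role name. The function name and the example messages are made up for illustration.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Render a single-turn ChatML prompt that ends where the model should start writing."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    system="You are a co-writer for short fiction.",
    user="Continue the story from the last paragraph.",
)
print(prompt)
```

The prompt deliberately stops after the assistant header, so the model's output is the response itself and generation should end at <|im_end|>.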

Recommended Samplers

I found this configuration to be quite stable:

Temperature - 0.8
Min-P - 0.05
Top-A - 0.3
Repetition Penalty - 1.03

Feel free to toy around with samplers after you get a feel for it. It seems to like Top-A and Smooth Sampling quite a bit.
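To give a feel for what Temperature 0.8 and Min-P 0.05 do together, here is a toy sketch in plain Python. It is not any backend's actual sampler: applying temperature before Min-P is an assumption (backends differ in sampler order), and the token names and logits are invented.

```python
import math

def min_p_filter(logits: dict[str, float], temperature: float = 0.8,
                 min_p: float = 0.05) -> dict[str, float]:
    """Return renormalized probabilities after temperature scaling and Min-P filtering."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    # Softmax, shifted by the peak logit for numerical stability.
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # Min-P: drop tokens whose probability is below min_p times the top probability.
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}

# Invented logits for four candidate tokens.
toy_logits = {"the": 5.0, "a": 4.0, "her": 2.0, "zebra": -3.0}
print(min_p_filter(toy_logits))
```

In this toy case only "the" and "a" survive the cutoff. The cutoff scales with the top token's probability, so it prunes harder when the model is confident and leaves more options open when it is not.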

Training config

See Axolotl config

axolotl version: 0.4.1

Model

base_model: arcee-ai/SuperNova-Medius
strict: false

Liger Kernels (optimization)

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

Output and HuggingFace

output_dir: /home/kearm/axolotl/TQ-2.5-14B-Sugarquill
hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

WandB

wandb_project: huggingface
wandb_entity:
wandb_name: TQ-2.5-14B-Sugarquill-1

Data

#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: allura-org/sugarquill-10k
    type: completion

Evaluation

val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

Technical aspects

sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:

Quantization

bf16: auto
fp16:
tf32: false

For LoRA

load_in_8bit: false
load_in_4bit: false

LoRA

peft_use_rslora: true
peft_use_dora: false # better but slower
adapter: lora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

- embed_tokens

- lm_head

#loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

Training hyperparameters

max_steps:

num_epochs: 2

Anti Overfit and Stability

weight_decay: 0.01
max_grad_norm: 1.0

Learning Rate

warmup_ratio: 0.05
learning_rate: 0.00003
lr_scheduler: cosine
#lr_scheduler_kwargs:

min_lr: 0.0000024

optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit

Batch Size

gradient_accumulation_steps: 8 # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

Optimizations

pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
  use_reentrant: true
local_rank:
deepspeed: /home/kearm/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all

fsdp:

- full_shard

- auto_wrap

fsdp_config:

fsdp_limit_all_gathers: true

fsdp_sync_module_states: true

fsdp_offload_params: true

fsdp_use_orig_params: false

fsdp_cpu_ram_efficient_loading: true

fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

fsdp_state_dict_type: FULL_STATE_DICT

fsdp_sharding_strategy: FULL_SHARD

Misc

early_stopping_patience:
debug:
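The learning-rate settings in the config above (warmup_ratio 0.05, learning_rate 0.00003, cosine scheduler) describe a warmup-then-cosine curve. The sketch below shows the general shape only. It is not Axolotl's or Transformers' exact scheduler, the step count is hypothetical, and whether the min_lr value from the config was actually active in this run is an assumption, so it is exposed as an optional argument.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 3e-5,
               warmup_ratio: float = 0.05, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 100  # hypothetical step count, not taken from the training run
for step in (0, 5, 50, 100):
    print(step, lr_at_step(step, total), lr_at_step(step, total, min_lr=2.4e-6))
```

The first column of learning rates decays to zero, while the second stops at the floor given by min_lr.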

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Triangle104/TQ2.5-14B-Sugarquill-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-sugarquill-v1-q4_k_m.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo Triangle104/TQ2.5-14B-Sugarquill-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-sugarquill-v1-q4_k_m.gguf -c 2048

Note: You can also use this checkpoint directly by following the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (e.g., LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo Triangle104/TQ2.5-14B-Sugarquill-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-sugarquill-v1-q4_k_m.gguf -p "The meaning to life and the universe is"

./llama-server --hf-repo Triangle104/TQ2.5-14B-Sugarquill-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-sugarquill-v1-q4_k_m.gguf -c 2048
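Once llama-server is running, it can be queried over HTTP. The following is a rough sketch using only the Python standard library, assuming the server listens on its default address (http://127.0.0.1:8080) and exposes the /completion endpoint. The helper names are illustrative. Top-A from the recommended samplers is left out because this sketch does not assume llama.cpp supports it.

```python
import json
import urllib.request

def build_completion_request(prompt: str, n_predict: int = 256) -> dict:
    """Payload for llama-server's /completion endpoint using this card's sampler values."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": 0.8,
        "min_p": 0.05,
        "repeat_penalty": 1.03,
    }

def complete(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to a running llama-server and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]

# Requires a running server, so the call is left commented out:
# print(complete("The lighthouse keeper had not spoken to anyone in"))
```

Note that the server command above uses -c 2048, so the prompt and the generated tokens together need to fit within that context size unless a larger value is passed.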