outputs/qwen2.5-0.5b-ft
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B on the OpenHermes 2.5 dataset. Thanks to Redmond.ai for the GPU support!
Model description
This model is based on Qwen2.5-0.5B, which is part of the latest series of Qwen large language models. Qwen2.5 brings significant improvements over its predecessor, including:
- Enhanced knowledge and capabilities in coding and mathematics
- Improved instruction following and long text generation (over 8K tokens)
- Better understanding of structured data and generation of structured outputs (especially JSON)
- Increased resilience to diverse system prompts
- Long-context support up to 128K tokens with the ability to generate up to 8K tokens
- Multilingual support for over 29 languages
The base Qwen2.5-0.5B model features:
- Type: Causal Language Model
- Training Stage: Pretraining
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias, and tied word embeddings
- Number of Parameters: 0.49B (0.36B non-embedding)
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens
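These architecture details can be double-checked against the base model's published configuration. A minimal sketch, assuming network access to the Hugging Face Hub and the standard transformers Qwen2 config attribute names:

```python
# Sketch: read the architecture details listed above from the base model's
# configuration on the Hugging Face Hub (requires network access).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
print(cfg.num_hidden_layers)        # expected: 24 layers
print(cfg.num_attention_heads)      # expected: 14 query heads
print(cfg.num_key_value_heads)      # expected: 2 KV heads (GQA)
print(cfg.tie_word_embeddings)      # expected: True (tied word embeddings)
print(cfg.max_position_embeddings)  # expected: 32768-token context
```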
This fine-tuned version has been trained on the OpenHermes 2.5 dataset, which is a high-quality compilation of primarily synthetically generated instruction and chat samples, reaching 1M samples in total.
Intended uses & limitations
This model is intended for research and application in natural language processing tasks. It can be used for various downstream tasks such as text generation, language understanding, and potentially conversational AI after appropriate fine-tuning.
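As an illustration, a minimal text-generation sketch is shown below. It assumes the fine-tuned weights live under the local path outputs/qwen2.5-0.5b-ft named above (a Hub repo id would work the same way) and that the saved tokenizer carries Qwen's ChatML-style chat template; adjust the path and prompt format to your setup.

```python
# Minimal inference sketch: load the fine-tuned checkpoint and generate a reply.
# "outputs/qwen2.5-0.5b-ft" is the local output path named above; substitute a
# Hub repo id if the weights have been pushed to the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outputs/qwen2.5-0.5b-ft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain RMSNorm in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```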
Limitations:
- The base Qwen2.5-0.5B model is not recommended for direct conversational use without further fine-tuning or post-training (e.g., SFT or RLHF); this checkpoint applies SFT on OpenHermes 2.5 only, and its small size still limits conversational robustness.
- The model's performance may vary across different languages and domains.
- Users should be aware of potential biases present in the training data.
Training and evaluation data
This model was fine-tuned on the OpenHermes 2.5 dataset, which is a continuation and significant expansion of the OpenHermes 1 dataset. It includes:
- A diverse range of open-source datasets
- Custom-created synthetic datasets
- 1 million primarily synthetically generated instruction and chat samples
- High-quality, curated content that has contributed to advances in state-of-the-art (SOTA) LLMs
The dataset is notable for its role in the development of the Open Hermes 2/2.5 and Nous Hermes 2 series of models.
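For reference, the dataset can be inspected directly from the Hub. A small sketch, assuming the public teknium/OpenHermes-2.5 repository and its ShareGPT-style "conversations" column (field names may vary across revisions):

```python
# Sketch: peek at the OpenHermes 2.5 dataset used for this fine-tune.
# Assumes the public teknium/OpenHermes-2.5 Hub repository, where samples are
# stored as ShareGPT-style lists of {"from", "value"} turns.
from datasets import load_dataset

ds = load_dataset("teknium/OpenHermes-2.5", split="train")
print(ds)                      # ~1M rows
print(ds[0]["conversations"])  # one multi-turn instruction/chat sample
```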
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 5
- eval_batch_size: 5
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 40
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 3
- weight_decay: 0.01
Additional training details:
- Gradient Checkpointing: Enabled
- Mixed Precision: BF16 (auto)
- Sequence Length: 4096
- Sample Packing: Enabled
- Pad to Sequence Length: Enabled
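The run itself was launched with Axolotl, so the exact configuration lives in the Axolotl YAML rather than in Python. As a rough, non-authoritative illustration only, the values above map onto Hugging Face TrainingArguments approximately as follows:

```python
# Rough Trainer-equivalent sketch of the hyperparameters listed above.
# The actual run used Axolotl (its config is not reproduced here); this only
# mirrors the reported values for readers familiar with transformers.Trainer.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/qwen2.5-0.5b-ft",
    learning_rate=1e-5,
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    gradient_accumulation_steps=8,   # effective batch size 5 * 8 = 40
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.01,
    optim="adamw_torch",             # betas=(0.9, 0.999), eps=1e-8 are the defaults
    seed=42,
    bf16=True,                       # mixed precision
    gradient_checkpointing=True,
    # Sample packing and the 4096-token sequence length are handled by
    # Axolotl's data pipeline, not by TrainingArguments.
)
```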
Framework versions
- Transformers 4.45.0.dev0
- PyTorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
Additional Information
This model was trained using the Axolotl framework. For more details on the base model, please refer to the Qwen2.5 blog, GitHub repository, and documentation.
To use this model, make sure you have a recent version of the Hugging Face transformers library (4.37.0 or later) to avoid compatibility issues.
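A quick optional check of the installed version (a sketch; the packaging module is already a dependency of transformers):

```python
# Optional sanity check that the installed transformers release is new enough
# (>= 4.37.0) for the Qwen2 architecture this checkpoint uses.
from packaging import version
import transformers

if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError("Please upgrade: pip install -U 'transformers>=4.37.0'")
```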
For support and further development of open-source language models, consider supporting the creator of the OpenHermes dataset on GitHub Sponsors.
Axolotl version: 0.4.1