ctrltokyo/llama-2-7b-hf-dolly-flash-attention
This model is a fine-tuned version of NousResearch/Llama-2-7b-hf, trained on the databricks/databricks-dolly-15k dataset. All training was performed using Flash Attention 2.
No further testing or optimisation has been performed.
Model description
Like ctrltokyo/llm_prompt_mask_fill_model, this model could be used for live autocompletion of prompts, but it is designed more as a general-purpose chatbot (hence the use of the Dolly 15k dataset). Don't try this on code, because it won't work. I plan to release a further fine-tuned version using the code_instructions_120k dataset.
Intended uses & limitations
Use as intended.
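A minimal sketch of how the model might be loaded for chat-style generation. It assumes this repository holds a PEFT adapter (suggested by the PEFT entry under Framework versions, but not confirmed here) that is applied on top of the base model; if the repository contains merged weights instead, loading it directly with `AutoModelForCausalLM` would be the thing to try. The `generate` helper, the float16 dtype, and `device_map="auto"` (which needs `accelerate` installed) are illustrative choices, not settings taken from this card.

```python
BASE_MODEL_ID = "NousResearch/Llama-2-7b-hf"
# Assumption: this repo is a PEFT adapter for the base model above.
ADAPTER_ID = "ctrltokyo/llama-2-7b-hf-dolly-flash-attention"


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Hypothetical helper: load base model + adapter and complete a prompt."""
    # Heavy imports are kept local so this file can be imported without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID,
        torch_dtype=torch.float16,  # assumption: half precision for inference
        device_map="auto",          # assumption: let accelerate place the weights
    )
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Explain what a llama is.")` downloads the weights on first use, so expect a long first run.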
Training and evaluation data
No evaluation was performed. The model was trained on an NVIDIA A100; inference on the raw model appears to use around 20GB of VRAM.
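For orientation only, here is a back-of-the-envelope estimate of the memory taken by the weights alone at a few precisions. It assumes a nominal 7 billion parameters (the "7b" in the model name, not an exact count) and ignores activations, the KV cache, and framework overhead, so it is not meant to reproduce the observed figure above. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: nominal size implied by "7b"

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```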
Training procedure
The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: fp4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float32
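The settings listed above could be expressed with the `BitsAndBytesConfig` class from `transformers`, roughly as sketched below. The values are copied from the list; the dictionary name and the `build_bnb_config` helper are illustrative, and this is not necessarily how the original training script constructed its config.

```python
# Values copied from the quantization config listed above.
BNB_CONFIG_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "float32",  # resolved to a torch dtype below
}


def build_bnb_config():
    """Hypothetical helper: turn the settings above into a BitsAndBytesConfig."""
    # Imported lazily so the settings can be inspected without torch installed.
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(BNB_CONFIG_KWARGS)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The returned object can be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained` when loading the base model in 4-bit.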
Framework versions
- PEFT 0.4.0