---
inference: false
language:
- en
pipeline_tag: text-generation
tags:
- Yi
- llama
- llama-2
license: other
license_name: yi-license
license_link: LICENSE
datasets:
- jondurbin/airoboros-2.2.1
- lemonilia/LimaRP
---
# airoboros-2.2.1-limarpv3-y34b-exl2

Exllama v2 quant of [Doctor-Shotgun/airoboros-2.2.1-limarpv3-y34b](https://huggingface.co/Doctor-Shotgun/airoboros-2.2.1-limarpv3-y34b)

Branches:
- main: measurement.json calculated at 2048 token calibration rows on PIPPA
- 4.65bpw-h6: 4.65 decoder bits per weight, 6 head bits
  - ideal for 24gb GPUs at 8k context (on my 24gb Windows setup with flash attention 2, peak VRAM usage during inference with exllamav2_hf was around 23.4gb with 0.9gb used at baseline)
- 6.0bpw-h6: 6 decoder bits per weight, 6 head bits
  - ideal for large (>24gb) VRAM setups
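
As a rough sanity check on the branch recommendations above, the size of the quantized decoder weights can be estimated from the bits-per-weight figure. This is a back-of-the-envelope sketch, not a measurement: the parameter count of 34 billion is an assumption taken from the "y34b" in the model name, and the helper `weight_gib` is a hypothetical name introduced here. The estimate ignores the 6-bit head layer, the KV cache, activations, and framework overhead, so real VRAM usage will be higher.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized decoder weights in GiB."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB


# Assumption: roughly 34 billion parameters, inferred from the model name.
PARAMS_BILLION = 34.0

for branch, bpw in (("4.65bpw-h6", 4.65), ("6.0bpw-h6", 6.0)):
    size = weight_gib(PARAMS_BILLION, bpw)
    print(f"{branch}: roughly {size:.1f} GiB of decoder weights")
```

Under these assumptions the 4.65bpw weights leave a few GiB of headroom on a 24gb card for the 8k context cache, while the 6.0bpw weights alone come close to 24gb, which is consistent with the guidance in the branch list.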