Meta-Llama-3-8B-Instruct-FP8

Model Overview

  • Model Architecture: Meta-Llama-3
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
    • KV cache quantization: FP8
  • Intended Use Cases: Intended for commercial and research use in English. Like Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 6/8/2024
  • Version: 1.0
  • License(s): Llama3
  • Model Developers: Neural Magic

Quantized version of Meta-Llama-3-8B-Instruct, with weights, activations, and the KV cache stored in FP8.
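
As a quick usage illustration, the sketch below loads this checkpoint for offline inference with vLLM and enables the FP8 KV cache, mirroring the kv_cache_dtype=fp8 setting used in the evaluation command below. The prompt and sampling settings are illustrative assumptions, not part of this card; it assumes a vLLM build with FP8 support.

```python
# Minimal sketch: offline inference with vLLM (assumes a vLLM build with FP8 support).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V",
    kv_cache_dtype="fp8",  # store the KV cache in FP8, as in the evaluation below
)

# Illustrative prompt; for chat use, apply the Llama 3 chat template first.
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```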

Evaluation

The model was evaluated on the GSM8K benchmark (5-shot) using lm-evaluation-harness with the vLLM backend:

```
lm_eval --model vllm \
  --model_args pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V,kv_cache_dtype=fp8,add_bos_token=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```

Results (vLLM backend, kv_cache_dtype=fp8, add_bos_token=True, 5-shot, batch_size auto):

|Tasks|Version|Filter          |n-shot|Metric         |Value |Stderr  |
|-----|------:|----------------|-----:|---------------|-----:|-------:|
|gsm8k|      3|flexible-extract|     5|exact_match (↑)|0.7748|± 0.0115|
|gsm8k|      3|strict-match    |     5|exact_match (↑)|0.7763|± 0.0115|
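
The same FP8 KV-cache setting applies when serving the model behind an OpenAI-compatible API. A hedged sketch follows; the exact flags may vary across vLLM versions, so verify against your installed release:

```
python -m vllm.entrypoints.openai.api_server \
  --model nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V \
  --kv-cache-dtype fp8
```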