language:
  - en
  - de
  - fr
  - it
  - pt
  - hi
  - es
  - th
license: llama3.1
pipeline_tag: text-generation
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-3

Meta-Llama-3.1-8B-Instruct-FP8-128K

Model Overview

  • Model Architecture: Meta-Llama-3.1
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
    • KV Cache quantization: FP8
  • Intended Use Cases: Intended for commercial and research use in multiple languages. Similar to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/27/2024
  • Version: 1.0
  • License(s): llama3.1
  • Quantized version of Meta-Llama-3.1-8B-Instruct (see the offline-inference sketch below).

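As a quick sanity check outside of a server deployment, the model can also be loaded directly through vLLM's offline LLM API. The snippet below is a minimal sketch, assuming a single GPU; the prompt, sampling parameters, and the reduced max_model_len are illustrative choices, not values taken from this model card.

from vllm import LLM, SamplingParams

# Load the FP8 checkpoint with an FP8 KV cache. max_model_len is reduced here
# purely to keep the example light on memory; the model supports up to 131072.
llm = LLM(
    model="yejingfu/Meta-Llama-3.1-8B-Instruct-FP8-128K",
    kv_cache_dtype="fp8",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Give a one-sentence summary of FP8 quantization."], sampling_params)
print(outputs[0].outputs[0].text)
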
Serve with vLLM engine

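# Starts an OpenAI-compatible HTTP server on <port> with the full 128K context
# window and an FP8 KV cache; adjust the GPU-related settings to your hardware.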
python3 -m vllm.entrypoints.openai.api_server \
    --port <port> --model yejingfu/Meta-Llama-3.1-8B-Instruct-FP8-128K \
    --tensor-parallel-size 1 --swap-space 16 --gpu-memory-utilization 0.96 \
    --max-num-seqs 32 --max-model-len 131072 --kv-cache-dtype fp8 --enable-chunked-prefill
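
Once the server is up, it exposes the standard OpenAI-compatible chat completions endpoint. The following is a minimal client sketch, assuming the command above was started with --port 8000 on the local machine; the prompt and max_tokens are illustrative.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (assumed here to be on port 8000).
# vLLM does not require authentication unless --api-key is set, so a placeholder key works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="yejingfu/Meta-Llama-3.1-8B-Instruct-FP8-128K",
    messages=[{"role": "user", "content": "What are the benefits of an FP8 KV cache?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)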
