README.md · meta-llama/Llama-3.1-405B-Instruct-FP8 at refs/pr/2

How to run it

There are two ways to run this model: with Hugging Face Transformers (using accelerate) or with vLLM.

Set up the environment

For HF:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4

# Download the enablement fork from https://huggingface.co/sllhf/transformers_enablement_fork/tree/main and unzip it

cd transformers

# add changes from this PR https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .

# Install accelerate from main (go back up first so the clone does not land inside the transformers checkout)
cd ..
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .

For vLLM: install from main or use the nightly wheel: https://docs.vllm.ai/en/latest/getting_started/installation.html
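
After installing, it can help to confirm which versions actually ended up in the environment, since the setup mixes nightly, release-candidate, and editable installs. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package lists are illustrative assumptions, not part of any of the libraries above.

```python
from importlib import metadata

# Distribution names as assumed from the install commands above; adjust as needed.
HF_PACKAGES = ("torch", "fbgemm-gpu", "transformers", "accelerate")
VLLM_PACKAGES = ("vllm",)


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(HF_PACKAGES + VLLM_PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Only the packages for the backend you chose need to be present; a `not installed` line for the other backend is expected.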

Load the model with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
# The tokenizer returns a dict-like batch (input_ids and attention_mask), not just the ids
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Set your own generation parameters here (temperature, top_p, etc.)

output = quantized_model.generate(**inputs, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))
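
The comment above leaves the sampling parameters to the reader. One way to pass them is sketched below; the helper names `build_generation_kwargs` and `generate_reply` are hypothetical, and the default values (temperature 0.6, top_p 0.9, 64 new tokens) are placeholders rather than recommended settings for this model.

```python
def build_generation_kwargs(temperature=0.6, top_p=0.9, max_new_tokens=64):
    """Collect keyword arguments for `generate`; the defaults are placeholders."""
    if temperature <= 0:
        # Sampling needs a positive temperature, so fall back to greedy decoding
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }


def generate_reply(prompt, model_name="sllhf/Meta-Llama-3.1-405B-Instruct-FP8", **overrides):
    """Load the model and generate one reply (hypothetical helper)."""
    # Imported lazily so build_generation_kwargs works without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, **build_generation_kwargs(**overrides))
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For example, `generate_reply("What are we having for dinner?", temperature=0.7, max_new_tokens=32)` would sample with those settings, while `temperature=0` switches to greedy decoding.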

Run it with vLLM

Follow the entrypoints described at https://docs.vllm.ai/

For example:

from vllm import LLM
model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=8192)
print(model.generate(["Hi there!"]))
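
The example above prints the raw request objects and uses vLLM's default sampling settings. A slightly fuller sketch, assuming you want explicit sampling parameters and only the generated text, could look like this; `sampling_settings` and `run_vllm` are hypothetical helper names, and the default values are placeholders.

```python
def sampling_settings(temperature=0.6, top_p=0.9, max_tokens=64):
    """Validate and collect sampling settings; the defaults are placeholders."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    return {"temperature": temperature, "top_p": top_p, "max_tokens": max_tokens}


def run_vllm(prompts, **overrides):
    """Generate one completion per prompt and return the text only."""
    # Imported lazily so sampling_settings works without vLLM installed
    from vllm import LLM, SamplingParams

    llm = LLM(
        "sllhf/Meta-Llama-3.1-405B-Instruct-FP8",
        tensor_parallel_size=8,  # same values as the example above
        max_model_len=8192,
    )
    params = SamplingParams(**sampling_settings(**overrides))
    outputs = llm.generate(prompts, params)
    # Each result holds one or more completions; keep the first one's text
    return [out.outputs[0].text for out in outputs]
```

Calling `run_vllm(["Hi there!"], temperature=0.7)` would return a list with one generated string per prompt.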