Are there any tutorials for running the model and checking the PPL?
Thanks!
This model checkpoint can only be used with Friendli Container. You can find the guide to pulling and running Friendli Container at https://docs.friendli.ai/guides/container/running_friendli_container.
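For reference, a rough sketch of the launch command is below. The image name, environment variable, and flags (`registry.friendli.ai/trial:latest`, `FRIENDLI_CONTAINER_SECRET`, `--hf-model-name`, `--web-server-port`) are assumptions based on the pattern in the linked guide, so verify them against the docs for your container version:

```sh
# Hedged sketch of a Friendli Container launch; verify image and flag names
# against https://docs.friendli.ai/guides/container/running_friendli_container
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=<your-secret> \
  registry.friendli.ai/trial:latest \
  --hf-model-name <model-repo-id> \
  --web-server-port 8000
```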
To calculate the PPL, you need to send an inference request to the serving endpoint created by Friendli Container, using options like `include_output_logprobs` and `forced_output_tokens`. `forced_output_tokens` makes the serving engine generate your target tokens so their logprobs can be computed. See the API reference at https://docs.friendli.ai/openapi/create-completions.
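To make that concrete, here is a minimal Python sketch of a PPL computation against the endpoint. The option names `include_output_logprobs` and `forced_output_tokens` come from the explanation above; everything else (the endpoint URL, the tokenizer repo placeholder, and especially the assumed `output_logprobs` field in the response) is illustrative, so check the linked API reference for the exact request/response schema:

```python
import math

import requests
from transformers import AutoTokenizer

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed host/port/path
MODEL_ID = "<hf-model-repo-id>"  # placeholder: tokenizer matching the served model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
target_text = "The quick brown fox jumps over the lazy dog."
target_ids = tokenizer(target_text, add_special_tokens=False)["input_ids"]

resp = requests.post(
    ENDPOINT,
    json={
        "prompt": "",                        # context preceding the target text
        "forced_output_tokens": target_ids,  # force the engine to emit exactly these tokens
        "include_output_logprobs": True,     # return per-token logprobs for them
        "max_tokens": len(target_ids),
    },
)
resp.raise_for_status()
# Assumed response layout; consult the create-completions reference for the real one.
logprobs = resp.json()["choices"][0]["output_logprobs"]

# Perplexity is exp of the negative mean token logprob.
ppl = math.exp(-sum(logprobs) / len(logprobs))
print(f"PPL over {len(logprobs)} tokens: {ppl:.3f}")
```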
Note that the Friendli engine executes the actual (autoregressive) generation process. The process comprises multiple steps, where each step computes the logprobs of a single next token.
This is different from, and slower than, feeding an entire sequence and computing the logprobs of all tokens in a single step.
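For contrast, the single-pass (teacher-forcing) approach looks like the sketch below, using Hugging Face transformers outside Friendli Container entirely; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<hf-model-repo-id>"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

ids = tok("The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(ids).logits  # [1, seq_len, vocab] in one forward pass

# The logprob of token t is read from the distribution at position t-1.
logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
ppl = torch.exp(-token_lp.mean())
print(f"single-pass PPL: {ppl.item():.3f}")
```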
My `trial:latest`, downloaded yesterday, says it doesn't recognize dtype `fp8`.
How do I actually load/run this?
I'm actually interested in running the 70B model, but there weren't any posts about it there.
I have 3x RTX 6000 Ada, CUDA 12.4, etc., so I should be good to go?
I'm looking to do high-throughput batch processing of biomedical text.
Thanks.
DUH. RTFM, as they used to say. Never mind, found it.
Actually, I haven't been able to get your FP8 example to work. Too bad.