Are there any tutorials for running the model and checking the PPL?
Thanks!
This model checkpoint can only be used with Friendli Container. You can find the guide to pulling and running Friendli Container at https://docs.friendli.ai/guides/container/running_friendli_container.
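For reference, a rough sketch of the launch command is below. The image name, environment variable, and flags (`registry.friendli.ai/trial:latest`, `FRIENDLI_CONTAINER_SECRET`, `--hf-model-name`, `--web-server-port`) are assumptions based on the pattern in the linked guide, so verify them against the docs for your container version:

```sh
# Hedged sketch of a Friendli Container launch; verify image and flag names
# against https://docs.friendli.ai/guides/container/running_friendli_container
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=<your-secret> \
  registry.friendli.ai/trial:latest \
  --hf-model-name <model-repo-id> \
  --web-server-port 8000
```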
To calculate the PPL, you need to send an inference request to the serving endpoint created by Friendli Container, using options like `include_output_logprobs` and `forced_output_tokens`. `forced_output_tokens` makes the serving engine generate your target tokens so their logprobs can be computed. See the API reference at https://docs.friendli.ai/openapi/create-completions.
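To make that concrete, here is a minimal Python sketch of a PPL computation against the endpoint. The option names `include_output_logprobs` and `forced_output_tokens` come from the explanation above; everything else (the endpoint URL, the tokenizer repo placeholder, and especially the assumed `output_logprobs` field in the response) is illustrative, so check the linked API reference for the exact request/response schema:

```python
import math

import requests
from transformers import AutoTokenizer

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed host/port/path
MODEL_ID = "<hf-model-repo-id>"  # placeholder: tokenizer matching the served model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
target_text = "The quick brown fox jumps over the lazy dog."
target_ids = tokenizer(target_text, add_special_tokens=False)["input_ids"]

resp = requests.post(
    ENDPOINT,
    json={
        "prompt": "",                        # context preceding the target text
        "forced_output_tokens": target_ids,  # force the engine to emit exactly these tokens
        "include_output_logprobs": True,     # return per-token logprobs for them
        "max_tokens": len(target_ids),
    },
)
resp.raise_for_status()
# Assumed response layout; consult the create-completions reference for the real one.
logprobs = resp.json()["choices"][0]["output_logprobs"]

# Perplexity is exp of the negative mean token logprob.
ppl = math.exp(-sum(logprobs) / len(logprobs))
print(f"PPL over {len(logprobs)} tokens: {ppl:.3f}")
```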
Note that the Friendli engine executes the actual (autoregressive) generation process. The process comprises multiple steps, where each step computes the logprobs of a single next token.
This is different from, and slower than, feeding an entire sequence and computing the logprobs of all tokens in a single step.
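For contrast, the single-pass (teacher-forcing) approach looks like the sketch below, using Hugging Face transformers outside Friendli Container entirely; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<hf-model-repo-id>"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

ids = tok("The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(ids).logits  # [1, seq_len, vocab] in one forward pass

# The logprob of token t is read from the distribution at position t-1.
logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
ppl = torch.exp(-token_lp.mean())
print(f"single-pass PPL: {ppl.item():.3f}")
```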
My `trial:latest`, downloaded yesterday, says it doesn't recognize dtype `fp8`.
How do I actually load/run this?
I'm actually interested in running the 70B model, but there weren't any posts about it there.
I have 3x RTX 6000 Ada, CUDA 12.4, etc., so I should be good to go?
I'm looking to do high-throughput batch processing of biomedical text.
Thanks.
DUH. RTFM, as they used to say. Never mind, found it.
Actually, I haven't been able to get your FP8 example to work. Too bad.