MoritzLaurer posted an update 6 days ago:

The new NIM Serverless API by HF and Nvidia is a great option if you want a reliable API for open-weight LLMs like Llama-3.1-405B that are too expensive to run on your own hardware.

- It's pay-as-you-go, so it doesn't have the rate limits of the standard HF Serverless API, and you don't need to commit to hardware as you would for a dedicated endpoint.
- It works out of the box with the new v0.25 release of our huggingface_hub.InferenceClient
- It's specifically tailored to a small collection of popular open-weight models. For a broader selection of open models, we recommend using the standard HF Serverless API.
- Note that you need a token from an Enterprise Hub organization to use it.

Details in this blog post: https://huggingface.co/blog/inference-dgx-cloud
Compatible models in this HF collection: nvidia/nim-serverless-inference-api-66a3c6fcdcb5bbc6e975b508
Release notes with many more features of huggingface_hub==0.25.0: https://github.com/huggingface/huggingface_hub/releases/tag/v0.25.0
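
If you'd rather check the compatible models programmatically, huggingface_hub also has a collections helper. A small sketch using get_collection (a documented huggingface_hub function) with the collection slug linked above:

from huggingface_hub import get_collection

# Fetch the NIM collection and print the IDs of the models it contains
collection = get_collection("nvidia/nim-serverless-inference-api-66a3c6fcdcb5bbc6e975b508")
for item in collection.items:
    if item.item_type == "model":
        print(item.item_id)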

Copy-pasteable code below (originally shared in the first comment):
#!pip install "huggingface_hub>=0.25.0"
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="https://huggingface.co/api/integrations/dgx/v1",  # HF's NIM/DGX Cloud integration endpoint
    api_key="MY_FINEGRAINED_ENTERPRISE_ORG_TOKEN",  # see docs: https://huggingface.co/blog/inference-dgx-cloud#create-a-fine-grained-token
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    max_tokens=1024,
)

print(output.choices[0].message.content)  # the generated text
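
The client also supports streaming if you want tokens as they are generated. A minimal sketch, assuming the NIM endpoint streams chat completions the same way the standard Serverless API does (stream=True and the delta fields are part of huggingface_hub's OpenAI-compatible chat interface):

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental delta; content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")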

Very exciting to see this! I often want to use an LLM for a short period, and setting up a whole endpoint for this can be overkill. This seems like a very neat solution!

Do you think there's a chance that any VLMs will be added soon?