Update configuration_nvembed & infinity doc
#53, opened by michaelfeil
Tested with:
docker run -it -e HF_TOKEN=$HF_TOKEN --gpus "0,1" -v ./data:/app/.cache -p 7997:7997 michaelf34/infinity:0.0.70 v2 --model-id nvidia/NV-Embed-v1 --revision "refs/pr/53" --batch-size 8
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-18 22:11:48,826 infinity_emb INFO: Creating 1engines: engines=['nvidia/NV-Embed-v1'] infinity_server.py:92
INFO 2024-11-18 22:11:48,830 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`. telemetry.py:30
INFO 2024-11-18 22:11:48,838 infinity_emb INFO: model=`nvidia/NV-Embed-v1` selected, using engine=`torch` and device=`None` select_model.py:64
INFO 2024-11-18 22:11:48,976 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: nvidia/NV-Embed-v1 SentenceTransformer.py:216
INFO 2024-11-18 22:11:56,462 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:56
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING 2024-11-18 22:11:56,465 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.nvidia.NV-Embed-v1.5c4350bceb31cb14881e0c80c372eeeabd2b257f.modeling_nvembed.NVEmbedModel'> Continue without bettertransformer acceleration.py:67
modeling code.
/usr/lib/python3.10/contextlib.py:103: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
INFO 2024-11-18 22:11:57,800 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=2 select_model.py:97
1.26 ms tokenization
100.00 ms inference
0.18 ms post-processing
101.45 ms total
embeddings/sec: 78.86
INFO 2024-11-18 22:11:59,621 infinity_emb INFO: Getting timings for batch_size=8 and avg tokens per sentence=513 select_model.py:103
4.60 ms tokenization
893.86 ms inference
0.24 ms post-processing
898.70 ms total
embeddings/sec: 8.90
INFO 2024-11-18 22:11:59,623 infinity_emb INFO: model warmed up, between 8.90-78.86 embeddings/sec at batch_size=8 select_model.py:104
INFO 2024-11-18 22:11:59,625 infinity_emb INFO: creating batching engine batch_handler.py:443
INFO 2024-11-18 22:11:59,629 infinity_emb INFO: ready to batch requests. batch_handler.py:512
INFO 2024-11-18 22:11:59,632 infinity_emb INFO: infinity_server.py:106
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.70
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
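Once the log reports startup as complete, the server can be queried from a client. The following is a minimal sketch, not part of the PR. It assumes Infinity's OpenAI-style `POST /embeddings` route is served at the default (empty) URL prefix; the Swagger UI at `/docs` from the banner above should confirm the exact route and schema. The helper names `build_embedding_request` and `embed` are hypothetical, chosen for illustration only.

```python
import json
import urllib.request

# Values taken from the docker command and startup banner above.
BASE_URL = "http://0.0.0.0:7997"
MODEL_ID = "nvidia/NV-Embed-v1"


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the embeddings route.

    Assumption: the route is `<base_url>/embeddings` and accepts an
    OpenAI-style JSON body with `model` and `input` fields.
    """
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, timeout=60.0):
    """Send the request and return one embedding (list of floats) per text.

    Assumption: the response follows the OpenAI layout, i.e.
    {"data": [{"embedding": [...]}, ...]}.
    """
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]
```

With the container running, `embed(["hello world"])` should return a list containing a single vector. If the client cannot connect to `0.0.0.0`, pass `base_url="http://localhost:7997"` to `build_embedding_request` instead.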