Update Readme on usage with Infinity

#36

Tested with large, base, and v2.

docker run --gpus all -v $PWD/data:/app/.cache -e HF_TOKEN=$HF_TOKEN -p "7993":"7997" michaelf34/infinity:0.0.68 v2 --model-id BAAI/bge-reranker-base --revision "main" --dtype float16 --batch-size 32 --engine torch --port 7997
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-11-13 00:59:19,095 infinity_emb INFO:        infinity_server.py:89
         Creating 1engines:                                                     
         engines=['BAAI/bge-reranker-base']                                     
INFO     2024-11-13 00:59:19,099 infinity_emb INFO: Anonymized   telemetry.py:30
         telemetry can be disabled via environment variable                     
         `DO_NOT_TRACK=1`.                                                      
INFO     2024-11-13 00:59:19,106 infinity_emb INFO:           select_model.py:64
         model=`BAAI/bge-reranker-base` selected, using                         
         engine=`torch` and device=`None`                                       
INFO     2024-11-13 01:00:12,731                             CrossEncoder.py:125
         sentence_transformers.cross_encoder.CrossEncoder                       
         INFO: Use pytorch device: cuda                                         
INFO     2024-11-13 01:00:13,625 infinity_emb INFO: Adding    acceleration.py:56
         optimizations via Huggingface optimum.                                 
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
INFO     2024-11-13 01:00:13,635 infinity_emb INFO: Switching to     torch.py:71
         half() precision (cuda: fp16).                                         
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO     2024-11-13 01:00:13,949 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=3                                                             
                 2.71     ms tokenization                                       
                 8.11     ms inference                                          
                 0.00     ms post-processing                                    
                 10.83    ms total                                              
         embeddings/sec: 2954.28                                                
INFO     2024-11-13 01:00:14,149 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=512                                                           
                 28.06    ms tokenization                                       
                 24.17    ms inference                                          
                 0.01     ms post-processing                                    
                 52.25    ms total                                              
         embeddings/sec: 612.50     
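
For the README, a client call could look like the sketch below. It assumes Infinity's Cohere-style `POST /rerank` endpoint and targets the host side of the `-p "7993":"7997"` mapping above; the query and documents are placeholders, so the exact request/response schema should be checked against the server's `/docs` page for the Infinity version being run.

```python
# Minimal sketch: score documents against a query with the reranker started above.
# Assumption: Infinity exposes a Cohere-style /rerank endpoint returning
# {"results": [{"index": ..., "relevance_score": ...}, ...]}.
import requests

resp = requests.post(
    "http://localhost:7993/rerank",  # host side of the -p "7993":"7997" mapping
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "what is panda?",
        "documents": [
            "The giant panda is a bear species endemic to China.",
            "Paris is the capital of France.",
        ],
    },
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["index"], result["relevance_score"])
```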

@Shitao Can you review?
