input limit of model and rate limit of endpoint
Hi, I have been testing the HF Azure endpoint for this model (obi/deid_roberta_i2b2). The model accepts a list of sentences as input, and I noticed that HTTP errors occur when the input list is large. I am wondering what the token limit is for each sequence and what the limit is on the length of the list. The model's HF description page doesn't provide much detail; any further documentation about the model would be appreciated. Also, for its endpoint, is there any request rate limit, and are concurrent requests allowed? Thanks.
Hi, the transformer model expects at most 512 tokens. The text input to the model should be such that the resulting subword tokens number fewer than 512. In our project we used a limit of 128 spaCy tokens; on our dataset, 128 spaCy tokens always resulted in fewer than 512 subword tokens. But you can modify your sequences so that the subword tokenization always results in fewer than 512 tokens, or you can use the truncation option in the subword tokenizer.
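For example, here is a minimal sketch (assuming the standard Hugging Face AutoTokenizer for this model) of checking the subword count for a sentence before sending it, or truncating it to the 512-token maximum:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("obi/deid_roberta_i2b2")

sentence = "Patient John Doe was admitted on 01/02/2020 to Mercy Hospital."

# Count subword tokens (including special tokens) before sending the input.
n_subwords = len(tokenizer(sentence)["input_ids"])
print(f"subword tokens: {n_subwords}")

# Or let the tokenizer truncate to the model's 512-token maximum.
encoded = tokenizer(sentence, truncation=True, max_length=512)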
We believe the limit on the length of the input list will depend on your machine and available resources.
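One hedged workaround sketch for the HTTP errors on large lists is to split the sentence list into smaller batches before posting. The endpoint URL, headers, and payload shape below are placeholders, not the actual API of your deployment; adapt them accordingly:

import requests

ENDPOINT_URL = "https://<your-azure-endpoint>/score"   # hypothetical URL
HEADERS = {"Authorization": "Bearer <your-key>"}        # hypothetical auth header

def predict_in_batches(sentences, batch_size=16):
    """Send the sentence list to the endpoint in smaller batches."""
    predictions = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        resp = requests.post(ENDPOINT_URL, headers=HEADERS, json={"inputs": batch})
        resp.raise_for_status()
        predictions.extend(resp.json())
    return predictions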
I hope this answers a few of your questions; let us know if there is anything else you want to clarify. We're working on improving the model and documentation!
Hi @prajwal967, thanks very much for the input! Does the 128-token limit include the 32 added-on tokens? We use the scispacy "en_core_sci_lg" model to sentencize text for the endpoint inputs; in this case, what token limit would you suggest to avoid exceeding the 512 subword token limit? Also, what parameters can be passed via the HF Azure endpoint to configure this model? Thanks!
Yes, the 128 tokens include the 32 added-on tokens, which means there will be 32 added-on tokens on either side (hence 64 added-on tokens in total) of the current sentence/chunk. If 128 tokens exceed the token limit, you can try setting the token limits to smaller values, for example:
max_tokens = 64
max_prev_sentence_token = 16
max_next_sentence_token = 16
default_chunk_size = 16
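As a rough illustration only (not the project's actual implementation; the real chunking logic and the role of default_chunk_size may differ), you can think of these limits as bounding each model input to the current chunk plus a window of context on either side, so 64 + 16 + 16 spaCy tokens per input stays well under 512 subwords:

max_tokens = 64
max_prev_sentence_token = 16
max_next_sentence_token = 16

def build_chunks(spacy_tokens):
    """Split a token list into chunks and attach left/right context windows."""
    chunks = []
    for start in range(0, len(spacy_tokens), max_tokens):
        current = spacy_tokens[start:start + max_tokens]
        prev_ctx = spacy_tokens[max(0, start - max_prev_sentence_token):start]
        next_ctx = spacy_tokens[start + max_tokens:
                                start + max_tokens + max_next_sentence_token]
        chunks.append(prev_ctx + current + next_ctx)
    return chunks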
By HF Azure endpoint, are you referring to the AutoModel from Hugging Face? If yes, then the parameters that can be passed are listed on the Hugging Face documentation pages.
Let us know if you still have trouble getting this to work! Thanks!
Thanks very much, @prajwal967! What compute instance specs, in terms of vCPUs and memory, would you suggest for deploying the model for inference? Thanks.
That would depend on a few parameters (for example, the batch size). But I think with a batch size of 1 or 2, you should be able to run it on a CPU machine with about 16 GB of RAM. You can go higher with the batch size if you have more compute resources/memory available.
I noticed that a batch size of 32 was used in training, but the model's batch size is not listed in the config file here. I'm wondering what the default batch size is and whether the batch_size config would impact the limit on the length of the input list. Thanks.
The batch size will not impact the length of the input. The batch size is a Hugging Face parameter; for the forward pass, you can find the parameter "per_device_eval_batch_size" listed in the config file.
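As a hedged sketch (the exact keys in our config file may differ), this is the standard Hugging Face TrainingArguments field that controls the forward-pass batch size:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="deid_output",        # hypothetical output directory
    per_device_eval_batch_size=2,    # batch size used for the forward pass
)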
Hope that helps!