How many tokens per second can I expect when running inference with llama-13b-hf on an A10G?
Which generation parameter matters most for accelerating inference speed?
Is max_length set to 1000? It seems like that could be very slow!
generation_config:
  temperature: 0.90
  top_p: 0.75
  num_beams: 1
  use_cache: True
  max_length: 1000
  min_length: 0
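For reference, here is a minimal sketch of how a config like the one above would typically be fed to transformers' generate(). The model path is a placeholder, and do_sample=True is an assumption (temperature/top_p only take effect when sampling), so this is illustrative rather than the exact code of this Space:

```python
# Illustrative sketch only: mapping the generation_config above onto
# transformers' generate(). The model path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-13b-hf"  # placeholder, not necessarily what this Space uses
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain beam search in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,    # assumed: temperature/top_p only matter when sampling
    temperature=0.90,
    top_p=0.75,
    num_beams=1,
    use_cache=True,    # reuse past key/values; a big win for long generations
    max_length=1000,   # counts prompt tokens plus generated tokens
    min_length=0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```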
That is not the case. Unfortunately, the Hugging Face library does not support streaming generation at the moment, so one has to write a sort of monkey patch to enable it. The parameters you posted are used in batch generation mode, which is not the case for this Space.
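For readers curious what such a work-around looks like, below is a hand-rolled token-by-token loop that yields text as it is generated. It is a sketch of the general technique, not the patch used by this Space, and all names are illustrative:

```python
# Illustrative only: a hand-rolled per-token sampling loop that streams text
# as it is generated. NOT the exact patch used by this Space.
import torch

def stream_generate(model, tokenizer, prompt, max_tokens=128, temperature=1.0, top_p=0.9):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    for _ in range(max_tokens):
        with torch.no_grad():
            out = model(
                input_ids if past_key_values is None else input_ids[:, -1:],
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = out.past_key_values
        logits = out.logits[:, -1, :] / temperature

        # nucleus (top-p) filtering
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        remove = torch.cumsum(probs, dim=-1) > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # keep the token that crosses the threshold
        remove[..., 0] = False
        sorted_logits[remove] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

        next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # Note: decoding single tokens can mangle whitespace with SentencePiece
        # tokenizers; a real implementation would decode incrementally.
        yield tokenizer.decode(next_token[0])
```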
instruction_prompt, max_tokens=128, temperature=1, top_p=0.9, cache=True
These are the only parameters supported in streaming mode at the moment. Generation could be faster if I removed the window that looks back over the conversation history.
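The history window mentioned above can be pictured roughly as follows. This is a hedged sketch that assumes the history is a list of (user, bot) turns and that the window is counted in turns; the actual Space may measure it differently (e.g. in tokens):

```python
# Hedged sketch of a conversation-history window: only the last `window`
# turns are folded into the prompt, which bounds prompt length and therefore
# per-request latency. Names and prompt format are assumptions.
def build_prompt(instruction_prompt, history, user_message, window=3):
    recent = history[-window:] if window else []
    lines = [instruction_prompt]
    for user_turn, bot_turn in recent:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {bot_turn}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)
```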
Also, note that what you see in the chat UI is 3 tokens at a time: I aggregate n tokens into a chunk and yield that chunk. This is an experiment to see whether yielding every single token is costly.
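The chunking described above amounts to something like the sketch below; the chunk size of 3 mirrors what is described, and `token_stream` stands in for whatever per-token generator the Space actually uses (both are assumptions):

```python
# Rough sketch of aggregating n tokens into a chunk before yielding to the UI.
def chunked(token_stream, n=3):
    buffer = []
    for token_text in token_stream:
        buffer.append(token_text)
        if len(buffer) == n:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left at the end
        yield "".join(buffer)
```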
Also, one could build the application so that another request takes its turn while a chunk of tokens generated for the previous request is being yielded, if I could insert asyncio.sleep(0.01) or something similar. However, Gradio does not support async generators at the moment.
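For illustration, this is what that would look like if the UI layer did accept an async generator; purely hypothetical here, since (as noted) Gradio does not support this at the moment:

```python
# Purely illustrative: an async generator that yields chunks and sleeps
# briefly so the event loop can interleave other requests.
import asyncio

async def stream_chunks(chunks):
    for chunk in chunks:
        yield chunk
        await asyncio.sleep(0.01)  # give other coroutines a chance to run

async def main():
    async for chunk in stream_chunks(["Hello", ", ", "world", "!"]):
        print(chunk, end="", flush=True)

asyncio.run(main())
```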
How many tokens per second in local mode? I am hoping to test it with 512 tokens after buying a Colab subscription, but I would like to know whether I can try 1024 or 2048 tokens.