How to enable streaming? #30
by Alexander-Minushkin
The `generate` function here https://github.com/mistralai/mistral-inference/blob/main/src/mistral_inference/generate.py doesn't have a streaming parameter.
If you use transformers, you can pass a streamer to the `generate` method, run generation in a separate thread, and then iterate over the streamer's contents. See the `TextIteratorStreamer` example in the docs:
https://huggingface.co/docs/transformers/internal/generation_utils
```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
streamer = TextIteratorStreamer(tok)

# Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=20)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# The streamer yields decoded text chunks as soon as they are produced.
generated_text = ""
for new_text in streamer:
    generated_text += new_text
thread.join()
print(generated_text)
```
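The same pattern applies to the Mistral weights themselves if you load them through transformers instead of mistral-inference. A minimal sketch, assuming the `mistralai/Mistral-7B-Instruct-v0.3` checkpoint (any transformers-compatible Mistral checkpoint works the same way) and that accelerate is installed so `device_map="auto"` is available:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Assumption: any transformers-compatible Mistral checkpoint you have access to.
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok(["[INST] Write a haiku about streaming. [/INST]"], return_tensors="pt").to(model.device)

# skip_prompt=True keeps the input prompt out of the stream; skip_special_tokens
# is forwarded to the tokenizer's decode() call.
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

thread = Thread(target=model.generate, kwargs=dict(inputs, streamer=streamer, max_new_tokens=64))
thread.start()

# Print each chunk as it arrives instead of waiting for the full completion.
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```

Dropping the prompt from the stream with `skip_prompt=True` is usually what you want when forwarding chunks to a chat UI, since the client already has the prompt.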