Multi-round conversation w/ PKV cache example code

by Xenova HF staff - opened May 18

May 18

Hi there! As seen in your README, the model seemingly supports multi-round conversations. Does this also work with passing past key values? If so, could you provide example code for this, as it will dramatically improve performance? Thanks!

qnguyen3

Owner May 18

Hi @Xenova, I honestly do not know the answer. I will look into it to see if it is possible.

Xenova

May 18

Great! It will greatly speed up time-to-first-token for the web demo I'm working on. If it doesn't work, that's alright: it will produce the same results, just a bit slower, since it needs to recompute the KV cache on the second run.
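To illustrate why the results should match either way, here is a toy sketch of causal attention with and without a key/value cache. It is not nanoLLaVA or transformers code: the `attend` and `forward` helpers are hypothetical, and it assumes query, key and value are all just the token embedding.

```python
import math

def attend(query, keys, values):
    """Single-head attention for one query over the keys/values it may see."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def forward(new_tokens, cache=None):
    """Causal attention over new_tokens, reusing cached keys/values if given.

    Toy assumption: query, key and value are the token embedding itself.
    Returns (outputs for new_tokens only, updated cache).
    """
    if cache is None:
        keys, values = [], []
    else:
        keys, values = list(cache[0]), list(cache[1])
    outputs = []
    for tok in new_tokens:
        keys.append(tok)
        values.append(tok)
        outputs.append(attend(tok, keys, values))
    return outputs, (keys, values)

round_one = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # first conversation round
round_two = [[0.2, 0.8], [0.9, 0.1]]              # follow-up round

# Slow path: recompute everything from scratch on the second run.
full_out, _ = forward(round_one + round_two)

# Fast path: keep the cache from round one and process only the new tokens.
_, cache = forward(round_one)
cached_out, _ = forward(round_two, cache)

same = all(
    math.isclose(a, b)
    for full_vec, cached_vec in zip(full_out[len(round_one):], cached_out)
    for a, b in zip(full_vec, cached_vec)
)
print(same)
```

The fast path only runs attention for the two new tokens, which is where the time-to-first-token saving would come from in a real model.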

Xenova

May 18

Okay, I've got it working! It currently doesn't work in transformers due to a bug here (when the user passes in a past KV cache, it always looks at only the last token, even when the user specifies more than one new input token).

I've updated this in transformers.js and will put out a demo with this.
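For anyone curious, the bug described above can be sketched roughly like this. These are hypothetical helpers, not the actual transformers or transformers.js source, and the sketch assumes `input_ids` holds the whole conversation so far while `past_length` is the number of tokens already in the cache.

```python
def slice_last_token_only(input_ids, past_length):
    """Behaviour described above: keep only the final token when a cache exists."""
    return input_ids[-1:] if past_length > 0 else input_ids

def slice_uncached_tokens(input_ids, past_length):
    """Keep every token that the cache has not seen yet."""
    return input_ids[past_length:]

cached = [101, 7, 8, 9]     # tokens from the first round (already in the cache)
follow_up = [21, 22, 23]    # new tokens from the second round
input_ids = cached + follow_up
past_length = len(cached)

print(slice_last_token_only(input_ids, past_length))  # [23]
print(slice_uncached_tokens(input_ids, past_length))  # [21, 22, 23]
```

With last-token-only slicing, the model never sees tokens 21 and 22, so a multi-token follow-up message gets truncated whenever a user-supplied cache is present.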

Xenova

May 18

I've updated the model card + released the demo! :)

Model: https://huggingface.co/Xenova/nanoLLaVA
Demo: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu
Video:
