---
base_model: qnguyen3/nanoLLaVA
language:
- en
library_name: transformers.js
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- llava
- multimodal
- qwen
---

https://huggingface.co/qnguyen3/nanoLLaVA with ONNX weights to be compatible with Transformers.js.

## Usage (Transformers.js)

If you haven't already, you can install the [Transformers.js](https://huggingface.co/docs/transformers.js) JavaScript library from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:

```bash
npm i @huggingface/transformers
```

**Example:**

```js
import { AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Load tokenizer, processor and model
const model_id = 'Xenova/nanoLLaVA';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32' or 'q8'
        vision_encoder: 'fp16', // or 'fp32' or 'q8'
        decoder_model_merged: 'q4', // or 'q8'
    },
    // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
    { role: 'system', content: 'Answer the question.' },
    { role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
    sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(answer);
// The text reads "Small but mighty".

const new_messages = [
    ...messages,
    { role: 'assistant', content: answer },
    { role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response
const output = await model.generate({
    ...new_text_inputs,
    past_key_values,
    do_sample: false,
    max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
    output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(new_answer);
// The context of the image is that of a playful and humorous illustration of a mouse holding a weightlifting bar. The text "Small but mighty" is a playful reference to the mouse's size and strength.
```

**Demos:**

We also released an online demo, which you can try yourself: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu

---

Note: Having a separate repo for ONNX weights is intended to be a temporary solution until WebML gains more traction.
If you would like to make your models web-ready, we recommend converting to ONNX using [🤗 Optimum](https://huggingface.co/docs/optimum/index) and structuring your repo like this one (with ONNX weights located in a subfolder named `onnx`).
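
To make the `onnx` subfolder layout more concrete, here is a minimal sketch of how the per-component `dtype` setting from the example above could map to file names inside that folder. The helper `expectedOnnxFiles` and the `DTYPE_SUFFIX` table are hypothetical and not part of Transformers.js. The suffixes are an assumption about the naming convention the library appears to follow, so check them against the files actually present in the repo.

```js
// Hypothetical helper (not a Transformers.js API): lists the files one would
// expect in the `onnx` subfolder for a given per-component dtype config.

// ASSUMPTION: dtype -> file name suffix convention. Verify against the repo's
// actual file listing and your library version before relying on it.
const DTYPE_SUFFIX = {
  fp32: '',
  fp16: '_fp16',
  q8: '_quantized',
  q4: '_q4',
};

function expectedOnnxFiles(dtypeConfig, subfolder = 'onnx') {
  return Object.entries(dtypeConfig).map(([component, dtype]) => {
    if (!Object.hasOwn(DTYPE_SUFFIX, dtype)) {
      throw new Error(`Unknown dtype "${dtype}" for component "${component}"`);
    }
    return `${subfolder}/${component}${DTYPE_SUFFIX[dtype]}.onnx`;
  });
}

// Same dtype config as in the usage example above.
const files = expectedOnnxFiles({
  embed_tokens: 'fp16',
  vision_encoder: 'fp16',
  decoder_model_merged: 'q4',
});
console.log(files);
// [
//   'onnx/embed_tokens_fp16.onnx',
//   'onnx/vision_encoder_fp16.onnx',
//   'onnx/decoder_model_merged_q4.onnx'
// ]
```

A listing like this can serve as a quick checklist when exporting your own model: each component and precision you want to offer needs a matching file in the `onnx` subfolder.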