---
license: llama3.2
language:
- en
---

# Llama-3.2-11B-Vision-Instruct

This model is based on Meta's Llama-3.2-11B-Vision-Instruct and has been fine-tuned for multimodal (image and text) generation.

## Model Description

This is a vision-language model that generates text from a combined image and text prompt. It is built on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including:

* **Image captioning:** generating descriptive captions for images.
* **Visual question answering:** answering questions about the content of images.
* **Image-based dialogue:** holding conversations grounded in visual input.

## Intended Uses & Limitations

This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information and should not be relied on for critical decisions.

**Limitations:**

* The model may not always interpret image content accurately.
* It may be biased toward certain types of images or concepts.
* It may generate inappropriate or offensive content.

## How to Use

Here's an example of how to use this model in Python with the `transformers` library, wrapped in a small Gradio interface:

```python
import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # places the model on GPU(s) if available, else CPU
)

# Generate a model response for a text message plus an image
def predict(message, image):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # apply_chat_template already inserts special tokens, so skip them here
    inputs = processor(
        image, input_text, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)
    response = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(response[0], skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():
            # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send")
        with gr.Column():
            # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image")
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    # Append the (message, response) pair to the chat history
    def respond(message, image, history):
        response = predict(message, image)
        return history + [(message, response)]

    submit_button.click(
        fn=respond,
        inputs=[text_input, image_input, chatbot],
        outputs=chatbot
    )

demo.launch()
```

This code provides a simple Gradio interface for interacting with the model: upload an image, type a message, and the model generates a response conditioned on both inputs.

## More Information

For more details and examples, please visit [ruslanmv.com](https://ruslanmv.com).

## License

This model is licensed under the [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).
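
## Appendix: Minimal Programmatic Inference

If you don't need the Gradio UI, the same model can be called directly. The following is a minimal sketch under the same assumptions as the example above; the image path `example.jpg` and the prompt text are placeholders you should replace with your own.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Same checkpoint as in the Gradio example above
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical local image path; replace with your own file
image = Image.open("example.jpg")

# Single-turn chat message containing one image and one text segment
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# The chat template already adds special tokens, so don't add them again
inputs = processor(
    image, prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

This is the same processor-then-generate flow the Gradio demo uses internally, just without the UI layer, which makes it easier to script batch runs or integrate the model into another application.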