	---
	library_name: transformers
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen2-VL-7B-Instruct
	pipeline_tag: image-to-text
	---

	# Qwen2-VL-7B-Captioner-Relaxed

	## Introduction

	Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. It was fine-tuned on a hand-curated dataset built for text-to-image models and produces significantly more detailed image descriptions than the base model.

	### Key Features:

	* Enhanced Detail: Generates more comprehensive and nuanced image descriptions.
	* Relaxed Constraints: Offers less restrictive image descriptions compared to the base model.
	* Natural Language Output: Describes different subjects in the image while specifying their locations using natural language.
	* Optimized for Image Generation: Produces captions in formats compatible with state-of-the-art text-to-image generation models.

	Note: This fine-tuned model is optimized for creating text-to-image datasets. As a result, its performance on other tasks may be lower than the original model's (e.g., a ~10% decrease on mmmu_val).

	## Requirements

	If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try installing the latest version of the transformers library from source:

	`pip install git+https://github.com/huggingface/transformers`
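
Before reinstalling, it can help to confirm whether the installed `transformers` build really lacks Qwen2-VL support. The snippet below is a minimal sketch of such a check; the helper name `has_qwen2_vl_support` is hypothetical and not part of `transformers`.

```python
import importlib
import importlib.util


def has_qwen2_vl_support() -> bool:
    """Return True if the installed transformers exposes Qwen2VLForConditionalGeneration.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    # find_spec returns None when the package is not installed at all.
    if importlib.util.find_spec("transformers") is None:
        return False
    try:
        transformers = importlib.import_module("transformers")
        # Older releases do not define this class, which is what causes the ImportError.
        getattr(transformers, "Qwen2VLForConditionalGeneration")
    except Exception:
        # Any failure here is treated as "no usable Qwen2-VL support".
        return False
    return True


if __name__ == "__main__":
    if has_qwen2_vl_support():
        print("transformers already supports Qwen2-VL")
    else:
        print("Qwen2-VL support not found; consider installing transformers from source")
```

If the check returns `False`, the source install shown above is the likely fix.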

	## Quickstart
	```python
	from PIL import Image
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from transformers import BitsAndBytesConfig
	import torch

	model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    model_id, torch_dtype=torch.bfloat16, device_map="auto"
	)
	processor = AutoProcessor.from_pretrained(model_id)

	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	            },
	            {"type": "text", "text": "Describe this image."},
	        ],
	    }
	]



	image = Image.open(r"PATH_TO_YOUR_IMAGE")

	# If the image does not fit in VRAM, downscale it here or lower the model's maximum image size.
	# image = image.resize((1024, 1024))  # for example

	text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

	inputs = processor(
	    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
	)
	inputs = inputs.to("cuda")

	with torch.no_grad():
	    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
	        output_ids = model.generate(
	            **inputs,
	            max_new_tokens=384,
	            do_sample=True,
	            temperature=0.7,
	            top_k=50,
	            use_cache=True,
	        )


	# Drop the prompt tokens so that only the newly generated caption is decoded.
	generated_ids = [
	    out_ids[len(in_ids) :]
	    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
	)[0]
	print(output_text)
	```
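
The commented-out `image.resize((1024, 1024))` in the Quickstart forces a square output, which distorts non-square images. As a minimal sketch of an alternative, the hypothetical helper `fit_within` below (not part of PIL or this model) computes a target size that caps the longer side while keeping the aspect ratio. The default cap of 1024 is only an assumption; choose it to suit your VRAM.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Hypothetical helper for illustration. Images that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # round() stays close to the original ratio; max(1, ...) avoids a zero-sized side.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Usage with the Quickstart's PIL image (PIL reports size as (width, height)):
# image = image.resize(fit_within(*image.size))
```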

	For more detailed options, refer to the [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documentation.
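
Since the model is aimed at building text-to-image datasets, a common next step is captioning a whole folder. The following is a rough sketch under stated assumptions: captions are stored as `.txt` files next to each image (a layout assumed here, not prescribed by this model), and `caption_fn` is a hypothetical callable that you would implement by wrapping the Quickstart code.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_caption_files(image_dir: str, caption_fn: Callable[[Path], str]) -> List[Path]:
    """Write a .txt caption next to every image in image_dir.

    caption_fn is a hypothetical stand-in for a function that runs the model
    on one image path and returns its caption. Returns the caption file paths.
    """
    written = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files, including existing captions
        caption = caption_fn(image_path).strip()
        caption_path = image_path.with_suffix(".txt")
        caption_path.write_text(caption + "\n", encoding="utf-8")
        written.append(caption_path)
    return written
```

Passing the captioner in as a function keeps the file handling independent of the model, so the loop can be tried with a dummy captioner before spending GPU time.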