	---
	tags:
	- visual-question-answering
	license: mit
	widget:
	- text: what fabric is the lower cloth made of?
	  src: >-
	    https://assets.myntassets.com/v1/images/style/properties/7a5b82d1372a7a5c6de67ae7a314fd91_images.jpg
	- text: is there a hat worn?
	  src: >-
	    https://assets.myntassets.com/v1/images/style/properties/fee54b57fcd02b7c07d42b0918025099_images.jpg
	---
	# FaVQA - Fashion-related Visual Question Answering

	<!-- Provide a quick summary of what the model is/does. -->

	### Summary

	FaVQA is a Vision-and-Language Pre-training (VLP) model fine-tuned for a fashion-related downstream task: Visual Question Answering (VQA). It is based on ViLT, proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334), which incorporates text embeddings into a Vision Transformer (ViT), giving it a minimal design for VLP.

	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	- Model type: Visual Question Answering, ViLT
	- License: MIT
	<!-- - : [dandelin/vilt-b32-finetuned-vqa](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) -->
	- Train/test dataset: [yanka9/deepfashion-for-VQA](https://huggingface.co/datasets/yanka9/deepfashion-for-VQA), derived from [DeepFashion](https://github.com/yumingj/DeepFashion-MultiModal?tab=readme-ov-file)

	### Model Sources

	<!-- Provide the basic links for the model. -->

	- Demo: [🤗 Space](https://huggingface.co/spaces/yanka9/fashion-vqa)


	## How to Get Started with the Model

	Use the code below to get started with the model. Usage is the same as for the original ViLT model.
	```
	from transformers import ViltProcessor, ViltForQuestionAnswering
	from PIL import Image

	# prepare image + question
	# YOUR_IMAGE is a placeholder: replace it with the path to a local image file
	image = Image.open(YOUR_IMAGE).convert("RGB")
	text = "how long is the sleeve?"

	processor = ViltProcessor.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")
	model = ViltForQuestionAnswering.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")

	# prepare inputs
	encoding = processor(image, text, return_tensors="pt")

	# forward pass
	outputs = model(**encoding)
	logits = outputs.logits
	idx = logits.argmax(-1).item()
	print("Answer:", model.config.id2label[idx])
	```
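The snippet above prints only the highest-scoring answer. To inspect a few alternatives, one option is to rank the logits and look up the top candidates. The following is a minimal sketch, not part of the original model card: `top_k_answers` is a hypothetical helper, and it assumes the logits have already been converted to a plain Python list. It reports raw scores, not probabilities.

```python
def top_k_answers(scores, id2label, k=3):
    """Return the k highest-scoring (label, score) pairs.

    scores:   list of floats, one per answer class
    id2label: mapping from class index to answer string
    """
    # Sort class indices by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in ranked[:k]]


# Toy values for illustration only; real scores and labels come from the model
example_scores = [0.1, 2.5, -1.0, 1.7]
example_labels = {0: "label_0", 1: "label_1", 2: "label_2", 3: "label_3"}
print(top_k_answers(example_scores, example_labels, k=2))
```

With the model loaded as shown earlier, the call would look like `top_k_answers(outputs.logits[0].tolist(), model.config.id2label)`.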

	## Training Details

	### Training Data

	<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

	A custom dataset was developed for training the ViLT classifier. It was derived from DeepFashion-MultiModal, a large-scale, high-quality human dataset with rich multi-modal annotations. DeepFashion-MultiModal contains 44,096 high-resolution human images, including 12,701 full-body images. For each full-body image, its authors manually annotated human parsing labels for 24 classes.

	DeepFashion-MultiModal has several other properties, but for the scope of this project only the full-body images and labels were used to generate the training dataset. The labels cover at least one of the following categories: fabric, color, and shape. In total, 209,481 questions were generated for 44,096 images. The categories used for training are listed below.

	```
	'Color.LOWER_CLOTH',
	'Color.OUTER_CLOTH',
	'Color.UPPER_CLOTH',
	'Fabric.OUTER_CLOTH',
	'Fabric.UPPER_CLOTH',
	'Gender',
	'Shape.CARDIGAN',
	'Shape.COVERED_NAVEL',
	'Shape.HAT',
	'Shape.LOWER_CLOTHING_LENGTH',
	'Shape.NECKWEAR',
	'Shape.RING',
	'Shape.SLEEVE',
	'Shape.WRISTWEAR'
	```
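The category names above follow an `Attribute.TARGET` pattern, with `Gender` as the only name that has no target. As a small illustration (not code from the original project), the sketch below groups the listed names by attribute. The helper name `group_categories` is hypothetical.

```python
# Category names copied from the list above
CATEGORIES = [
    "Color.LOWER_CLOTH",
    "Color.OUTER_CLOTH",
    "Color.UPPER_CLOTH",
    "Fabric.OUTER_CLOTH",
    "Fabric.UPPER_CLOTH",
    "Gender",
    "Shape.CARDIGAN",
    "Shape.COVERED_NAVEL",
    "Shape.HAT",
    "Shape.LOWER_CLOTHING_LENGTH",
    "Shape.NECKWEAR",
    "Shape.RING",
    "Shape.SLEEVE",
    "Shape.WRISTWEAR",
]


def group_categories(names):
    """Group 'Attribute.TARGET' names by attribute.

    Names without a dot (such as 'Gender') map to an empty target list.
    """
    groups = {}
    for name in names:
        attribute, _, target = name.partition(".")
        targets = groups.setdefault(attribute, [])
        if target:
            targets.append(target)
    return groups


for attribute, targets in group_categories(CATEGORIES).items():
    print(attribute, len(targets), targets)
```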


	### Question Types

	The model supports both open-ended and closed-ended (yes/no) questions. Below are examples of questions generated for the training phase.

	```
	'how long is the sleeve?',
	'what is the length of the lower clothing?',
	'how would you describe the color of the upper cloth?',
	'whats is the color of the lower cloth?',
	'what fabric is the upper cloth made of?',
	'who is the target audience for this garment',
	'is there a hat worn?',
	'is the navel covered?',
	'does the lower clothing cover the navel?',
	```
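One rough way to tell the two question types apart is to look at the first word: in the examples above, the yes/no questions start with `is` or `does`. The sketch below applies that heuristic. It is an illustration only, not part of the model or its training code, and both the function name and the list of leading words are assumptions.

```python
# Leading words treated as yes/no markers; an assumed, non-exhaustive list
CLOSED_ENDED_STARTS = ("is", "are", "does", "do")


def is_closed_ended(question):
    """Return True if the question's first word suggests a yes/no answer."""
    words = question.strip().lower().split()
    return bool(words) and words[0] in CLOSED_ENDED_STARTS


# Questions taken from the examples above
examples = [
    "how long is the sleeve?",
    "what fabric is the upper cloth made of?",
    "is there a hat worn?",
    "does the lower clothing cover the navel?",
]
for question in examples:
    kind = "closed-ended" if is_closed_ended(question) else "open-ended"
    print(f"{kind}: {question}")
```

A check like this could be used, for example, to evaluate the two question types separately.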


	<i>This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.</i>