Update README.md
README.md CHANGED

@@ -16,7 +16,7 @@ widget:
 
 ### Summary
 
-
+A Vision-and-Language Pre-training (VLP) model for a fashion-related downstream task, Visual Question Answering (VQA). The related model, ViLT, was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) and incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for VLP.
 
 ### Model Description
 
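The summary added above hinges on ViLT's minimal design: instead of a CNN or region-proposal backbone, flattened image patches are linearly projected and concatenated with text token embeddings into one sequence for a single shared transformer. The toy NumPy sketch below illustrates only that input-pipeline idea; every dimension, weight, and token id here is an illustrative assumption, not the actual ViLT implementation (ViLT-B/32 uses 32x32 patches and hidden size 768).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only)
d = 64        # embedding dim (ViLT-B/32 actually uses 768)
patch = 32    # patch size, as in ViLT-B/32
H = W = 224   # image size
vocab = 1000  # toy vocabulary

# 1. Image -> flattened 32x32 patches, linearly projected (no CNN backbone)
image = rng.standard_normal((H, W, 3))
patches = image.reshape(H // patch, patch, W // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
W_proj = rng.standard_normal((patch * patch * 3, d)) * 0.02
img_tokens = patches @ W_proj            # (49, d): 7x7 grid of patch embeddings

# 2. Text -> token embeddings via a lookup table
E = rng.standard_normal((vocab, d)) * 0.02
text_ids = np.array([5, 42, 7])          # hypothetical ids, e.g. "what color dress"
txt_tokens = E[text_ids]                 # (3, d)

# 3. Add modality-type embeddings and concatenate: one joint sequence
#    that a single shared transformer encoder would consume.
type_img = rng.standard_normal(d) * 0.02
type_txt = rng.standard_normal(d) * 0.02
seq = np.concatenate([txt_tokens + type_txt, img_tokens + type_img], axis=0)
print(seq.shape)  # (3 text + 49 patch tokens, d)
```

The point of the sketch is step 3: because both modalities land in one sequence, all cross-modal interaction happens inside the transformer itself, which is what lets ViLT drop the heavy visual backbone used by earlier VLP models.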