	---
	license: apache-2.0
	widget:
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
	  candidate_labels: 音乐表演, 体育运动
	  example_title: 猫和狗
	---
	[中文说明](README_CN.md) \| [English](README.md)
	# Introduction
	This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs paired with related Chinese text descriptions, 400 million pairs in total. After screening, we ultimately used 100 million of them for training.
	This project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP). Our code is also open-sourced on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP); stars are welcome!
	<br><br>

	## Results
	We ran zero-shot image-text retrieval tests on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets, and zero-shot image classification tests on the ImageNet dataset. The results are shown in the tables below:

	Flickr30K-CN Zero-shot Retrieval (Official Test Set):
	<table border="1" width="120%">
	<tr align="center">
	<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
	</tr>
	<tr align="center">
	<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
	</tr>
	<tr align="center">
	<td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
	</tr>
	</table>
	<br>

	MUGE Zero-shot Retrieval (Official Validation Set):
	<table border="1" width="120%">
	<tr align="center">
	<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
	</tr>
	<tr align="center">
	<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
	</tr>
	<tr align="center">
	<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
	</tr>
	</table>
	<br>

	COCO-CN Zero-shot Retrieval (Official Test Set):
	<table border="1" width="120%">
	<tr align="center">
	<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
	</tr>
	<tr align="center">
	<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
	</tr>
	<tr align="center">
	<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
	</tr>
	</table>
	<br>

	Zero-shot Image Classification on ImageNet:
	<table border="1" width="120%">
	<tr align="center">
	<th>Task</th><th colspan="1">ImageNet</th>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
	</tr>
	<tr align="center">
	<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
	</tr>
	<tr align="center" style="background-color: Honeydew;">
	<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
	</tr>
	</table>
	<br>
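
The R@1, R@5, and R@10 columns in the retrieval tables above are Recall@K scores. The exact evaluation scripts are not reproduced in this README, so the following is only a minimal sketch of the commonly used definition: a query counts as a hit if at least one correct candidate appears among its K highest-scoring candidates, and the score is the percentage of queries that hit. The function name `recall_at_k` and the toy similarity values are hypothetical and chosen purely for illustration.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries whose top-k candidates contain a correct item.

    similarity: list of rows, one per query; similarity[q][c] scores candidate c.
    ground_truth: list of sets; ground_truth[q] holds the correct candidate indices.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices by descending score and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if relevant.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (hypothetical): 3 text queries scored against 4 candidate images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct image 1 is ranked second
    [0.7, 0.6, 0.5, 0.1],  # correct image 3 is ranked last
]
ground_truth = [{0}, {1}, {3}]

r1 = recall_at_k(similarity, ground_truth, k=1)
r2 = recall_at_k(similarity, ground_truth, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

Using a set of correct indices per query lets the same function cover image-to-text retrieval, where one image may have several matching captions.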

	<br><br>


	# Getting Started

	## Inference Code
	Inference code example:
	```python
	from PIL import Image
	import requests
	from transformers import ChineseCLIPProcessor, ChineseCLIPModel

	model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
	processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

	url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
	image = Image.open(requests.get(url, stream=True).raw)
	# Squirtle, Bulbasaur, Charmander, Pikachu in English
	texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

	# compute image feature
	inputs = processor(images=image, return_tensors="pt")
	image_features = model.get_image_features(**inputs)
	image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize

	# compute text features
	inputs = processor(text=texts, padding=True, return_tensors="pt")
	text_features = model.get_text_features(**inputs)
	text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize

	# compute image-text similarity scores
	inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
	outputs = model(**inputs)
	logits_per_image = outputs.logits_per_image # this is the image-text similarity score
	probs = logits_per_image.softmax(dim=1)
	```
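
The example above normalizes the image and text features and then takes a softmax over `logits_per_image`. As is usual for CLIP-style models, those logits are cosine similarities multiplied by a temperature factor. The following dependency-free sketch illustrates that relationship on toy vectors. The helper names, the feature values, and `logit_scale=100.0` are all assumptions for illustration; the real model uses its own learned scale and much higher-dimensional features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_feature, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    logit_scale is an assumed illustrative value, not read from the model.
    """
    image_unit = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_unit = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy 3-dimensional features (hypothetical values).
image_feature = [1.0, 2.0, 0.5]
text_features = [
    [0.9, 2.1, 0.4],   # nearly parallel to the image feature
    [-1.0, 0.2, 0.3],
    [0.1, -0.5, 2.0],
    [2.0, -1.0, 0.0],
]

probs = image_text_probs(image_feature, text_features)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 4) for p in probs])
```

With real model outputs, `best` would index into the `texts` list to give the predicted label.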
	<br><br>

	# Acknowledgments
	The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contributions.
	<br><br>