	---
	license: apache-2.0
	base_model: google/vit-base-patch16-224
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	model-index:
	- name: Human-Action-Recognition-VIT-Base-patch16-224
	  results: []
	datasets:
	- Bingsu/Human_Action_Recognition
	language:
	- en
	pipeline_tag: image-classification
	---



	# Human-Action-Recognition-VIT-Base-patch16-224

	This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.4005
	- Accuracy: 0.8786

	## Model description

	The base model, the Vision Transformer (ViT), is a BERT-like transformer encoder pretrained in a supervised fashion on ImageNet-21k at a resolution of 224x224 pixels. It was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising about 1 million images and 1,000 classes, also at a resolution of 224x224. This model fine-tunes that checkpoint further on human action images.

	Images are presented to the model as a sequence of fixed-size patches (16x16 pixels), which are linearly embedded. A [CLS] token is prepended to the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
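As a quick check of the description above, the following sketch computes how many tokens the encoder sees for one image. The function names are hypothetical, and the 3-channel (RGB) input is an assumption that is not stated in this card.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens for a square image: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + 1                        # +1 for [CLS]


def patch_vector_size(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding (assumes RGB)."""
    return patch_size * patch_size * channels


print(vit_sequence_length())  # 197
print(patch_vector_size())    # 768
```

Each position embedding corresponds to one of these 197 token positions.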

	Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
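The idea of a linear classifier on the [CLS] token can be sketched in plain Python with toy dimensions. Everything here is illustrative: the function names, the 4-dimensional hidden size, and the hand-written weights are assumptions, whereas the real model uses a much larger hidden size and learned weights.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(last_hidden_state, weights, bias):
    """Apply a linear layer to the first token ([CLS]) of one image's sequence.

    last_hidden_state: list of token vectors, with [CLS] first.
    weights: one weight vector per class; bias: one value per class.
    """
    cls_vector = last_hidden_state[0]
    logits = [
        sum(w * x for w, x in zip(class_weights, cls_vector)) + b
        for class_weights, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy example: 3 tokens with hidden size 4, and 2 classes (hypothetical values).
hidden = [[1.0, 0.0, 2.0, 0.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]]
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]
probs = classify_from_cls(hidden, weights, bias)
print(probs)
```

Only the first token vector influences the result; the patch tokens contribute indirectly, through the attention layers that produced the [CLS] state.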

	## Intended uses & limitations

	You can use the model for image classification.

	### How to use

	Here is how to use this model to classify an image of a human action into one of the following 15 categories:
	calling, clapping, cycling, dancing, drinking, eating, fighting, hugging, laughing, listening_to_music, running, sitting, sleeping, texting, using_laptop

	```python
	from transformers import pipeline
	from PIL import Image
	import requests

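	# Load the fine-tuned checkpoint from the Hugging Face Hub
	pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")
	# Download a sample image and open it with PIL
	url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
	image = Image.open(requests.get(url, stream=True).raw)
	# Returns the top predictions as a list of {'score', 'label'} dicts
	pipe(image)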

	# Output:
	# [{'score': 0.9918079972267151, 'label': 'dancing'},
	# {'score': 0.00207977625541389, 'label': 'clapping'},
	# {'score': 0.0015223610680550337, 'label': 'running'},
	# {'score': 0.0009153694845736027, 'label': 'fighting'},
	# {'score': 0.0006987180095165968, 'label': 'sitting'}]
	```
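In an application you will often want a single label rather than the full ranked list. The following is a minimal sketch of one way to post-process the pipeline output; the helper `top_prediction` and its `min_score` cutoff are hypothetical and not part of the model or of `transformers`.

```python
def top_prediction(results, min_score=0.5):
    """Return the highest-scoring label, or None if its score is below min_score.

    results: list of {'score': float, 'label': str} dicts, as returned by the pipeline.
    min_score: hypothetical confidence cutoff; tune it for your application.
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None


# Scores copied (rounded) from the example output above.
example = [
    {"score": 0.9918, "label": "dancing"},
    {"score": 0.0021, "label": "clapping"},
    {"score": 0.0015, "label": "running"},
    {"score": 0.0009, "label": "fighting"},
    {"score": 0.0007, "label": "sitting"},
]
print(top_prediction(example))                   # dancing
print(top_prediction(example, min_score=0.999))  # None
```

Returning `None` for low-confidence images lets the caller decide how to handle inputs that may not show any of the supported actions.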

	## Training and evaluation data

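	The model was fine-tuned and evaluated on the [Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition) dataset. More information needed on the train/evaluation split.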

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 64
	- eval_batch_size: 64
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 256
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 20
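To make the batch size and scheduler settings above concrete, the sketch below recomputes the effective batch size and the shape of a linear schedule with warmup. It assumes a single device, uses the final step count of 780 from the training results table, and assumes the warmup length is the ceiling of `total_steps * warmup_ratio`; the function `linear_schedule_lr` is a hypothetical stand-in, not the Trainer's own code.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Effective batch size per optimizer step (single device assumed).
train_batch_size = 64
gradient_accumulation_steps = 4
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 256

# 780 is the final step count reported in the training results table.
TOTAL_STEPS = 780
for step in (0, 39, 78, 429, 780):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

Under these assumptions the learning rate peaks at 5e-05 at step 78 and decays linearly to zero at step 780.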

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| 2.6396 | 0.99 | 39 | 2.0436 | 0.4425 |
	| 1.4579 | 2.0 | 79 | 0.7553 | 0.7917 |
	| 0.8342 | 2.99 | 118 | 0.5296 | 0.8417 |
	| 0.6649 | 4.0 | 158 | 0.4978 | 0.8496 |
	| 0.6137 | 4.99 | 197 | 0.4460 | 0.8595 |
	| 0.5374 | 6.0 | 237 | 0.4356 | 0.8627 |
	| 0.514 | 6.99 | 276 | 0.4349 | 0.8615 |
	| 0.475 | 8.0 | 316 | 0.4005 | 0.8786 |
	| 0.4663 | 8.99 | 355 | 0.4164 | 0.8659 |
	| 0.4178 | 10.0 | 395 | 0.4128 | 0.8738 |
	| 0.4226 | 10.99 | 434 | 0.4115 | 0.8690 |
	| 0.3896 | 12.0 | 474 | 0.4112 | 0.875 |
	| 0.3866 | 12.99 | 513 | 0.4072 | 0.8714 |
	| 0.3632 | 14.0 | 553 | 0.4106 | 0.8718 |
	| 0.3596 | 14.99 | 592 | 0.4043 | 0.8714 |
	| 0.3421 | 16.0 | 632 | 0.4128 | 0.8675 |
	| 0.344 | 16.99 | 671 | 0.4181 | 0.8643 |
	| 0.3447 | 18.0 | 711 | 0.4128 | 0.8687 |
	| 0.3407 | 18.99 | 750 | 0.4097 | 0.8714 |
	| 0.3267 | 19.75 | 780 | 0.4097 | 0.8683 |
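The evaluation results reported at the top of this card (loss 0.4005, accuracy 0.8786) match one row of the table above. The short sketch below finds that row by scanning for the lowest validation loss; the variable names are illustrative, and the rows are copied from the table as (epoch, step, validation loss, accuracy).

```python
# (epoch, step, validation_loss, accuracy), copied from the training results table.
ROWS = [
    (0.99, 39, 2.0436, 0.4425), (2.0, 79, 0.7553, 0.7917),
    (2.99, 118, 0.5296, 0.8417), (4.0, 158, 0.4978, 0.8496),
    (4.99, 197, 0.4460, 0.8595), (6.0, 237, 0.4356, 0.8627),
    (6.99, 276, 0.4349, 0.8615), (8.0, 316, 0.4005, 0.8786),
    (8.99, 355, 0.4164, 0.8659), (10.0, 395, 0.4128, 0.8738),
    (10.99, 434, 0.4115, 0.8690), (12.0, 474, 0.4112, 0.875),
    (12.99, 513, 0.4072, 0.8714), (14.0, 553, 0.4106, 0.8718),
    (14.99, 592, 0.4043, 0.8714), (16.0, 632, 0.4128, 0.8675),
    (16.99, 671, 0.4181, 0.8643), (18.0, 711, 0.4128, 0.8687),
    (18.99, 750, 0.4097, 0.8714), (19.75, 780, 0.4097, 0.8683),
]

# Best checkpoint by each criterion.
best_by_loss = min(ROWS, key=lambda row: row[2])
best_by_accuracy = max(ROWS, key=lambda row: row[3])

print(best_by_loss)      # (8.0, 316, 0.4005, 0.8786)
print(best_by_accuracy)  # (8.0, 316, 0.4005, 0.8786)
```

Both criteria select epoch 8 (step 316). After that point the training loss keeps falling while the validation loss stays around 0.41.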


	### Framework versions

	- Transformers 4.35.2
	- PyTorch 2.1.0+cu118
	- Datasets 2.15.0
	- Tokenizers 0.15.0
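If you want to compare your environment with the versions listed above, a small check along these lines may help. It is only a sketch: the helper `check_versions` is hypothetical, and it assumes PyTorch is installed under the distribution name `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above; "torch" is assumed to be the distribution name for PyTorch.
EXPECTED = {
    "transformers": "4.35.2",
    "torch": "2.1.0+cu118",
    "datasets": "2.15.0",
    "tokenizers": "0.15.0",
}


def check_versions(expected):
    """Map each package to (expected, installed, matches); installed is None if absent."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_versions(EXPECTED).items():
    status = "ok" if ok else "differs"
    print(f"{name}: expected {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the check only reports differences from the training environment.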


	## Fine-tuning script

	[Google Colaboratory Notebook](https://colab.research.google.com/drive/1YELczSv8r0znzcOKJ4Lt-ecP-aNqk7NV?usp=sharing)