	---
	language: es
	tags:
	- sagemaker
	- beto
	- TextClassification
	- SentimentAnalysis
	license: apache-2.0
	datasets:
	- IMDbreviews_es
	metrics:
	- accuracy
	model-index:
	- name: beto_sentiment_analysis_es
	  results:
	  - task:
	      name: Sentiment Analysis
	      type: sentiment-analysis
	    dataset:
	      name: "IMDb Reviews in Spanish"
	      type: IMDbreviews_es
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9101333333333333
	    - name: F1 Score
	      type: f1
	      value: 0.9088450094671354
	    - name: Precision
	      type: precision
	      value: 0.9105691056910569
	    - name: Recall
	      type: recall
	      value: 0.9071274298056156
	widget:
	- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
	---

	# Model beto_sentiment_analysis_es

	## A fine-tuned model for sentiment analysis in Spanish

	This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
	The base model is BETO, a BERT-base model pre-trained on a Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique.

	BETO Citation

	[Spanish Pre-Trained BERT Model and Evaluation Data](https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf)

	```
	@inproceedings{CaneteCFP2020,
	title={Spanish Pre-Trained BERT Model and Evaluation Data},
	author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
	booktitle={PML4DC at ICLR 2020},
	year={2020}
	}
	```

	## Dataset
	The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides every review in English and in Spanish, with the label in both languages.

	Sizes of datasets:
	- Train dataset: 42,500
	- Validation dataset: 3,750
	- Test dataset: 3,750
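
The split sizes above add up to 50,000 examples (85% train, 7.5% validation, 7.5% test). The card does not describe how the split was produced, so the following is only a minimal sketch of one way to obtain partitions of these sizes. The function name `split_indices` and the seed are assumptions for illustration, not the author's procedure.

```python
import random


def split_indices(n_total, n_val, n_test, seed=42):
    """Shuffle example indices and cut them into train/validation/test lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50_000, 3_750, 3_750)
print(len(train_idx), len(val_idx), len(test_idx))  # 42500 3750 3750
```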

	## Intended uses & limitations

	This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.

	## Hyperparameters
	{
	"epochs": "4",
	"train_batch_size": "32",
	"eval_batch_size": "8",
	"fp16": "true",
	"learning_rate": "3e-05",
	"model_name": "\"dccuchile/bert-base-spanish-wwm-uncased\"",
	"sagemaker_container_log_level": "20",
	"sagemaker_program": "\"train.py\""
	}
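
All hyperparameter values above are stored as strings, and in SageMaker script mode they reach the training script as command-line arguments. The `train.py` used for this model is not included in the card, so the following is a hypothetical sketch of how such a script might convert them into typed values. The helper names `str_to_bool` and `parse_hyperparameters` are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret string flags such as "true" as booleans."""
    return value.strip().lower() in {"true", "1", "yes"}


def parse_hyperparameters(argv):
    """Parse string hyperparameters into typed values (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument(
        "--model_name",
        type=str,
        default="dccuchile/bert-base-spanish-wwm-uncased",
    )
    # Ignore any extra arguments the platform may append
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(
    ["--epochs", "4", "--train_batch_size", "32", "--eval_batch_size", "8",
     "--fp16", "true", "--learning_rate", "3e-05"]
)
print(args.epochs, args.train_batch_size, args.fp16, args.learning_rate)
```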

	## Evaluation results

	- Accuracy = 0.9101333333333333

	- F1 Score = 0.9088450094671354

	- Precision = 0.9105691056910569

	- Recall = 0.9071274298056156
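
As a quick consistency check on the figures above, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values and reproduces the reported F1.

```python
precision = 0.9105691056910569
recall = 0.9071274298056156

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.908845
```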

	## Test results

	## Model in action

	### Usage for Sentiment Analysis

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "edumunozsala/beto_sentiment_analysis_es"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

	# Tokenize the review, truncating to the model's maximum input length
	inputs = tokenizer(text, return_tensors="pt", truncation=True)

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)

	# Index of the highest-scoring class
	output = outputs.logits.argmax(dim=1)
	```
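
The snippet above yields only a class index. To report a label and a confidence score, the logits can be passed through a softmax. The sketch below does this in plain Python so that it runs without downloading the model. The logits and the `id2label` mapping shown are assumed values for illustration; in practice the logits would come from `outputs.logits[0].tolist()` and the mapping from `model.config.id2label`, whose label names for this checkpoint are not stated in the card.

```python
import math


def logits_to_prediction(logits, id2label):
    """Convert raw class logits to a (label, probability) pair via softmax."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Assumed values for illustration only
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label, prob = logits_to_prediction([-1.2, 2.3], id2label)
print(label, round(prob, 3))
```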

	Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)