---
language: es
tags:
- sagemaker
- bertin
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: bertin_base_sentiment_analysis_es
  results:
  - task:
      name: Sentiment Analysis
      type: sentiment-analysis
    dataset:
      name: "IMDb Reviews in Spanish"
      type: IMDbreviews_es
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.898933
    - name: F1 Score
      type: f1
      value: 0.8989063
    - name: Precision
      type: precision
      value: 0.8771473
    - name: Recall
      type: recall
      value: 0.9217724
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---
## Model `bertin_base_sentiment_analysis_es`
### **A fine-tuned model for sentiment analysis in Spanish**
This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
The base model is **Bertin base**, a RoBERTa-base model pre-trained on the Spanish portion of mC4 using Flax.
It was trained by the Bertin Project. [Link to base model](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)

Article: Javier De la Rosa, Eduardo G. Ponferrada, Manu Romero, Paulo Villegas, Pablo González de Prado Salas, and María Grandury. "BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling." Procesamiento del Lenguaje Natural, vol. 68, no. 0, 2022. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403
## Dataset
The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides every review in English and in Spanish, with the label in both languages.
Sizes of datasets:
- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750
## Intended uses & limitations
This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.
## Hyperparameters
{
"epochs": "4",
"train_batch_size": "32",
"eval_batch_size": "8",
"fp16": "true",
"learning_rate": "3e-05",
"model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
"sagemaker_container_log_level": "20",
"sagemaker_program": "\"train.py\""
}
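
In SageMaker script mode, hyperparameters are handed to the entry-point script as command-line arguments, which is why every value above is stored as a string. The sketch below shows one way a `train.py` could parse them with `argparse`. The actual training script is not shown in this card, so the function names `parse_hyperparameters` and `str_to_bool` and the default values are assumptions made for illustration only.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret text flags such as "true"/"false" (hyperparameters arrive as strings)."""
    return value.strip().lower() in {"true", "1", "yes"}


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    # Hypothetical parser: argument names mirror the keys listed above,
    # but the real train.py may define them differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument(
        "--model_name",
        type=str,
        default="bertin-project/bertin-roberta-base-spanish",
    )
    # Ignore any extra arguments the platform may pass along.
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_hyperparameters(
        ["--epochs", "4", "--learning_rate", "3e-05", "--fp16", "true"]
    )
    print(args.epochs, args.learning_rate, args.fp16)
```

Converting `fp16` explicitly matters because any non-empty string, including `"false"`, is truthy in Python.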
## Evaluation results
- Accuracy = 0.8989333333333334
- F1 Score = 0.8989063750333421
- Precision = 0.877147319104633
- Recall = 0.9217724288840262
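
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall. This is a minimal sketch, assuming F1 was computed for a single positive class as the harmonic mean of the two values; the card does not state the averaging mode.

```python
# Values copied from the evaluation results above.
precision = 0.877147319104633
recall = 0.9217724288840262

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.898906
```

The result agrees with the reported F1 score, which suggests the three metrics describe the same class.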
## Test results
## Model in action
### Usage for Sentiment Analysis
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "edumunozsala/bertin_base_sentiment_analysis_es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into a batch of one and run inference without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# Index of the highest-scoring class
output = outputs.logits.argmax(dim=1)
```
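
The snippet above returns only the index of the winning class. To get a confidence score as well, the logits can be passed through a softmax. Here is a minimal sketch in plain Python, using made-up logits so that it runs without downloading the model. Which index means positive or negative is not stated in this card, so check `model.config.id2label` rather than assuming an order.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review (not real model output).
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# Index of the most probable class, equivalent to argmax over the logits.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

With PyTorch tensors, the same step is `torch.softmax(outputs.logits, dim=1)`.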
Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)