---
language: es
tags:
- sagemaker
- bertin
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: bertin_base_sentiment_analysis_es
  results:
  - task:
      name: Sentiment Analysis
      type: sentiment-analysis
    dataset:
      name: "IMDb Reviews in Spanish"
      type: IMDbreviews_es
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.898933
    - name: F1 Score
      type: f1
      value: 0.8989063
    - name: Precision
      type: precision
      value: 0.8771473
    - name: Recall
      type: recall
      value: 0.9217724
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---

## Model `bertin_base_sentiment_analysis_es`

### **A fine-tuned model for sentiment analysis in Spanish**

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
The base model is **Bertin base**, a RoBERTa-base model pre-trained on the Spanish portion of mC4 using Flax.
It was trained by the Bertin Project. [Link to base model](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)

- Article: BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling
- Authors: Javier De la Rosa, Eduardo G. Ponferrada, Manu Romero, Paulo Villegas, Pablo González de Prado Salas and María Grandury
- Journal: Procesamiento del Lenguaje Natural, volume 68, number 0, 2022
- URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403

## Dataset
The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides every review in both English and Spanish, along with the label in both languages.

Sizes of datasets:
- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750
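
The sizes above add up to 50,000 reviews, which corresponds to an 85% / 7.5% / 7.5% split. This card does not say how the split was produced. As a minimal sketch, assuming a simple seeded shuffle over review indices (the `split_indices` helper and the seed are hypothetical, not taken from the training script), a split with the same sizes could look like this:

```python
import random

def split_indices(n_total, n_val, n_test, seed=42):
    """Shuffle indices 0..n_total-1 and cut them into train/validation/test."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for reproducibility only
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

# Sizes taken from the list above; the total is their sum.
train_idx, val_idx, test_idx = split_indices(50_000, 3_750, 3_750)
print(len(train_idx), len(val_idx), len(test_idx))  # 42500 3750 3750
```

Because the dataset is balanced, a stratified split would keep the label ratio identical in every subset; the plain shuffle here only approximates that.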

## Intended uses & limitations

This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can also be applied to other kinds of reviews.

## Hyperparameters
    {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
    "sagemaker_container_log_level": "20",
    "sagemaker_program": "\"train.py\""
    }
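
Every value above is a string, and the string-valued entries carry an extra layer of escaped quotes. That shape is consistent with each value having been JSON-encoded before being passed to the training job. A minimal sketch, assuming that encoding, of how the values could be decoded back into typed Python objects (the `decode_hyperparameters` helper is hypothetical and not part of any SDK):

```python
import json

# Hyperparameters exactly as listed above, written as a Python dict.
raw = {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
    "sagemaker_container_log_level": "20",
    "sagemaker_program": "\"train.py\"",
}

def decode_hyperparameters(encoded):
    """Parse each JSON-encoded value into its native Python type."""
    return {name: json.loads(value) for name, value in encoded.items()}

hyperparameters = decode_hyperparameters(raw)
print(hyperparameters["epochs"], hyperparameters["fp16"], hyperparameters["model_name"])
```

After decoding, `epochs` is an integer, `fp16` is a boolean, `learning_rate` is a float, and `model_name` is a plain string without the surrounding quotes.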

## Evaluation results
- Accuracy = 0.8989333333333334
- F1 Score = 0.8989063750333421
- Precision = 0.877147319104633
- Recall = 0.9217724288840262
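
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported above. The sketch below does just that; the function name `f1_from_precision_recall` is illustrative only:

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation results above.
precision = 0.877147319104633
recall = 0.9217724288840262

f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))  # 0.8989
```

The recomputed value agrees with the reported F1 score. Recall is higher than precision, which suggests the model misses few reviews of the class being scored and is somewhat more likely to over-predict that class.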

## Test results

## Model in action

### Usage for Sentiment Analysis

```python
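import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "edumunozsala/bertin_base_sentiment_analysis_es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into PyTorch tensors (input_ids and attention_mask), truncating long reviews
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class and its label from the model config
output = outputs.logits.argmax(dim=1).item()
print(output, model.config.id2label[output])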
```
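
The snippet above yields only the index of the winning class. To report a confidence as well, the logits can be passed through a softmax. Below is a minimal sketch in plain Python using made-up logits for a two-class model; the logit values and the label order (0 = negative, 1 = positive) are assumptions for illustration, not outputs of this model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3]        # hypothetical logits, not produced by this model
labels = ["negative", "positive"]   # assumed label order

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

With PyTorch tensors, `torch.softmax(outputs.logits, dim=1)` performs the same computation directly on the model output.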

Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)