---
language: es
tags:
- sagemaker
- beto
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: beto_sentiment_analysis_es
  results:
  - task:
        name: Sentiment Analysis
        type: sentiment-analysis
    dataset:
        name: "IMDb Reviews in Spanish" 
        type: IMDbreviews_es
    metrics:
       - name: Accuracy
         type: accuracy
         value: 0.9101333333333333
       - name: F1 Score
         type: f1
         value: 0.9088450094671354
       - name: Precision
         type: precision
         value: 0.9105691056910569
       - name: Recall
         type: recall
         value: 0.9071274298056156
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---

# Model beto_sentiment_analysis_es

## **A fine-tuned model for sentiment analysis in Spanish**

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
The base model is **BETO**, a BERT model pre-trained on a Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique.

**BETO Citation**

[Spanish Pre-Trained BERT Model and Evaluation Data](https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf)

```
@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}
```

## Dataset
The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides each review in both English and Spanish, along with the label in both languages.

Sizes of datasets:
- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750
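
The sizes above correspond to an 85% / 7.5% / 7.5% split of 50,000 reviews. The card does not say how the split was produced, so the following is only a minimal sketch of one way to obtain disjoint splits of these sizes; the `split_indices` helper and the seed are hypothetical and not taken from the original training script.

```python
import random

TOTAL = 50_000  # corpus size implied by the three split sizes above
SIZES = {"train": 42_500, "validation": 3_750, "test": 3_750}

def split_indices(total, sizes, seed=42):
    """Shuffle row indices once and cut them into consecutive, disjoint slices."""
    if sum(sizes.values()) != total:
        raise ValueError("split sizes must add up to the corpus size")
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary illustrative value
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits

splits = split_indices(TOTAL, SIZES)
# Share of the corpus that ends up in each split
fractions = {name: len(rows) / TOTAL for name, rows in splits.items()}
```

Shuffling once and slicing guarantees that no review appears in more than one split.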

## Intended uses & limitations

This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.

## Hyperparameters
    {
        "epochs": "4",
        "train_batch_size": "32",
        "eval_batch_size": "8",
        "fp16": "true",
        "learning_rate": "3e-05",
        "model_name": "\"dccuchile/bert-base-spanish-wwm-uncased\"",
        "sagemaker_container_log_level": "20",
        "sagemaker_program": "\"train.py\""
    }
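
Every value in the dump above is a string, and the string values carry an extra pair of escaped quotes, which suggests that each value is JSON-encoded. Assuming that is the case, the sketch below recovers the typed values with the standard `json` module; the `decode_hyperparameters` helper is hypothetical and only illustrates the idea.

```python
import json

# Values copied from the hyperparameter dump above.
raw_hyperparameters = {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"dccuchile/bert-base-spanish-wwm-uncased\"",
    "sagemaker_container_log_level": "20",
    "sagemaker_program": "\"train.py\"",
}

def decode_hyperparameters(raw):
    """Parse each JSON-encoded string into its native Python value."""
    return {key: json.loads(value) for key, value in raw.items()}

hyperparameters = decode_hyperparameters(raw_hyperparameters)
```

After decoding, `epochs` is the integer 4, `fp16` is the boolean `True`, `learning_rate` is the float 3e-05, and `model_name` is the plain string `dccuchile/bert-base-spanish-wwm-uncased`.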

## Evaluation results

- Accuracy = 0.9101333333333333

- F1 Score = 0.9088450094671354

- Precision = 0.9105691056910569

- Recall = 0.9071274298056156
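
As a quick consistency check, the F1 score should be the harmonic mean of precision and recall. The sketch below recomputes it from the two reported values; the `f1_score` helper is defined here for illustration and is not part of the original evaluation code.

```python
precision = 0.9105691056910569
recall = 0.9071274298056156
reported_f1 = 0.9088450094671354

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(precision, recall)
# Gap between the recomputed and the reported F1 score
difference = abs(f1 - reported_f1)
```

The recomputed value agrees with the reported F1 score up to floating-point rounding.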

## Test results

## Model in action

### Usage for Sentiment Analysis

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "edumunozsala/beto_sentiment_analysis_es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into a batch of one; truncate reviews longer than the model's limit
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)

# Index of the highest-scoring class
output = outputs.logits.argmax(dim=1)
```
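
The snippet above returns only the index of the predicted class. To also report a confidence, the logits can be passed through a softmax. The sketch below does this in plain Python on hypothetical logit values (not real model output), shaped like `outputs.logits[0].tolist()`. The card does not document which index corresponds to which sentiment, so inspect `model.config.id2label` before interpreting the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single review (illustrative values only)
example_logits = [-1.2, 2.3]
probabilities = softmax(example_logits)

# Index of the most probable class and its probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence = probabilities[predicted_class]
```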

Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)