---
language: es
tags:
- sagemaker
- roberta-bne
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: roberta_bne_sentiment_analysis_es
  results:
  - task:
      name: Sentiment Analysis
      type: sentiment-analysis
    dataset:
      name: "IMDb Reviews in Spanish"
      type: IMDbreviews_es
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9106666666666666
    - name: F1 Score
      type: f1
      value: 0.9090909090909091
    - name: Precision
      type: precision
      value: 0.9063852813852814
    - name: Recall
      type: recall
      value: 0.9118127381600436
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---

# Model roberta_bne_sentiment_analysis_es

## **A fine-tuned model for sentiment analysis in Spanish**

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
The base model is **RoBERTa-base-bne**, a RoBERTa base model pre-trained on the largest Spanish corpus known to date, with a total of 570 GB.
It was trained by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html).

**RoBERTa BNE Citation**

Check out the paper for all the details: https://arxiv.org/abs/2107.07253

```
@article{gutierrezfandino2022,
  author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
  title = {MarIA: Spanish Language Models},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {68},
  number = {0},
  year = {2022},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
  pages = {39--60}
}
```

## Dataset

The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides each review in both English and Spanish, along with the label in both languages.

Sizes of the splits:

- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750
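
The split sizes above correspond to 85% / 7.5% / 7.5% of 50,000 reviews. The card does not say how the original split was made, so the following is only a minimal sketch of one way to produce splits of these sizes. The `split_indices` helper, the seed, and the plain random shuffle are assumptions for illustration.

```python
import random


def split_indices(n_examples, n_val, n_test, seed=42):
    """Shuffle example indices and cut them into train/validation/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50_000, n_val=3_750, n_test=3_750)
print(len(train_idx), len(val_idx), len(test_idx))  # 42500 3750 3750
```

A plain random shuffle does not guarantee that each split keeps the exact class balance of the full dataset. A stratified split would be needed for that.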

## Intended uses & limitations

This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can also be applied to other kinds of reviews.

## Hyperparameters

{
  "epochs": "4",
  "train_batch_size": "32",
  "eval_batch_size": "8",
  "fp16": "true",
  "learning_rate": "3e-05",
  "model_name": "\"PlanTL-GOB-ES/roberta-base-bne\"",
  "sagemaker_container_log_level": "20",
  "sagemaker_program": "\"train.py\""
}
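
Every value in the listing above is a string, and the escaped quotes around `model_name` and `sagemaker_program` suggest that the values are JSON-encoded, which is how the SageMaker Python SDK typically records hyperparameters. Under that assumption, a minimal sketch of recovering typed values looks like this. The `decode_hyperparameters` helper is hypothetical and not part of the original training script.

```python
import json

# Values copied from the hyperparameter listing above.
raw_hyperparameters = {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"PlanTL-GOB-ES/roberta-base-bne\"",
    "sagemaker_container_log_level": "20",
    "sagemaker_program": "\"train.py\"",
}


def decode_hyperparameters(raw):
    """Decode each JSON-encoded string into its Python value (hypothetical helper)."""
    return {name: json.loads(value) for name, value in raw.items()}


hp = decode_hyperparameters(raw_hyperparameters)
print(hp["epochs"], hp["fp16"], hp["learning_rate"], hp["model_name"])
```

After decoding, `epochs` is an `int`, `fp16` is a `bool`, `learning_rate` is a `float`, and `model_name` is a plain string without the extra quotes.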

## Evaluation results

- Accuracy = 0.9106666666666666
- F1 Score = 0.9090909090909091
- Precision = 0.9063852813852814
- Recall = 0.9118127381600436
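
As a quick consistency check on the numbers above, the F1 score should equal the harmonic mean of precision and recall. This assumes the usual binary (single positive class) definitions, which the card does not state explicitly.

```python
import math

# Metric values copied from the evaluation results above.
precision = 0.9063852813852814
recall = 0.9118127381600436
reported_f1 = 0.9090909090909091

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9091
print(math.isclose(f1, reported_f1, rel_tol=1e-9))  # True
```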

## Test results

## Model in action

### Usage for Sentiment Analysis

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "edumunozsala/roberta_bne_sentiment_analysis_es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # make sure dropout is disabled for inference

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into PyTorch tensors (input IDs plus attention mask)
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
output = outputs.logits.argmax(dim=1)
```
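
`outputs.logits` holds raw, unnormalized scores, one per class. Below is a minimal sketch, in plain Python, of turning such scores into probabilities with a softmax. The logit values and the label order are made up for illustration. The real label mapping should be read from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: not produced by the actual model.
example_logits = [-1.2, 2.3]
hypothetical_labels = ["negative", "positive"]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
print(hypothetical_labels[best], round(probs[best], 3))
```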

Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)