Spanish News Classification Headlines
SNCH: This model was developed by M47Labs for text classification. The base model is BETO; however, this model has not been fine-tuned on any dataset. The goal is to show how the model performs when it is used for inference without any training at all.
Validation Dataset Sample
Dataset size: 1000
Columns: idTask, task content 1, idTag, tag.
task content | tag |
---|---|
Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización | sociedad |
El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón | deportes |
Un total de 39 personas padecen ELA actualmente en la provincia | sociedad |
Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada | deportes |
Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza. | sociedad |
El primer ministro sueco pierde una moción de censura | politica |
El dólar se dispara tras la reunión de la Fed | economia |
Labels:
ciencia_tecnologia
clickbait
cultura
deportes
economia
educacion
medio_ambiente
opinion
politica
sociedad
Example of Use
Pipeline
from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline
review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines_untrained"
# Load the tokenizer and the classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)
# Wrap both in a text-classification pipeline
nlp = TextClassificationPipeline(task="text-classification",
                                 model=model,
                                 tokenizer=tokenizer)
print(nlp(review_text))
[{'label': 'medio_ambiente', 'score': 0.2834321384291023}]
PyTorch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'M47Labs/spanish_news_classification_headlines_untrained'
MAX_LEN = 32
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"
# Tokenize the text, padding or truncating it to exactly MAX_LEN tokens
encoded_review = tokenizer(
    texto,
    max_length=MAX_LEN,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']
# Run inference without tracking gradients
with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)
# The index of the highest logit is the predicted class
prediction = torch.argmax(output.logits, dim=1)
print(f'Review text: {texto}')
print(f'Predicted label: {model.config.id2label[prediction.item()]}')
Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno
Predicted label: opinion
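The pipeline example reports a score while the PyTorch example works directly on logits. For single-label models the pipeline score is typically the softmax of the logits, so the two views can be related with a small sketch like the one below. The three labels and logit values here are hypothetical and are not real output from this model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three of the labels; not real model output.
labels = ["deportes", "economia", "politica"]
logits = [0.2, 1.1, -0.3]

probs = softmax(logits)
# The predicted label is the one with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

With ten labels and an untrained classification head, the probabilities tend to be spread fairly evenly, which is consistent with the low score seen in the pipeline output above.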
A more in-depth example of how to use the model can be found in this Colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing
Validation Results
Metric | Full Dataset |
---|---|
Accuracy Score | 0.362 |
Precision (Macro) | 0.21 |
Recall (Macro) | 0.22 |
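For readers who want to compute the same kind of metrics on their own predictions, here is a minimal sketch of accuracy and macro-averaged precision and recall in plain Python. It is an illustration only: the helper name `macro_metrics` and the toy labels are hypothetical, and the card does not state which tool produced the figures in the table.

```python
from collections import Counter

def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)
    accuracy = sum(true_pos.values()) / len(y_true)
    # Per-class scores; a class that is never predicted (or never present) scores 0.
    precision = [true_pos[l] / pred_count[l] if pred_count[l] else 0.0 for l in labels]
    recall = [true_pos[l] / true_count[l] if true_count[l] else 0.0 for l in labels]
    # Macro averaging weights every class equally, regardless of its size.
    return accuracy, sum(precision) / len(labels), sum(recall) / len(labels)

# Hypothetical toy labels, not taken from the validation set.
y_true = ["deportes", "sociedad", "politica", "sociedad"]
y_pred = ["deportes", "opinion", "politica", "sociedad"]
acc, prec, rec = macro_metrics(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Because macro averaging gives each of the ten labels equal weight, rare labels that the model never predicts correctly pull the macro precision and recall well below the accuracy, as in the table above.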