Model Card for anglicisms-spanish-beto

This is a pretrained model for detecting unassimilated English lexical borrowings (a.k.a. anglicisms) in Spanish newswire. The model labels words of foreign origin (mainly English) that are used in Spanish, such as fake news, machine learning, smartwatch, influencer or streaming.

Model Details

Model Description

The model is a fine-tuned version of BETO trained on the COALAS corpus for the task of detecting lexical borrowings.

The model considers two labels:

ENG: For English lexical borrowings (smartphone, online, podcast)
OTHER: For lexical borrowings from any other language (boutique, anime, umami)

The model uses BIO encoding to account for multitoken borrowings.
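
As an illustration of BIO encoding, the sketch below tags a tokenized sentence so that the first token of a borrowing gets a B- tag and every following token of the same borrowing gets an I- tag. The tag names (B-ENG, I-ENG, O) and the helper bio_encode are assumptions made for this example, not necessarily the exact label strings stored in the model configuration.

```python
def bio_encode(tokens, spans):
    """Return one BIO tag per token.

    spans: list of (start, end, label) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)          # tokens outside any borrowing
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the same borrowing
    return tags


tokens = ["Buscamos", "data", "scientist", "para", "proyecto",
          "de", "machine", "learning", "."]
# Hypothetical gold annotation: two multitoken English borrowings.
spans = [(1, 3, "ENG"), (6, 8, "ENG")]

tags = bio_encode(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With this scheme, "data scientist" is recoverable as a single two-token borrowing rather than two unrelated one-token borrowings.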

⚠ This is not the best-performing model for this task. For better results, see the Flair model (F1=85.76, the best-performing model) or the mBERT model (F1=83.5).

Developed and shared by: Elena Álvarez Mellado
Language(s) (NLP): Spanish
License: cc-by-sa-4.0
Finetuned from model: BETO

Model Sources

Paper: Elena Álvarez-Mellado and Constantine Lignos, 2022. Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3868–3888, Dublin, Ireland. Association for Computational Linguistics.
Demo:
Observatory of anglicism usage in the Spanish press
pylazaro Python library

Metrics (on the test set)

The following table summarizes the results obtained by this model on the test set of the COALAS corpus.

LABEL	Precision	Recall	F1
ALL	85.03	81.32	83.13
ENG	85.25	83.94	84.59
OTHER	55.56	10.87	18.18
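
F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below does this for the three rows of the table; the function name f1_score is chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "ALL": (85.03, 81.32, 83.13),
    "ENG": (85.25, 83.94, 84.59),
    "OTHER": (55.56, 10.87, 18.18),
}

for label, (precision, recall, f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(f"{label}: recomputed={recomputed} reported={f1}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall on OTHER dominates that row's F1 despite the moderate precision.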

Dataset

This model was trained on COALAS, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains roughly 370,000 tokens drawn from various written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, includes a high number of out-of-vocabulary (OOV) words (92% of the borrowings in the test set are OOV) and is very borrowing-dense (20 borrowings per 1,000 tokens).

Set	Tokens	ENG	OTHER	Unique
Training	231,126	1,493	28	380
Development	82,578	306	49	316
Test	58,997	1,239	46	987
Total	372,701	3,038	123	1,683
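
The borrowing density of each split can be estimated directly from the table. The sketch below assumes that the ENG and OTHER columns count annotated borrowings, so that their sum divided by the token count gives borrowings per 1,000 tokens; the helper borrowings_per_thousand is introduced here for illustration.

```python
# Split sizes copied from the table above: (tokens, ENG, OTHER).
splits = {
    "Training": (231_126, 1_493, 28),
    "Development": (82_578, 306, 49),
    "Test": (58_997, 1_239, 46),
}


def borrowings_per_thousand(tokens, eng, other):
    """Borrowings (ENG + OTHER) per 1,000 tokens."""
    return 1000 * (eng + other) / tokens


density = {name: borrowings_per_thousand(*counts)
           for name, counts in splits.items()}
total_tokens = sum(tokens for tokens, _, _ in splits.values())

for name, value in density.items():
    print(f"{name}: {value:.1f} borrowings per 1,000 tokens")
print(f"Total tokens: {total_tokens:,}")
```

Under that assumption, the test split comes out several times denser than the training and development splits, in the same range as the figure of about 20 per 1,000 tokens quoted in the text.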

More info

More information about the dataset, model experimentation and error analysis can be found in the paper: Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.

How to use

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the tokenizer and the model from the same checkpoint so the vocabulary matches.
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-beto")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-beto")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."

# Each prediction is a dict describing one token tagged as part of a borrowing.
borrowings = nlp(example)
print(borrowings)
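
The pipeline returns one prediction per tagged token, so multitoken borrowings arrive in pieces. The sketch below shows one possible way to merge those pieces back into text spans using the character offsets in each prediction. The predictions list is a hand-written, hypothetical stand-in for pipeline output (it is not actual output of this model), and merge_borrowings is a helper written for this example.

```python
def merge_borrowings(text, predictions):
    """Merge token-level B-/I- predictions into (surface text, label) spans."""
    spans = []  # each span is [start, end, label]
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        if spans:
            last = spans[-1]
            gap = pred["start"] - last[1]
            same_word = gap == 0  # word-piece continuation of the previous token
            continues = prefix == "I" and gap <= 1 and label == last[2]
            if same_word or continues:
                last[1] = pred["end"]  # extend the current span
                continue
        spans.append([pred["start"], pred["end"], label])
    return [(text[start:end], label) for start, end, label in spans]


text = "Buscamos data scientist para proyecto de machine learning."
# Hypothetical predictions in the shape of token-classification pipeline output.
predictions = [
    {"entity": "B-ENG", "word": "data", "start": 9, "end": 13},
    {"entity": "I-ENG", "word": "scien", "start": 14, "end": 19},
    {"entity": "I-ENG", "word": "##tist", "start": 19, "end": 23},
    {"entity": "B-ENG", "word": "machine", "start": 41, "end": 48},
    {"entity": "I-ENG", "word": "learning", "start": 49, "end": 57},
]

print(merge_borrowings(text, predictions))
```

The transformers pipeline also accepts an aggregation_strategy argument that groups tokens for you; the manual version is shown here only to make the grouping logic explicit.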

Citation

BibTeX:

If you use this model, please cite the following reference:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    doi = "10.18653/v1/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}