metadata
language:
- tl
license: gpl-3.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- ljvmiranda921/tlunified-ner
metrics:
- precision
- recall
- f1
widget:
- text: >-
    MANILA - Binalewala ng Philippine National Police (PNP) nitong Sabado ang
    posibleng paglulunsad ng tinatawag na " sympathy attacks " ng Moro
    National Liberation Front (MNLF) at Abu Sayyaf matapos arestuhin si
    Indanan, Sulu Mayor Alvarez Isnaji.
- text: >-
    Pinatawan din ng apat na buwang suspensyon si Herma Gonzales - Escudero,
    chief revenue officer III ng BIR - Cotabato City, dahil sa kasong
    dishonesty at limang kaso ng perjury sa Municipal Trial Court ng Cotabato
    City . Bunga ito ng kanyang kabiguan na ideklara sa kanyang SALN noong
    2002 - 2004 ang 200 metro kwadradong lote sa South Cotabato at Toyota Revo
    noong 2001 SALN at undervaluation ng kanyang mga ari - arian sa lalawigan
    noong 2000 - 2004 SALN.
- text: >-
    Sa tila pagpapabaya sa mga magsasaka, sinabi ni Escudero na hindi
    mangyayari ang pangarap ng Department of Agriculture (DA) na maging self -
    sufficient ang Pilipinas sa bigas.
- text: >-
    MANILA - Tiniyak ng pinuno ng Government Service Insurance System (GSIS)
    na tatapatan nito ang pro - Meralco advertisement ni Judy Ann Santos upang
    isulong ang kanyang posisyon na dapat ibaba ang singil sa kuryente.
- text: >-
    Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na
    ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang
    ipapasang panukala ng Kongreso.
pipeline_tag: token-classification
co2_eq_emissions:
  emissions: 17.80725395240375
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
  ram_total_size: 31.777088165283203
  hours_used: 0.142
  hardware_used: 1 x NVIDIA GeForce RTX 3090
base_model: jcblaise/roberta-tagalog-base
model-index:
- name: SpanMarker with jcblaise/roberta-tagalog-base on TLUnified
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: TLUnified
      type: ljvmiranda921/tlunified-ner
      split: test
    metrics:
    - type: f1
      value: 0.8962499999999999
      name: F1
    - type: precision
      value: 0.8830049261083743
      name: Precision
    - type: recall
      value: 0.9098984771573604
      name: Recall
SpanMarker with jcblaise/roberta-tagalog-base on TLUnified
This is a SpanMarker model trained on the TLUnified dataset that can be used for Named Entity Recognition. This SpanMarker model uses jcblaise/roberta-tagalog-base as the underlying encoder.
Model Details
Model Description
- Model Type: SpanMarker
- Encoder: jcblaise/roberta-tagalog-base
- Maximum Sequence Length: 256 tokens
- Maximum Entity Length: 8 words
- Training Dataset: TLUnified
- Language: tl
- License: gpl-3.0
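The maximum entity length limits which spans can ever be labeled as an entity. The sketch below illustrates this, assuming the model considers every contiguous word span of up to 8 words as a candidate; `candidate_spans` is a hypothetical helper for illustration and is not part of the span_marker API.

```python
def candidate_spans(num_words, max_entity_length=8):
    """Return (start, end) word-index pairs (end exclusive) for every
    contiguous span of 1..max_entity_length words in a sentence."""
    return [
        (start, start + length)
        for start in range(num_words)
        for length in range(1, max_entity_length + 1)
        if start + length <= num_words
    ]


# A 10-word sentence: spans of 9 or 10 words are never candidates,
# so an entity that long could not be predicted in one piece.
spans = candidate_spans(num_words=10)
print(len(spans))  # 52
```

Under this assumption, an entity longer than 8 words would at best be recovered as shorter fragments.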
Model Sources
- Repository: SpanMarker on GitHub (https://github.com/tomaarsen/SpanMarkerNER)
- Thesis: SpanMarker For Named Entity Recognition
Model Labels
Label | Examples |
---|---|
LOC | "Batasan", "United States", "Israel" |
ORG | "MMDA", "International Monitoring Team", "Coordinating Committees for the Cessation of Hostilities" |
PER | "Villavicencio", "Puno", "Fernando" |
Evaluation
Metrics
Label | Precision | Recall | F1 |
---|---|---|---|
all | 0.8830 | 0.9099 | 0.8962 |
LOC | 0.8831 | 0.9293 | 0.9056 |
ORG | 0.7948 | 0.8476 | 0.8204 |
PER | 0.9235 | 0.9280 | 0.9257 |
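The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes F1 from the precision and recall values reported in this card; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall test-set precision and recall, as reported in this card's metadata.
overall_f1 = f1_score(0.8830049261083743, 0.9098984771573604)
print(f"{overall_f1:.3f}")  # 0.896

# Per-label rows from the table above: (precision, recall, reported F1).
per_label = {
    "LOC": (0.8831, 0.9293, 0.9056),
    "ORG": (0.7948, 0.8476, 0.8204),
    "PER": (0.9235, 0.9280, 0.9257),
}
for label, (p, r, reported) in per_label.items():
    # Inputs are rounded to four decimals, so allow a small tolerance.
    print(label, abs(f1_score(p, r) - reported) < 1e-3)
```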
Uses
Direct Use for Inference
from span_marker import SpanMarkerModel
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")
# Run inference
entities = model.predict("Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang panukala ng Kongreso.")
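A minimal sketch of post-processing the predictions, assuming that `model.predict` returns, for a single input string, a list of dictionaries with at least `span`, `label`, and `score` keys (check this against the SpanMarker version you have installed). The sample list is hand-written placeholder data in that assumed format, not real model output, and `group_spans_by_label` is a hypothetical helper.

```python
from collections import defaultdict


def group_spans_by_label(entities, min_score=0.0):
    """Group predicted spans by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Placeholder data in the ASSUMED output format; the scores are made up
# for illustration and do not come from the model.
sample_entities = [
    {"span": "South Cotabato", "label": "LOC", "score": 0.9},
    {"span": "Darlene Antonino - Custodio", "label": "PER", "score": 0.9},
    {"span": "Kongreso", "label": "ORG", "score": 0.4},
]

grouped = group_spans_by_label(sample_entities, min_score=0.5)
print(grouped)
```

Filtering on a score threshold trades recall for precision; the threshold of 0.5 here is arbitrary.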
Downstream Use
You can finetune this model on your own dataset.
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")
# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003
# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
# Save the fine-tuned model to a local directory
trainer.save_model("tomaarsen/span-marker-roberta-tagalog-base-tlunified-finetuned")
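When preparing your own data, each row needs one label per word. The sketch below builds such a row in plain Python, assuming the `tokens` and `ner_tags` column names and integer label IDs. The label order (including an `O` label at index 0) and the `to_row` helper are assumptions for illustration, and the example words are hand-labeled rather than taken from TLUnified.

```python
# Entity labels from this card, plus an assumed "O" (outside) label at index 0.
LABELS = ["O", "PER", "ORG", "LOC"]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}


def to_row(words, word_labels):
    """Build one row with a "tokens" column and a matching "ner_tags" column."""
    if len(words) != len(word_labels):
        raise ValueError("each word needs exactly one label")
    return {
        "tokens": list(words),
        "ner_tags": [LABEL2ID[label] for label in word_labels],
    }


# Hand-labeled illustration, not a row from the training data.
row = to_row(
    ["Mayor", "Alvarez", "Isnaji", "ng", "Sulu"],
    ["O", "PER", "PER", "O", "LOC"],
)
print(row)
```

A list of such rows can then be turned into a dataset with the tooling of your choice before being passed to the trainer.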
Training Details
Training Set Metrics
Training set | Min | Mean | Max |
---|---|---|---|
Sentence length | 1 | 31.7625 | 150 |
Entities per sentence | 0 | 2.0661 | 38 |
Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
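The `linear` scheduler with a warmup ratio of 0.1 can be sketched as follows, assuming the common definition: the learning rate ramps linearly from 0 to the peak value over the warmup steps, then decays linearly to 0. The total of 861 steps is an estimate inferred from the Training Results table (step 200 falls at epoch 0.6969, so roughly 287 steps per epoch over 3 epochs), and rounding the warmup length up is also an assumption.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given step under linear warmup and linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 861  # estimated, see the note above

for step in (0, 87, 474, 861):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```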
Training Results
Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
---|---|---|---|---|---|---|
0.6969 | 200 | 0.0083 | 0.8827 | 0.8628 | 0.8726 | 0.9762 |
1.3937 | 400 | 0.0067 | 0.8881 | 0.8959 | 0.8920 | 0.9798 |
2.0906 | 600 | 0.0069 | 0.8820 | 0.9040 | 0.8929 | 0.9800 |
2.7875 | 800 | 0.0070 | 0.8757 | 0.9133 | 0.8941 | 0.9807 |
Environmental Impact
Carbon emissions were measured using CodeCarbon.
- Carbon Emitted: 0.018 kg of CO2
- Hours Used: 0.142 hours
Training Hardware
- On Cloud: No
- GPU Model: 1 x NVIDIA GeForce RTX 3090
- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM Size: 31.78 GB
Framework Versions
- Python: 3.9.16
- SpanMarker: 1.5.1.dev
- Transformers: 4.30.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.3
Citation
BibTeX
@software{Aarsen_SpanMarker,
author = {Aarsen, Tom},
license = {Apache-2.0},
title = {{SpanMarker for Named Entity Recognition}},
url = {https://github.com/tomaarsen/SpanMarkerNER}
}