Model Card for Finetuned NepBertA-NER

This model is a fine-tuned version of the NepBERTa model, specifically trained for Named Entity Recognition (NER) tasks in the Nepali language. It recognizes entities such as persons (PER), organizations (ORG), and locations (LOC) in Nepali text. The model has been trained on a custom dataset and supports token classification for the following entity tags:

O (Outside: the token is not part of any named entity)
B-PER (Beginning of a person’s name)
I-PER (Inside of a person’s name)
B-ORG (Beginning of an organization)
I-ORG (Inside of an organization)
B-LOC (Beginning of a location)
I-LOC (Inside of a location)
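
To make the tagging scheme concrete, the following is a minimal sketch of how word-level B-/I-/O tags can be grouped into entity spans. The helper name `bio_to_spans` and the hand-written tags are illustrative assumptions, not part of the released model or its output.

```python
from typing import List, Optional, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_words.append(word)
        else:
            # "O", or an I- tag with no matching open entity, closes the span.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None

    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hand-written tags for illustration only (not model output).
words = ["शेरबहादुर", "देउवा", "नेपालको", "प्रधानमन्त्री", "हुन्"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
spans = bio_to_spans(words, tags)
print(spans)
```

In this sketch an I- tag that does not continue an open entity of the same type is simply dropped; other conventions treat it as the start of a new entity.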

Model Details

Model Description

Developed by: Priyanshu Koirala (Synapse Technologies)
Model type: Token Classification (NER)
Language(s) (NLP): Nepali
License: Apache 2.0
Finetuned from model: NepBERTa

Uses

Direct Use

The model can be directly used to recognize and classify named entities in Nepali text, such as persons, organizations, and locations. This is useful for text analysis tasks like extracting important information from Nepali documents, news articles, and customer feedback.

Downstream Use

The model can be further fine-tuned on other similar datasets or integrated into applications for Nepali language processing.

Out-of-Scope Use

The model is not intended for non-Nepali text or for entity types other than PER, ORG, and LOC, and it may perform poorly on text that differs substantially from its training data.

Bias, Risks, and Limitations

As with any NER model, there may be biases in the data that influence how the model identifies and classifies entities. It may struggle with unseen entities, domain-specific jargon, or ambiguous contexts.

Recommendations

Users should evaluate the model on data representative of their own use case before relying on it. Performance is likely to be best on text that resembles the training data, and specialized tasks may require further fine-tuning.

How to Get Started with the Model

Use the following code to start using the model:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
tokenizer = AutoTokenizer.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
model.to(device)

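def predict_ner_chunked(text, model, tokenizer, device, max_length=512):
    """Tag text chunk by chunk and return (token, label) pairs for entity tokens."""
    model.eval()
    words = text.split()
    special_tokens = set(tokenizer.all_special_tokens)
    ner_results = []

    # Chunks are cut by whitespace-separated words, but max_length counts subword
    # tokens: a chunk that tokenizes to more than max_length tokens is truncated.
    for i in range(0, len(words), max_length):
        chunk = ' '.join(words[i:i+max_length])
        tokens = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True, max_length=max_length)
        tokens = {k: v.to(device) for k, v in tokens.items()}

        with torch.no_grad():
            outputs = model(**tokens)

        # Highest-scoring label id for every token in the chunk
        predictions = torch.argmax(outputs.logits, dim=2)
        predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]

        # These are subword tokens, not whole words
        chunk_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
        for token, label in zip(chunk_tokens, predicted_labels):
            # Keep every entity label (PER, ORG, and LOC); skip "O" and special tokens
            if label != "O" and token not in special_tokens:
                ner_results.append((token, label))

    return ner_results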

# Test the model
text = "सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"
ner_results = predict_ner_chunked(text, model, tokenizer, device)
print(ner_results)
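
The function above returns one pair per subword token, so a single name can come back split into several pieces. Below is a minimal post-processing sketch that glues such pieces back together. It assumes a WordPiece-style tokenizer that marks continuation pieces with a "##" prefix, which should be checked against the actual tokenizer output; `merge_wordpieces` and the sample pairs are hypothetical.

```python
from typing import List, Tuple


def merge_wordpieces(pairs: List[Tuple[str, str]], prefix: str = "##") -> List[Tuple[str, str]]:
    """Merge continuation pieces into the preceding token, keeping its label."""
    merged: List[Tuple[str, str]] = []
    for token, label in pairs:
        if token.startswith(prefix) and merged:
            # Continuation piece: append its text to the previous token.
            head, head_label = merged[-1]
            merged[-1] = (head + token[len(prefix):], head_label)
        else:
            # First piece of a word (strip a stray prefix if there is one).
            text = token[len(prefix):] if token.startswith(prefix) else token
            merged.append((text, label))
    return merged


# Hypothetical token-level pairs, shaped like the output of predict_ner_chunked.
pairs = [("शेर", "B-PER"), ("##बहादुर", "I-PER"), ("देउवा", "I-PER")]
merged = merge_wordpieces(pairs)
print(merged)
```

The merged word keeps the label of its first piece. The sketch assumes each continuation piece directly follows the piece it belongs to, so it is safest to apply it before any filtering of the token list.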

Training Details

Training Data

The model was trained on a custom-labeled dataset in Nepali, consisting of sentences annotated with named entities for People (PER), Organizations (ORG), and Locations (LOC).

Training Procedure

Optimizer: AdamW
Learning Rate: 5e-5
Batch Size: 16
Epochs: 5
Validation Split: 20% of the dataset was reserved for validation.
Hardware: Trained on a single GPU.
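
As a rough illustration of what these settings imply, the sketch below performs a seeded 80/20 split and counts optimizer steps for batch size 16 and 5 epochs. The card does not state how the split was actually made or how large the dataset is, so `train_val_split`, the seed, and the 1000-sentence corpus are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(examples: Sequence[T], val_fraction: float = 0.2,
                    seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed and hold out val_fraction for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(examples) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val


# Hypothetical corpus size; the real dataset size is not given in this card.
sentences = [f"sentence-{i}" for i in range(1000)]
train, val = train_val_split(sentences)

batch_size, epochs = 16, 5
steps_per_epoch = math.ceil(len(train) / batch_size)
total_steps = steps_per_epoch * epochs
print(len(train), len(val), total_steps)  # 800 200 250
```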

Training Hyperparameters

Number of labels: 7 (including O label)
Maximum sequence length: 128 tokens
Gradient accumulation: 1
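
Token classification needs one label per subword token, while annotations are usually given per word. The sketch below shows one common way to align them: label the first piece of each word and mask the rest with -100 so the loss ignores them. Whether this model was trained that way is not stated here, and the label order in `LABELS` is an assumption.

```python
from typing import List, Optional

# Assumed label order; the order actually used is stored in model.config.id2label.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[str]) -> List[int]:
    """Map word-level labels onto subword tokens, masking non-initial pieces."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation piece of the same word
        previous = word_id
    return aligned


# word_ids in the shape a fast tokenizer's BatchEncoding.word_ids() returns:
# None for special tokens, otherwise the index of the source word.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B-PER", "I-PER", "O"]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 2, 0, -100]
```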

Evaluation

Metrics

The model was evaluated using the seqeval metric, with the following results on the validation set:

F1 Score: 0.89
Precision: 0.86
Recall: 0.90
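
seqeval scores whole entities, not individual tokens: a prediction counts as correct only when both its type and its boundaries match the annotation. The following is a simplified, dependency-free sketch of that idea on toy data. It is not the seqeval implementation and may count malformed tag sequences differently; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return {(type, start, end)} spans from one BIO sequence, end exclusive."""
    entities: Set[Tuple[str, int, int]] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing span
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_entities(gold_tags), extract_entities(pred_tags)
        tp += len(g & p)   # exact type and boundary match
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the person is found, the location is mislabeled as an organization.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```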

Citation for the Base Model

If you use this model or the base model in your work, please consider citing NepBERTa as follows:

@inproceedings{timilsina2022nepberta,
  title={NepBERTa: Nepali language model trained in a large corpus},
  author={Timilsina, Sulav and Gautam, Milan and Bhattarai, Binod},
  booktitle={Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing},
  year={2022},
  organization={Association for Computational Linguistics (ACL)}
}

Citation

If you use this model in your research, please consider citing it:

@misc{nepali_ner,
  author = {Synapse Technologies},
  title = {Finetuned NepBertA-NER for Nepali},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SynapseHQ/Finetuned-NER-NepBertA}},
}
