Model Arriving Soon, Still Training
Model Card: bert-dutch-finetuned
Model Description
Model Name: bert-dutch-finetuned
Model Type: BERT (Bidirectional Encoder Representations from Transformers)
Base Model: bert-base-cased
Language: Dutch (Nederlands)
Task: Masked Language Modeling (MLM), Text Classification, and other NLP tasks.
This model is a fine-tuned version of the bert-base-cased model, adapted specifically for the Dutch language. It was further pre-trained on a large Dutch corpus drawn from OSCAR and other Dutch datasets. The model can understand Dutch text and predict masked words, and it can be further fine-tuned for specific downstream NLP tasks such as Named Entity Recognition (NER) and Sentiment Analysis.
Intended Use
The bert-dutch-finetuned model can be used for various NLP tasks in Dutch, including:
- Masked Language Modeling (MLM)
- Text Classification
- Named Entity Recognition (NER)
- Question Answering (QA)
- Text Summarization
This model is ideal for researchers and practitioners working on Dutch NLP applications.
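For the classification-style tasks above, the encoder can be loaded with a task-specific head and then fine-tuned on labeled Dutch data. The snippet below is a minimal sketch assuming a hypothetical two-label sentiment setup; the label count and example sentence are placeholders, not part of this model's release.

from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical example: reuse the Dutch encoder with a freshly initialized classification head
tokenizer = BertTokenizer.from_pretrained("your-username/bert-dutch-finetuned")
model = BertForSequenceClassification.from_pretrained(
    "your-username/bert-dutch-finetuned",
    num_labels=2,  # placeholder: e.g. positive / negative sentiment
)

inputs = tokenizer("Wat een geweldige film!", return_tensors="pt")  # "What a great movie!"
logits = model(**inputs).logits  # the head is untrained until you fine-tune on labeled data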
How to Use
To use this model with the Hugging Face transformers library:
from transformers import BertTokenizer, BertForMaskedLM

# Load the fine-tuned Dutch tokenizer and model
tokenizer = BertTokenizer.from_pretrained("your-username/bert-dutch-finetuned")
model = BertForMaskedLM.from_pretrained("your-username/bert-dutch-finetuned")

# Run a forward pass on a Dutch example sentence
inputs = tokenizer("Dit is een voorbeeldzin in het Nederlands.", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds masked-LM scores for each token position
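Continuing from the snippet above, a minimal sketch of filling in a masked word; the example sentence and the predicted word are only illustrative, not guaranteed outputs of this model.

import torch

# Mask one word and ask the model to fill it in (reuses tokenizer and model from above)
inputs = tokenizer("Amsterdam is de [MASK] van Nederland.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position, then take the highest-scoring vocabulary entry at that position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # ideally something like "hoofdstad" (capital)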
Training Data
The model was trained on a large Dutch corpus consisting of various publicly available datasets, such as:
- OSCAR (Open Super-large Crawled ALMAnaCH Corpus): A multilingual corpus obtained by language classification and filtering of the Common Crawl dataset.
- Dutch Wikipedia Dumps: A collection of Dutch Wikipedia pages.
The training data includes diverse text types, covering a wide range of topics to ensure robust language understanding.
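As a rough illustration (not the exact recipe used for this model), a corpus like this can be pulled with the Hugging Face datasets library; the configuration name below is an assumption.

from datasets import load_dataset

# Assumed configuration; the exact corpus versions used for training are not recorded here.
# Recent datasets versions may require trust_remote_code=True for script-based datasets.
# Dutch Wikipedia dumps can be loaded analogously from the Hub.
oscar_nl = load_dataset("oscar", "unshuffled_deduplicated_nl", split="train", streaming=True)

print(next(iter(oscar_nl))["text"][:200])  # peek at one Dutch web document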
Training Procedure
The model was fine-tuned using the following setup (a minimal code sketch follows the list):
- Base Model: bert-base-cased
- Training Objective: Masked Language Modeling (MLM)
- Optimizer: AdamW
- Learning Rate: 5e-5
- Batch Size: 8
- Epochs: 3
- Hardware Used: A GPU-enabled environment (NVIDIA V100)
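A minimal sketch of how this configuration maps onto the transformers Trainer API, assuming a pre-tokenized Dutch corpus in the placeholder variable tokenized_dutch_dataset; details such as sequence length, warmup, and weight decay are not recorded in this card.

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Randomly masks 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-dutch-finetuned",
    learning_rate=5e-5,              # AdamW is Trainer's default optimizer
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dutch_dataset,  # placeholder: pre-tokenized Dutch corpus
    data_collator=collator,
)
trainer.train()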
Evaluation
The model was evaluated on a validation set split from the same training corpus. The evaluation metrics included:
- Perplexity for Masked Language Modeling
- Accuracy for Text Classification tasks (if applicable)
The model performs well on standard Dutch text understanding tasks but might require further fine-tuning for specific downstream applications.
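Perplexity can be read off as the exponential of the average masked-LM loss on the validation split; a minimal sketch, reusing the trainer from the sketch above and a placeholder validation set.

import math

# Perplexity = exp(average cross-entropy loss on held-out data)
eval_metrics = trainer.evaluate(eval_dataset=tokenized_dutch_validation)  # placeholder split
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")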
Limitations and Biases
- The model may exhibit biases present in the training data. This includes potential social biases or stereotypes embedded in large web-scraped datasets like OSCAR.
- The model's performance is optimized for Dutch and may not generalize well to other languages.
- It may not perform well on domain-specific tasks without additional fine-tuning.
Ethical Considerations
Users should be aware of the biases that might be present in the model outputs. It is recommended to conduct a bias assessment before deploying the model in sensitive applications, especially those related to decision-making.
Acknowledgments
This model was built using the Hugging Face transformers
library and fine-tuned on the OSCAR and Dutch Wikipedia datasets. Special thanks to the creators and maintainers of these resources.
Citation
If you use this model in your research or applications, please consider citing:
@misc{bert-dutch-finetuned,
  author       = {DJ Ober},
  title        = {{BERT} Fine-Tuned for Dutch Language},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dober123/bert-dutch-finetuned}}
}
License: WTFPL