---
license: cc-by-4.0
language:
- tn
datasets:
- dsfsi/PuoData
- dsfsi/daily-news-dikgang
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
tags:
- iptc
---

# PuoBERTa-News: A Setswana Language Model Finetuned for News Categorisation

[![Zenodo doi badge](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.8434795-blue.svg)](https://doi.org/10.5281/zenodo.8434795) [![arXiv](https://img.shields.io/badge/arXiv-2310.09141-b31b1b.svg)](https://arxiv.org/abs/2310.09141) 🤗 [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)

Give Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)

A RoBERTa-based language model finetuned for news categorisation. Based on [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)

## Model Details

### Model Description

This is a news categorisation model for Setswana.

- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co/@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0

### News Categories

We use the IPTC news codes: [https://iptc.org/standards/newscodes/](https://iptc.org/standards/newscodes/)

0. arts_culture_entertainment_and_media (Botsweretshi, setso, boitapoloso le bobegakgang)
1. crime_law_and_justice (Bosenyi, molao le bosiamisi)
2. disaster_accident_and_emergency_incident (Masetlapelo, kotsi le tiragalo ya maemo a tshoganyetso)
3. economy_business_and_finance (Ikonomi, tsa kgwebo le tsa ditšhelete)
4. education (Thuto)
5. environment (Tikologo)
6. health (Boitekanelo)
7. politics (Dipolotiki)
8. religion_and_belief (Bodumedi le tumelo)
9. society (Setšhaba)

Training, dev and validation dataset: [https://huggingface.co/datasets/dsfsi/daily-news-dikgang](https://huggingface.co/datasets/dsfsi/daily-news-dikgang).

### Model Performance

Performance of models on the Daily News Dikgang dataset

| **Model**                   | **5-fold Cross Validation F1** | **Test F1** |
|-----------------------------|--------------------------------|-------------|
| Logistic Regression + TFIDF | 60.1                           | 56.2        |
| NCHLT TSN RoBERTa           | 64.7                           | 60.3        |
| PuoBERTa                    | **63.8**                       | **62.9**    |
| PuoBERTaJW300               | *66.2*                         | *65.4*      |

### Usage

Use this model for text classification (news categorisation) of Setswana text.

## Citation Information

Bibtex Reference

```
@inproceedings{marivate2023puoberta,
  title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year = {2023},
  booktitle = {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url = {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
```

## Contributing

Your contributions are welcome! Feel free to improve the model.

## Model Card Authors

Vukosi Marivate

## Model Card Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

Email: vukosi.marivate@cs.up.ac.za

**Enjoy exploring Setswana through AI!**
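
As a rough illustration of the Usage section above, the sketch below shows one way the model might be loaded with the `transformers` text-classification pipeline. Several things here are assumptions rather than details from this card: the repository id `dsfsi/PuoBERTa-News` is inferred from the model name, the helper names `readable_label` and `classify` are hypothetical, and the fallback from generic `LABEL_<n>` names assumes that label indices follow the order of the News Categories list.

```python
# Category order as listed in the News Categories section (index -> IPTC label).
IPTC_CATEGORIES = [
    "arts_culture_entertainment_and_media",
    "crime_law_and_justice",
    "disaster_accident_and_emergency_incident",
    "economy_business_and_finance",
    "education",
    "environment",
    "health",
    "politics",
    "religion_and_belief",
    "society",
]

# ASSUMPTION: repository id inferred from the model name; it is not stated in this card.
MODEL_ID = "dsfsi/PuoBERTa-News"


def readable_label(raw_label: str) -> str:
    """Map a generic 'LABEL_<n>' name onto the category list; pass other names through.

    ASSUMPTION: label indices follow the order of the News Categories list.
    """
    prefix = "LABEL_"
    if raw_label.startswith(prefix):
        suffix = raw_label[len(prefix):]
        if suffix.isdigit() and int(suffix) < len(IPTC_CATEGORIES):
            return IPTC_CATEGORIES[int(suffix)]
    return raw_label


def classify(texts, model_id=MODEL_ID):
    """Return a (label, score) pair for each input text (hypothetical helper)."""
    # Imported lazily so the label helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True keeps long articles within the model's maximum input length.
    predictions = classifier(list(texts), truncation=True)
    return [(readable_label(p["label"]), p["score"]) for p in predictions]
```

Calling `classify(["<Setswana news text>"])` downloads the model on first use and returns one `(label, score)` pair per input. If the model's configuration already carries category names, `readable_label` leaves them unchanged.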