IPTC topic classifier (multilingual)

A SetFit model fit on 166 downsampled multilingual IPTC Subject labels to predict the mid-level news categories. For training, the labels of the lowest hierarchy level were concatenated into artificial sentences of keywords. The purpose of this classifier is to support exploring corpora as a weak labeler, since the representations of these descriptions are only approximations of real documents from those topics. The dataset I used to train the model is based on this file: https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long

Accuracy on the highest-level labels in eval: 0.9779412
Accuracy/F1/MCC on the mid-level labels in eval: 0.6992481/0.6666667/0.6992617
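The card does not say how these scores were computed. As a rough sketch only, the following pure-Python functions follow the standard definitions of accuracy and the multiclass Matthews correlation coefficient (the R_K statistic). The function names and the toy labels are assumptions made for illustration, not part of the original evaluation.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (R_K statistic)."""
    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    t_k = Counter(y_true)                             # true occurrences per class
    p_k = Counter(y_pred)                             # predicted occurrences per class
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # By convention the score is 0 when either side uses only one class.
    return numerator / denominator if denominator else 0.0


# Toy labels, made up for illustration (not the eval data of this model)
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]
print(accuracy(y_true, y_pred), multiclass_mcc(y_true, y_pred))
```

MCC takes the class distribution of both the references and the predictions into account, which is one reason it can sit well below accuracy on imbalanced label sets.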

More interestingly, I used the Kaggle dataset with headlines from the Huffington Post and manually selected 15 overlapping high-level categories to evaluate the performance: https://www.kaggle.com/datasets/rmisra/news-category-dataset

While an MCC of 0.1968043 on this dataset does not sound as good as the eval scores, the mistakes can usually also be seen as re-interpretations. For example, news on arrests was categorized as entertainment in the Huffington Post dataset, while the classifier put it into the crime category. My current impression is that this system is useful for its intended purpose.
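One way to look for such re-interpretations is to count which (dataset label, predicted label) pairs disagree most often. The snippet below is a minimal sketch of that idea. The helper name and the toy labels are hypothetical and do not reproduce the actual evaluation on the Kaggle dataset.

```python
from collections import Counter


def disagreement_pairs(dataset_labels, predicted_labels):
    """Count (dataset label, predicted label) pairs where the two differ,
    most frequent pair first."""
    pairs = Counter()
    for given, predicted in zip(dataset_labels, predicted_labels):
        if given != predicted:
            pairs[(given, predicted)] += 1
    return pairs.most_common()


# Toy labels, made up for illustration only
dataset_labels = ["entertainment", "entertainment", "politics", "sports"]
predicted_labels = ["crime", "crime", "politics", "sports"]

for (given, predicted), count in disagreement_pairs(dataset_labels, predicted_labels):
    print(f"{given} -> {predicted}: {count}")
```

Reading a few headlines from the most frequent pairs shows whether a disagreement is a real error or a defensible alternative label.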

The numeric categories can be joined with the labels by using this table: https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels
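As an illustration of that join, here is a small sketch that reads a lookup table from CSV and maps numeric predictions to label names. The column names `id` and `label`, the helper names, and the two-row toy table are assumptions. Check the actual column names in the label file before using it.

```python
import csv
import io


def load_label_table(csv_file, id_column="id", name_column="label"):
    """Read a CSV lookup table into a dict {numeric id: label name}.
    The default column names are assumptions, not taken from the real file."""
    reader = csv.DictReader(csv_file)
    return {int(row[id_column]): row[name_column] for row in reader}


def to_label_names(predictions, table):
    """Map numeric class ids to label names; unknown ids map to None."""
    return [table.get(int(p)) for p in predictions]


# Hypothetical two-row table standing in for the real label file
toy_csv = io.StringIO("id,label\n0,placeholder topic A\n1,placeholder topic B\n")
table = load_label_table(toy_csv)
names = to_label_names([1, 0, 1], table)
print(names)
```

With the real table, `predictions` would be the numeric output of the model, converted to a plain list if needed.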

It looks like the try-out API box to the right by Hugging Face does not yet handle SetFit models; I can't do anything about that.

Use like any other SetFit model:

from setfit import SetFitModel

# Download from Hub
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Run inference
preds = model(["Rachel Dolezal Faces Felony Charges For Welfare Fraud", "Elon Musk just got lucky", "The hype on AI is different from the hype on other tech topics"])
