IPTC topic classifier (multilingual)
A SetFit model fit on 166 downsampled multilingual IPTC Subject labels (the lowest hierarchy level concatenated into artificial sentences of keywords) to predict the mid-level news categories. The classifier is meant to support corpus exploration as a weak labeler, since the representations of these descriptions only approximate real documents on those topics. The dataset I used to train the model is based on this file: https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long
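As a rough illustration of how such keyword sentences might be built, the sketch below groups lowest-level label names under their mid-level parent and joins them into one string per category. The toy hierarchy, its names, and the `build_keyword_sentences` helper are made up for illustration; the preprocessing actually used for this model may differ.

```python
from collections import defaultdict

# Hypothetical (mid_level, lowest_level) label pairs; not taken from the real dataset.
TOY_HIERARCHY = [
    ("crime", "fraud"),
    ("crime", "arrest"),
    ("music", "concert"),
    ("music", "opera"),
]


def build_keyword_sentences(pairs):
    """Join the lowest-level names of each mid-level category into one keyword string."""
    grouped = defaultdict(list)
    for mid_level, lowest_level in pairs:
        grouped[mid_level].append(lowest_level)
    # One artificial "sentence" of keywords per mid-level category.
    return {mid: " ".join(names) for mid, names in grouped.items()}


sentences = build_keyword_sentences(TOY_HIERARCHY)
```

Each resulting string would then serve as one training text, with the mid-level category as its target.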
Accuracy on highest-level labels in eval: 0.9779412
Accuracy/F1/MCC on mid-level labels in eval: 0.6992481/0.6666667/0.6992617
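For readers who want to compute comparable numbers on their own labeled sample, here is a minimal dependency-free sketch of accuracy and the multiclass Matthews correlation coefficient (MCC). The function names and the toy label vectors are illustrative assumptions, and this is not necessarily the tooling that produced the figures above.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from per-class counts of true and predicted labels."""
    s = len(y_true)                                        # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))        # correctly predicted samples
    true_counts = Counter(y_true)                          # t_k: occurrences of class k
    pred_counts = Counter(y_pred)                          # p_k: predictions of class k
    labels = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_true = s * s - sum(v * v for v in true_counts.values())
    cov_pred = s * s - sum(v * v for v in pred_counts.values())
    denominator = sqrt(cov_true * cov_pred)
    # Return 0.0 when the score is undefined (e.g. only one class present).
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
acc = accuracy(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
```

MCC accounts for how predictions are distributed across classes, so it can be far lower than accuracy when a classifier favors a few categories.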
More interestingly, I used the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate performance: https://www.kaggle.com/datasets/rmisra/news-category-dataset
While an MCC of 0.1968043 on this dataset does not sound as good as the eval results above, the mistakes can often be seen as reinterpretations. For example, news on arrests that was categorized as entertainment in the Huffington Post dataset was put into the crime category by the classifier. My current impression is that this system is useful for its intended purpose.
The numeric categories can be joined with the labels by using this table: https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels
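The join described above could look roughly like the following sketch. The inline CSV excerpt, the column names `id` and `label`, and both helper functions are assumptions for illustration; check the actual table for its real column names and contents.

```python
import csv
import io

# Hypothetical excerpt of the label table; the real column names and rows may differ.
LABEL_TABLE_CSV = """id,label
0,example category A
1,example category B
2,example category C
"""


def load_label_map(csv_text, id_column="id", label_column="label"):
    """Build a {numeric id: label name} lookup from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[id_column]): row[label_column] for row in reader}


def to_label_names(predictions, label_map):
    """Translate numeric predictions into label names; unseen ids map to 'unknown'."""
    return [label_map.get(int(p), "unknown") for p in predictions]


label_map = load_label_map(LABEL_TABLE_CSV)
names = to_label_names([2, 0, 5], label_map)
```

Passing each prediction through `int()` is meant to let the same helper accept plain integers as well as array or tensor elements.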
It looks like the Hugging Face "try out" API box to the right does not yet handle SetFit models; I can't do anything about that.
Use it like any other SetFit model:
from setfit import SetFitModel

# Download the model from the Hub
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Run inference on a list of headlines
preds = model(["Rachel Dolezal Faces Felony Charges For Welfare Fraud", "Elon Musk just got lucky", "The hype on AI is different from the hype on other tech topics"])