# GitHub issues classifier (using zero-shot classification)
Predicts whether a statement is a feature request, an issue/bug, or a question.

This model was trained using the zero-shot classifier distillation method, with the BART-large-mnli model as the teacher, to train a classifier on GitHub issues from the GitHub Issues Prediction dataset.
## Labels
As per the Kaggle competition the dataset comes from, the classifier predicts whether an issue is a bug, a feature, or a question. After experimenting with different label names before training, I used a different mapping of labels that yielded better predictions (see the notebook here for details). The labels are:
- issue
- feature request
- question
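Once trained, the classifier can be loaded with the standard `transformers` text-classification pipeline. A minimal usage sketch, assuming the model has been pushed to the Hub or saved locally (the model identifier and the printed scores below are placeholders):

```python
from transformers import pipeline

# Placeholder identifier: point this at the actual model repo or local checkpoint.
classifier = pipeline("text-classification", model="your-username/github-issues-classifier")

print(classifier("Add dark mode support to the settings page"))
# e.g. [{'label': 'feature request', 'score': 0.97}]  (scores are illustrative)

print(classifier("App crashes when opening a large attachment"))
# e.g. [{'label': 'issue', 'score': 0.95}]
```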
## Training data
- 15k GitHub issue titles ("unlabeled_titles_simple.txt")
- Hypothesis template used: "This request is a {}" (illustrated in the sketch after this list)
- Teacher model used: valhalla/distilbart-mnli-12-1
- Student model used: distilbert-base-uncased
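As a sketch of how the teacher scores the unlabeled titles during distillation, here is the zero-shot setup with the hypothesis template above (the example text is illustrative, not taken from the training data):

```python
from transformers import pipeline

# Zero-shot teacher that scores each title against the three candidate labels.
teacher = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")

result = teacher(
    "How do I configure the retry timeout?",
    candidate_labels=["issue", "feature request", "question"],
    hypothesis_template="This request is a {}",
)

# The pipeline returns candidate labels sorted by score, highest first.
print(result["labels"][0], result["scores"][0])
```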
## Results
Agreement between student and teacher predictions: 94.82%

See this notebook for more information on the feature engineering choices made.
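For reference, the agreement figure above is read here as the fraction of examples on which the student's top predicted label matches the teacher's. A minimal sketch of that computation (the helper and the inputs are hypothetical):

```python
# Hypothetical helper: share of examples where the student's top label
# matches the teacher's top label.
def agreement(teacher_preds, student_preds):
    assert len(teacher_preds) == len(student_preds)
    return sum(t == s for t, s in zip(teacher_preds, student_preds)) / len(teacher_preds)

print(agreement(["issue", "question", "issue"], ["issue", "question", "feature request"]))
# 0.666...
```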
## How to train using your own dataset
- Download the training dataset from https://www.kaggle.com/datasets/anmolkumar/github-bugs-prediction
- Modify and run convert.py, updating the paths, to convert the data to a CSV (a conversion sketch follows this list)
- Run distill.py with the CSV file (see here for more info)
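For orientation, a minimal sketch of what the conversion step might look like, assuming the Kaggle training file is a JSON array of records with a `title` field. The file names and field names are assumptions; adapt them to whatever convert.py and distill.py actually expect.

```python
import json

# Assumed input: the Kaggle training file as a JSON array of issue records.
with open("embold_train.json") as f:
    records = json.load(f)

# Keep one non-empty title per line, matching the "unlabeled_titles_simple.txt"
# format mentioned under "Training data" above.
titles = [r["title"].strip() for r in records if r.get("title")]

with open("unlabeled_titles_simple.txt", "w") as out:
    out.write("\n".join(titles))
```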
## Acknowledgements
- Joe Davison and his article on Zero-Shot Learning in Modern NLP
- Jeremy Howard, fast.ai and his notebook Iterate like a grandmaster