Model overconfident
I'd love some advice on how to avoid getting 99% likelihood on statements that are clearly apolitical:
## initialize the political huggingface model (republican vs democrat tweets)
## https://huggingface.co/m-newhauser/distilbert-political-tweets
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
poltokn = AutoTokenizer.from_pretrained("m-newhauser/distilbert-political-tweets")
polmodel = AutoModelForSequenceClassification.from_pretrained("m-newhauser/distilbert-political-tweets")
polpipeline = pipeline("sentiment-analysis", model=polmodel, tokenizer=poltokn)
Testing...
polpipeline("These pretzels are making me thirsty!")
>>>[{'label': 'Republican', 'score': 0.9996196031570435}]
Pretzels make Democrats thirsty too, I believe.
I've mitigated this problem by averaging the predictions over a batch of tweets from a single account (a sketch of that averaging is below the results). This person is mostly Democrat but makes a lot of statements that would appeal to folks in the middle, and the balance seems accurate.
@DeanObeidallah
PREDICTED PARTY
Counter({'Democrat': 211, 'Republican': 129})
{'rep': 37.94, 'dem': 62.06}
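For reference, here's a minimal sketch of that averaging step. It reuses polpipeline from the snippet at the top; the tweets list of one account's tweet texts is a hypothetical placeholder for however you collect them:
from collections import Counter

def account_party_balance(tweets, pipe=polpipeline):
    ## classify each tweet and tally the predicted party labels
    labels = [pipe(text)[0]["label"] for text in tweets]
    counts = Counter(labels)
    total = sum(counts.values())
    ## turn the tallies into the rep/dem percentage balance shown above
    return counts, {
        "rep": round(100 * counts.get("Republican", 0) / total, 2),
        "dem": round(100 * counts.get("Democrat", 0) / total, 2),
    }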
Am I right to assume that this model wasn't trained on any apolitical tweets? A better version would tell us when a tweet isn't political at all, so it can be excluded. Alas, I have to use other methods for that (a rough sketch of one option is below).
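In case it's useful, this is the kind of filter I mean: a generic zero-shot classifier run before the party model, so non-political tweets never reach it. The model choice (facebook/bart-large-mnli) and the 0.5 threshold are just my assumptions here, not anything this political model provides:
from transformers import pipeline

## a generic zero-shot model, separate from the party classifier above
zeroshot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def is_political(text, threshold=0.5):
    ## score the tweet against two candidate labels and keep it only if
    ## "political" wins by the (arbitrary) threshold
    result = zeroshot(text, candidate_labels=["political", "not political"])
    return result["labels"][0] == "political" and result["scores"][0] >= threshold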
I'm a bit confused by its confidence as well.
pipeline("I'm a Republican who loves God, guns, and the GOP, but hates Trump. Is that so strange?")
>>>[{'label': 'Democrat', 'score': 0.999996542930603}]
or...
pipeline("I am happily a Republican.")
>>>[{'label': 'Democrat', 'score': 0.9964742064476013}]
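Part of what's going on, I think, is that the score is a softmax over only two classes, so the two probabilities always sum to 1 and one side tends to get pushed toward an extreme. If it helps to see both numbers at once, recent transformers versions let you ask the text-classification pipeline for all labels (the top_k=None argument at call time is my assumption about your version; older releases used return_all_scores=True instead):
polpipeline("I am happily a Republican.", top_k=None)
## returns both labels with their scores; because the softmax covers just two
## classes, a confident 'Democrat' score implies a near-zero 'Republican' score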
If you're looking for an overall better approach (that may or may not use this model), I posted a long description of mine here: https://chewychunks.wordpress.com/2023/03/29/predicting-political-orientation-from-social-media/