debate2vec
Word vectors trained on a large corpus of competitive debate evidence, plus the data extraction and processing scripts
Download Link
GitHub won't let me store large files in its repos, so the vectors are hosted elsewhere.
- FastText Vectors Here (~260 MB)
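Once downloaded, the vectors can be read by any tool that understands fastText output. The following is a minimal sketch using only the standard library. It assumes the download is in fastText's text `.vec` format (a header line with vocabulary size and dimension, then one word and its values per line). The file name `debate2vec.vec` in the comment and the helper names `load_vectors` and `cosine` are placeholders, not part of this repo.

```python
import io
import math


def load_vectors(handle):
    """Read fastText text-format vectors from an open text file handle."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for the real file, so the sketch runs as-is.
# With the real download (assumed name):
#     with open("debate2vec.vec", encoding="utf-8") as f:
#         vectors, dim = load_vectors(f)
sample = io.StringIO("3 2\nplan 1.0 0.0\ncounterplan 0.8 0.6\nocean 0.0 1.0\n")
vectors, dim = load_vectors(sample)
print(dim, cosine(vectors["plan"], vectors["counterplan"]))
```

If the release turns out to be a binary `.bin` model instead, this loader will not apply and the fastText library itself would be needed.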
About
Created from all publicly available cross-examination competitive debate evidence posted by the community on Open Evidence from 2013 to 2020.
Search through the original evidence at debate.cards.
Stats about this corpus:
- 222,485 unique documents longer than 200 words (DebateSum plus some additional debate documents that weren't well-formed enough for inclusion in DebateSum)
- 107,555 unique words (each showing up more than 10 times in the corpus)
- 101 million total words
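The thresholds behind these counts can be sketched as follows. This illustrates the stated cutoffs only and is not the repo's extraction script: whitespace tokenization, exact-match deduplication, counting totals over kept documents only, and the function name `corpus_stats` are all assumptions.

```python
from collections import Counter


def corpus_stats(documents, min_doc_words=200, min_word_count=10):
    """Return (kept document count, vocabulary size, total words)."""
    seen = set()
    counts = Counter()
    kept = 0
    for text in documents:
        tokens = text.lower().split()
        # Keep only unique documents longer than min_doc_words.
        if len(tokens) <= min_doc_words or text in seen:
            continue
        seen.add(text)
        kept += 1
        counts.update(tokens)
    # Vocabulary: words appearing more than min_word_count times.
    vocab = sum(1 for c in counts.values() if c > min_word_count)
    return kept, vocab, sum(counts.values())


docs = ["debate " * 201, "debate " * 201, "short document"]
print(corpus_stats(docs))
```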
Stats about debate2vec vectors:
- 300 dimensions, minimum word count of 10, trained for 100 epochs with a learning rate of 0.10 using FastText
- Lowercased (a cased version will be released later)
- No subword information
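For anyone who wants to train comparable vectors on their own corpus, the settings above map onto fastText command-line flags roughly as shown below. This is a sketch, not the command used for this release: the `skipgram` mode, the file names `corpus.txt` and `debate2vec`, and the use of `-minn 0 -maxn 0` to turn off subword n-grams are assumptions.

```python
import shlex


def fasttext_command(corpus="corpus.txt", output="debate2vec", mode="skipgram"):
    """Build a fastText CLI argument list matching the stated settings."""
    return [
        "fasttext", mode,      # training mode is an assumption (skipgram or cbow)
        "-input", corpus,      # corpus is assumed to be lowercased beforehand
        "-output", output,
        "-dim", "300",         # 300 dimensions
        "-minCount", "10",     # drop words seen fewer than 10 times
        "-epoch", "100",       # 100 epochs
        "-lr", "0.10",         # learning rate
        "-minn", "0",          # no character n-grams,
        "-maxn", "0",          # i.e. no subword information
    ]


command = fasttext_command()
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run` if the `fasttext` binary is on the path.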
The corpus includes the following topics:
- 2013-2014 Cuba/Mexico/Venezuela Economic Engagement
- 2014-2015 Oceans
- 2015-2016 Domestic Surveillance
- 2016-2017 China
- 2017-2018 Education
- 2018-2019 Immigration
- 2019-2020 Reducing Arms Sales
Other topics that these word vectors should handle well:
- Philosophy (especially left-wing and postmodernist)
- Law
- Government
- Politics
The initial release consists of FastText vectors without subword information. Future releases will include a fine-tuned GPT-2 and other high-end models as my GPU compute allows.