
# debate2vec

Word vectors created from a large corpus of competitive debate evidence, along with the data extraction and processing scripts.

## Download Link

GitHub won't let me store large files in its repos.

## About

Created from all publicly available Cross Examination competitive debate evidence posted by the community on Open Evidence (from 2013 to 2020).

Search through the original evidence by going to debate.cards

Stats about this corpus:

* 222,485 unique documents larger than 200 words (DebateSum plus some additional debate documents that weren't well-formed enough for inclusion in DebateSum)
* 107,555 unique words (appearing more than 10 times in the corpus)
* 101 million total words

Stats about the debate2vec vectors:

* 300 dimensions, a minimum word count of 10, trained for 100 epochs with the learning rate set to 0.10 using FastText
* Lowercased (a cased version will be released later)
* No subword information
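For readers who want to reproduce vectors with the same settings, the hyperparameters above map directly onto the official `fasttext` Python package. This is a hedged sketch, not the author's actual training script: the corpus path is a placeholder, and the skipgram objective is an assumption (the README does not state skipgram vs. CBOW).

```python
# Hyperparameters matching the stats listed above.
PARAMS = dict(
    dim=300,      # 300-dimensional vectors
    minCount=10,  # drop words appearing fewer than 10 times
    epoch=100,    # train for 100 epochs
    lr=0.10,      # learning rate 0.10
    maxn=0,       # maxn=0 disables subword (character n-gram) information
)

def train(corpus_path="debate_corpus.txt"):
    """Hypothetical training run; requires `pip install fasttext`.

    `corpus_path` is a placeholder, and model="skipgram" is an assumed
    choice, since the README does not specify the training objective.
    """
    import fasttext  # imported lazily so PARAMS is usable standalone
    model = fasttext.train_unsupervised(corpus_path, model="skipgram", **PARAMS)
    model.save_model("debate2vec.bin")
    return model
```

Setting `maxn=0` is what "no subword information" means in FastText terms: no character n-grams are trained, so the model degenerates to whole-word vectors.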

The corpus includes the following topics:

* 2013-2014 Cuba/Mexico/Venezuela Economic Engagement
* 2014-2015 Oceans
* 2015-2016 Domestic Surveillance
* 2016-2017 China
* 2017-2018 Education
* 2018-2019 Immigration
* 2019-2020 Reducing Arms Sales

Other topics that this word-vector model handles extremely well:

* Philosophy (especially left-wing / post-modernist)
* Law
* Government
* Politics

The initial release consists of FastText vectors without subword information. Future releases will include a fine-tuned GPT-2 and other high-end models as my GPU compute allows.
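Once downloaded, the vectors need nothing beyond NumPy to query. The sketch below (not the project's official tooling) parses the standard word2vec text format that FastText writes to `.vec` files, a `count dim` header followed by one word and its components per line, and finds nearest neighbors by cosine similarity. The tiny inline vectors are fabricated for illustration only.

```python
import numpy as np

# Fabricated toy data in the word2vec text format: "4 3" = 4 words, 3 dims.
SAMPLE_VEC = """4 3
law 0.9 0.1 0.0
court 0.8 0.2 0.1
ocean 0.0 0.9 0.3
china 0.1 0.0 0.9
"""

def load_vec(text):
    """Parse word2vec-text-format vectors into a {word: ndarray} dict."""
    lines = text.strip().split("\n")
    n, dim = map(int, lines[0].split())
    vocab = {}
    for line in lines[1:1 + n]:
        parts = line.split()
        vocab[parts[0]] = np.array(parts[1:], dtype=float)
    return vocab

def nearest(vocab, query, k=1):
    """Return the k words closest to `query` by cosine similarity."""
    q = vocab[query]
    q = q / np.linalg.norm(q)
    scored = []
    for word, vec in vocab.items():
        if word == query:
            continue
        scored.append((float(q @ (vec / np.linalg.norm(vec))), word))
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

For the real file, `load_vec(open("debate2vec.vec").read())` would work the same way, though a library such as Gensim's `KeyedVectors.load_word2vec_format` is a more practical choice at 107k words.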

## Screenshots