
# debate2vec

Word vectors created from a large corpus of competitive debate evidence, along with the data extraction and processing scripts.

## Download Link

GitHub won't let me store large files in its repos.

## About

Created from all publicly available Cross Examination competitive debate evidence posted by the community on Open Evidence (from 2013 to 2020).

Search through the original evidence by going to debate.cards

Stats about this corpus:

* 222,485 unique documents larger than 200 words (DebateSum plus some additional debate documents that weren't well-formed enough for inclusion in DebateSum)
* 107,555 unique words (appearing more than 10 times in the corpus)
* 101 million total words

Stats about the debate2vec vectors:

* 300 dimensions, a minimum word count of 10, trained for 100 epochs with the learning rate set to 0.10 using FastText
* Lowercased (a cased version will be released later)
* No subword information
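For readers who want to reproduce vectors with the same settings, the hyperparameters above map directly onto the official `fasttext` Python package. This is a hedged sketch, not the author's actual training script: the corpus path is a placeholder, and the skipgram objective is an assumption (the README does not state skipgram vs. CBOW).

```python
# Hyperparameters matching the stats listed above.
PARAMS = dict(
    dim=300,      # 300-dimensional vectors
    minCount=10,  # drop words appearing fewer than 10 times
    epoch=100,    # train for 100 epochs
    lr=0.10,      # learning rate 0.10
    maxn=0,       # maxn=0 disables subword (character n-gram) information
)

def train(corpus_path="debate_corpus.txt"):
    """Hypothetical training run; requires `pip install fasttext`.

    `corpus_path` is a placeholder, and model="skipgram" is an assumed
    choice, since the README does not specify the training objective.
    """
    import fasttext  # imported lazily so PARAMS is usable standalone
    model = fasttext.train_unsupervised(corpus_path, model="skipgram", **PARAMS)
    model.save_model("debate2vec.bin")
    return model
```

Setting `maxn=0` is what "no subword information" means in FastText terms: no character n-grams are trained, so the model degenerates to whole-word vectors.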

The corpus includes the following topics:

* 2013-2014 Cuba/Mexico/Venezuela Economic Engagement
* 2014-2015 Oceans
* 2015-2016 Domestic Surveillance
* 2016-2017 China
* 2017-2018 Education
* 2018-2019 Immigration
* 2019-2020 Reducing Arms Sales

Other topics that this word-vector model handles extremely well:

* Philosophy (especially left-wing / post-modernist)
* Law
* Government
* Politics

The initial release consists of FastText vectors without subword information. Future releases will include a fine-tuned GPT-2 and other high-end models as my GPU compute allows.
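Once downloaded, the vectors need nothing beyond NumPy to query. The sketch below (not the project's official tooling) parses the standard word2vec text format that FastText writes to `.vec` files, a `count dim` header followed by one word and its components per line, and finds nearest neighbors by cosine similarity. The tiny inline vectors are fabricated for illustration only.

```python
import numpy as np

# Fabricated toy data in the word2vec text format: "4 3" = 4 words, 3 dims.
SAMPLE_VEC = """4 3
law 0.9 0.1 0.0
court 0.8 0.2 0.1
ocean 0.0 0.9 0.3
china 0.1 0.0 0.9
"""

def load_vec(text):
    """Parse word2vec-text-format vectors into a {word: ndarray} dict."""
    lines = text.strip().split("\n")
    n, dim = map(int, lines[0].split())
    vocab = {}
    for line in lines[1:1 + n]:
        parts = line.split()
        vocab[parts[0]] = np.array(parts[1:], dtype=float)
    return vocab

def nearest(vocab, query, k=1):
    """Return the k words closest to `query` by cosine similarity."""
    q = vocab[query]
    q = q / np.linalg.norm(q)
    scored = []
    for word, vec in vocab.items():
        if word == query:
            continue
        scored.append((float(q @ (vec / np.linalg.norm(vec))), word))
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

For the real file, `load_vec(open("debate2vec.vec").read())` would work the same way, though a library such as Gensim's `KeyedVectors.load_word2vec_format` is a more practical choice at 107k words.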

## Screenshots