lfsm committed on
Commit 411bcae
1 Parent(s): 17d5c1e

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -5,5 +5,5 @@ this is ja cc filter for reference from ja wiki vs random ja common crawl, and b
  3. get pure text and remove content len less than 1k,
  4. use langdetect to tell the lang of the pages, we finally get total 101K ja pages
  5. random sample from commoncrawl 202303 and use langdetect to find 101k ja pages
- 6. tokenize all text with "cl-tohoku/bert-base-japanese"
+ 6. tokenize all text with "rinna/japanese-roberta-base"
  7. feed tokens to fasttext to get model.bin
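
For reviewers, steps 6 and 7 after this change might look roughly like the sketch below. It is not the author's script. The README does not say how the fastText model is trained, so the supervised setup, the `wiki` / `cc` label names, and all function names here are assumptions made for illustration. Loading the tokenizer through `T5Tokenizer` is also an assumption about how "rinna/japanese-roberta-base" is used; it needs `transformers` and `sentencepiece` installed, and training needs the `fasttext` package.

```python
from typing import Callable, Iterable, Iterator, List


def to_fasttext_line(label: str, text: str, tokenize: Callable[[str], List[str]]) -> str:
    """Format one document as a fastText supervised line: '__label__X tok tok ...'."""
    # fastText expects one example per line, so newlines are flattened first.
    tokens = tokenize(text.replace("\n", " "))
    return f"__label__{label} " + " ".join(tokens)


def build_training_lines(
    wiki_texts: Iterable[str],
    cc_texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield labeled lines: wiki-referenced pages first, then random CC pages."""
    for text in wiki_texts:
        yield to_fasttext_line("wiki", text, tokenize)  # label name is hypothetical
    for text in cc_texts:
        yield to_fasttext_line("cc", text, tokenize)  # label name is hypothetical


def load_rinna_tokenize() -> Callable[[str], List[str]]:
    """Step 6 (assumed usage): subword-tokenize with rinna/japanese-roberta-base."""
    from transformers import T5Tokenizer  # imported lazily; downloads the vocab on first use

    tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
    return tokenizer.tokenize


def train(train_path: str, model_path: str = "model.bin"):
    """Step 7 (assumed to be supervised training): fit fastText and save model.bin."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(input=train_path)
    model.save_model(model_path)
    return model
```

With real data, one would write the output of `build_training_lines(wiki_pages, cc_pages, load_rinna_tokenize())` to a text file, one line per page, and pass that path to `train`. Keeping the tokenizer as a plain callable makes the swap in this commit (from "cl-tohoku/bert-base-japanese" to "rinna/japanese-roberta-base") a one-line change.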