Update README.md
README.md
CHANGED
@@ -5,5 +5,5 @@ this is ja cc filter for reference from ja wiki vs random ja common crawl, and b
 3. get pure text and remove content len less than 1k,
 4. use langdetect to tell the lang of the pages, we finally get total 101K ja pages
 5. random sample from commoncrawl 202303 and use langdetect to find 101k ja pages
-6. tokenize all text with "
+6. tokenize all text with "rinna/japanese-roberta-base"
 7. feed tokens to fasttext to get model.bin
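Steps 3, 6 and 7 of the README can be sketched roughly as below. This is only an illustration under stated assumptions, not the author's actual script: it assumes "1k" means 1,000 characters, that the fastText model is a supervised classifier, and that the two classes are labelled `wiki` and `cc` (hypothetical names). The helpers `keep_long_pages`, `to_fasttext_line` and `build_training_lines` are also hypothetical, and the tokenizer is passed in as a plain callable so the sketch runs without downloading a model.

```python
from typing import Callable, Iterable, Iterator, List

# Assumption: "1k" in step 3 means 1,000 characters (the README does not say).
MIN_CHARS = 1000


def keep_long_pages(pages: Iterable[str], min_chars: int = MIN_CHARS) -> Iterator[str]:
    """Step 3: drop pages whose text is shorter than min_chars."""
    for text in pages:
        if len(text) >= min_chars:
            yield text


def to_fasttext_line(text: str, label: str, tokenize: Callable[[str], List[str]]) -> str:
    """Steps 6-7: tokenize one page and format it as a fastText training line.

    fastText's supervised format is one example per line, with the class
    given as a "__label__<name>" prefix, so newlines are flattened first.
    """
    tokens = tokenize(text.replace("\n", " "))
    return f"__label__{label} " + " ".join(tokens)


def build_training_lines(
    wiki_pages: Iterable[str],
    cc_pages: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Combine both corpora into labelled training lines.

    The label names "wiki" and "cc" are hypothetical.
    """
    lines: List[str] = []
    for label, pages in (("wiki", wiki_pages), ("cc", cc_pages)):
        for text in keep_long_pages(pages):
            lines.append(to_fasttext_line(text, label, tokenize))
    return lines


if __name__ == "__main__":
    # Whitespace splitting stands in for the real tokenizer in this demo.
    demo = build_training_lines(["a b " * 600], ["too short"], str.split)
    print(len(demo), demo[0][:20])
```

To follow the README more closely, the `tokenize` callable could be the `tokenize` method of a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained("rinna/japanese-roberta-base", use_fast=False)`, and the resulting lines could be written to a text file and passed to `fasttext.train_supervised(input=...)`, followed by `save_model("model.bin")`. Whether the original pipeline used supervised training and these exact settings is not stated in the README.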