FastText Model for Pretraining Data Curation
This classifier labels a text as either Code or NaturalLanguage.
The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.97.
It can be used for LLM pretraining data curation, for example to route text into different pipelines (e.g. a code syntax check).
It is ultra fast ⚡, with a throughput of ~2000 doc/s on CPU.
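Throughput depends on hardware and document length, so it may be worth measuring on your own data. The following is a minimal sketch, not the author's benchmark: `measure_throughput` and `dummy_predict` are hypothetical names, and any batch prediction callable can be passed in place of the stand-in.

```python
import time
from typing import Callable, List


def measure_throughput(predict_fn: Callable[[List[str]], list],
                       docs: List[str],
                       batch_size: int = 1000) -> float:
    """Return the number of documents processed per second by predict_fn."""
    start = time.perf_counter()
    for i in range(0, len(docs), batch_size):
        predict_fn(docs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    return len(docs) / max(elapsed, 1e-9)


# Stand-in for a real classifier, used only to make the sketch runnable.
def dummy_predict(batch: List[str]) -> list:
    return [{"label": "NaturalLanguage", "score": 1.0} for _ in batch]


docs_per_second = measure_throughput(dummy_predict, ["some text"] * 5000)
print(f"{docs_per_second:.0f} doc/s")
```

For a realistic number, the documents passed in should resemble the corpus being curated, since long documents take longer to classify than short ones.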
from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext

# Download and load the model; use "model_quantized.bin" for the quantized version.
model = fasttext.load_model(hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText predicts one line at a time, so newlines are collapsed into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Remove the "__label__" prefix; str.lstrip would strip a character set, not a prefix.
    return [{"label": l[0].replace("__label__", ""), "score": s[0]}
            for l, s in zip(*pred)]


predict([
    """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
    """import torch""",
    """Short text won't work"""
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
# {'label': 'Code', 'score': 1.00001},
# {'label': 'Code', 'score': 1.000009}]
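One way to use these predictions for routing is sketched below. The 0.9 threshold, the `route` function, and the bucket names are assumptions made for illustration and are not part of the model. Python's built-in `ast.parse` stands in for a code syntax check and only applies to Python source.

```python
import ast
from typing import Dict, List


def is_valid_python(source: str) -> bool:
    """Example syntax check; only meaningful for Python source."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def route(texts: List[str], preds: List[dict],
          threshold: float = 0.9) -> Dict[str, List[str]]:
    """Group texts by predicted label; low-confidence ones go to 'uncertain'."""
    buckets: Dict[str, List[str]] = {"code": [], "natural_language": [], "uncertain": []}
    for text, pred in zip(texts, preds):
        if pred["score"] < threshold:
            buckets["uncertain"].append(text)
        elif pred["label"] == "Code":
            buckets["code"].append(text)
        else:
            buckets["natural_language"].append(text)
    return buckets


# Hand-written predictions in the same shape as the output of predict().
texts = ["import torch", "A long English sentence about data curation.", "maybe code?"]
preds = [
    {"label": "Code", "score": 1.0},
    {"label": "NaturalLanguage", "score": 0.97},
    {"label": "Code", "score": 0.55},
]
buckets = route(texts, preds)
# Run the example syntax check only on texts routed to the code bucket.
checked = [(t, is_valid_python(t)) for t in buckets["code"]]
```

Keeping a separate bucket for low-confidence predictions lets those texts be reviewed or handled by a fallback rule instead of being forced into either pipeline.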
Full version
                 precision    recall  f1-score   support
           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993
       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275
Quantized version
                 precision    recall  f1-score   support
           Code       0.95      1.00      0.97    581282
NaturalLanguage       1.00      0.86      0.93    228993
      micro avg       0.96      0.96      0.96    810275
      macro avg       0.97      0.93      0.95    810275
   weighted avg       0.96      0.96      0.96    810275
Code covers:
{'Assembly',
'Batchfile',
'C',
'C#',
'C++',
'CMake',
'CSS',
'Dockerfile',
'FORTRAN',
'GO',
'HTML',
'Haskell',
'Java',
'JavaScript',
'Julia',
'Lua',
'Makefile',
'PHP',
'Perl',
'PowerShell',
'Python',
'Ruby',
'Rust',
'SQL',
'Scala',
'Shell',
'TeX',
'TypeScript',
'Visual Basic'}
Markdown is excluded because it overlaps heavily with natural language.
The classifier does not handle short text well, which is perhaps not surprising.
It tends to classify short natural-language text as Code, possibly because short natural-language snippets also occur in code as comments.
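One possible mitigation, sketched under assumptions, is to skip classification for texts below a minimum length and mark them as undecided. The cutoff of 5 whitespace-separated tokens, the `Unknown` label, and the function names are hypothetical and would need tuning on real data.

```python
from typing import Callable, List

MIN_TOKENS = 5  # hypothetical cutoff, not tuned on any data


def predict_with_length_guard(texts: List[str],
                              predict_fn: Callable[[List[str]], List[dict]],
                              min_tokens: int = MIN_TOKENS) -> List[dict]:
    """Classify only texts with at least min_tokens whitespace-separated tokens."""
    long_idx = [i for i, t in enumerate(texts) if len(t.split()) >= min_tokens]
    long_preds = predict_fn([texts[i] for i in long_idx]) if long_idx else []
    # Short texts keep a placeholder label instead of an unreliable prediction.
    results: List[dict] = [{"label": "Unknown", "score": None} for _ in texts]
    for i, pred in zip(long_idx, long_preds):
        results[i] = pred
    return results


# Stand-in classifier so the sketch runs without downloading the model.
def fake_predict(batch: List[str]) -> List[dict]:
    return [{"label": "NaturalLanguage", "score": 0.9} for _ in batch]


results = predict_with_length_guard(
    ["Short text won't work",
     "This sentence has clearly more than five tokens in it."],
    fake_predict,
)
```

In practice the `predict` function defined earlier could be passed as `predict_fn`, and texts labeled `Unknown` could be merged with neighboring text or handled by a separate rule.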