HuggingFaceFW
/

fineweb-edu-classifier

Text Classification

Inference Endpoints

Model card Files Files and versions Community

loubnabnl HF staff commited on May 31

Commit

e160ef4

•

1 Parent(s): 63f8cb7

Update README.md

Files changed (1) hide show

README.md +1 -1

README.md CHANGED Viewed

@@ -87,7 +87,7 @@ y_true     3  3225  6330   757     7     0
 While the FineWeb-Edu classifier performs well in distinguishing high-quality educational content for FineWeb dataset, there are some limitations:
 - Scope: The model's performance might change for other datasets, in particular for out of distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
-- Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using score >= 3 as a threshold for data curation.
 - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
 The training and inference code is available on GitHub

 While the FineWeb-Edu classifier performs well in distinguishing high-quality educational content for FineWeb dataset, there are some limitations:
 - Scope: The model's performance might change for other datasets, in particular for out of distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
+- Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using int_score >= 3 as a threshold for data curation.
 - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
 The training and inference code is available on GitHub