Update README.md
README.md CHANGED
@@ -24,13 +24,13 @@ inputs = tokenizer(texts, return_tensors="pt", padding="longest", truncation=Tru
 outputs = model(**inputs)
 logits = outputs.logits.squeeze(-1).float().numpy()
 score = logits.item()
-
+result = {
     "text": text,
     "score": score,
     "int_score": int(round(max(0, min(score, 5))))
 }
 
-print(
+print(result)
 ```
 
 ## Training
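The hunk above shows only the tail of the quickstart snippet. A complete, runnable version would look roughly like the sketch below; the imports, the model id `HuggingFaceFW/fineweb-edu-classifier`, and the `torch.no_grad()` guard are assumptions, since they fall outside the lines shown in this hunk.

```python
# Sketch of the full quickstart, not the verbatim README code: the imports,
# model id, and no_grad guard are assumptions (they fall outside this hunk).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis is the process by which plants convert sunlight into energy."
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

with torch.no_grad():  # inference only, no gradient tracking needed
    outputs = model(**inputs)

# single regression output: one logit per input, interpreted as a 0-5 score
score = outputs.logits.squeeze(-1).float().item()
result = {
    "text": text,
    "score": score,
    # clamp to [0, 5], then round to the nearest integer class
    "int_score": int(round(max(0, min(score, 5)))),
}
print(result)
```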
@@ -50,7 +50,37 @@ We added a classification head with a single regression output to [Snowflake-arc
 - Epochs: 20
 - Learning Rate: 3e-4
 - Evaluation Metric: F1 score
-
+
+**Classification report**
+
+We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 46867 Llama3-annotated samples.
+```
+              precision    recall  f1-score   support
+
+           0       0.75      0.49      0.59      5694
+           1       0.78      0.84      0.81     26512
+           2       0.57      0.61      0.59     10322
+           3       0.56      0.50      0.53      3407
+           4       0.58      0.35      0.44       807
+           5       0.33      0.01      0.02       125
+
+    accuracy                           0.71     46867
+   macro avg       0.60      0.47      0.50     46867
+weighted avg       0.71      0.71      0.71     46867
+```
+
+**Confusion matrix**
+
+We verify that the predicted educational scores are indeed close to their ground truth and are mostly impacted by the noisy annotation.
+```
+      2791  2858    45     0     0     0
+       919 22343  3180    69     1     0
+y_true   3  3225  6330   757     7     0
+         1    66  1473  1694   173     0
+         0     4    98   420   283     2
+         0     0    18    85    21     1
+              y_pred
+```
 
 
 ## Limitations
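The report and matrix above come from discretizing the regression output. A minimal sketch of how such numbers can be computed with scikit-learn, assuming the same clamp-and-round rule as the quickstart's `int_score` (the arrays here are illustrative placeholders, not the actual hold-out data):

```python
# Sketch: turn regression predictions into discrete classes 0-5 and score them.
# `y_true` and `raw_scores` are illustrative placeholders, not the hold-out set.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_true = np.array([0, 1, 1, 2, 3, 5])                  # annotator labels
raw_scores = np.array([0.4, 1.2, 0.9, 2.6, 2.4, 3.8])  # regression outputs

# same clamp-and-round rule as the quickstart's int_score
y_pred = np.clip(np.round(raw_scores), 0, 5).astype(int)

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```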
@@ -60,4 +90,5 @@ While the FineWeb-Edu classifier performs well in distinguishing high-quality ed
 - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic-looking content for the higher scores, and we recommend using score >= 3 as a threshold for data curation.
 - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
 
-The training and inference code is available on GitHub
+The training and inference code is available on GitHub:
+https://github.com/huggingface/cosmopedia/tree/main/classification
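Since the Limitations section recommends `score >= 3` as a curation threshold, a minimal filtering sketch, assuming a list of the `result` dicts produced by the quickstart snippet (the records and scores below are made up for illustration):

```python
# Curation sketch using the recommended threshold of int_score >= 3.
# `results` stands in for a list of `result` dicts from the quickstart snippet;
# the records and scores below are made up for illustration.
results = [
    {"text": "An introduction to cell biology ...", "score": 3.7, "int_score": 4},
    {"text": "Buy cheap widgets online today ...", "score": 0.3, "int_score": 0},
]

curated = [r["text"] for r in results if r["int_score"] >= 3]
print(curated)  # keeps only the cell-biology extract
```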
|