metadata

language:
  - en
license: apache-2.0

FineWeb-Edu classifier

Model summary

This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 450k annotations generated by Llama3-70B-instruct for web samples from the FineWeb dataset.

We used this classifier to build the FineWeb-Edu dataset.

How to use in transformers

To load the FineWeb-Edu classifier, use the following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
# Detach from the autograd graph before converting to NumPy
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()
record = {
    "text": text,
    "score": score,
    "int_score": int(round(max(0, min(score, 5))))
}

print(record)
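The clamping and rounding step that produces int_score can be isolated into a small helper. This is a minimal sketch; the function name to_int_score is hypothetical and not part of the released code. Note that Python's built-in round() rounds exact halves to the nearest even integer, so a raw score of 2.5 maps to 2, not 3.

```python
def to_int_score(score: float, low: int = 0, high: int = 5) -> int:
    """Clamp a raw regression score to [low, high] and round to an integer.

    Hypothetical helper mirroring the int_score expression in the snippet above.
    """
    clamped = max(low, min(score, high))
    return int(round(clamped))


# Raw regression outputs can fall slightly outside the 0-5 annotation range.
for raw in (-0.3, 2.5, 2.6, 5.7):
    print(raw, to_int_score(raw))
```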

Training

The classifier was trained on 450,000 pairs of web samples and their scores from 0 to 5, generated by Llama3. The samples were annotated for educational quality, with 0 being not educational and 5 being highly educational.

Below is the prompt used for Llama3 annotations:

We added a classification head with a single regression output to Snowflake-arctic-embed and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head. The model achieved an F1 score of 82% when converted to a binary classifier using a score threshold of 3.
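The frozen-encoder setup described above can be illustrated with a toy PyTorch model. This is only a sketch under stated assumptions: ToyRegressor is a hypothetical stand-in for the embedding model (not Snowflake-arctic-embed itself), and the AdamW optimizer and mean-squared-error loss are assumptions, since the text only specifies a single regression output, frozen embedding and encoder layers, and a learning rate of 3e-4.

```python
import torch
from torch import nn


class ToyRegressor(nn.Module):
    """Hypothetical stand-in: embeddings + encoder + single-output regression head."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, 1)  # single regression output

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = torch.tanh(self.encoder(self.embeddings(input_ids)))
        pooled = hidden.mean(dim=1)  # average over the sequence dimension
        return self.classifier(pooled).squeeze(-1)


torch.manual_seed(0)
model = ToyRegressor()

# Freeze the embedding and encoder layers so only the head is trained.
for module in (model.embeddings, model.encoder):
    for param in module.parameters():
        param.requires_grad = False

# Pass only the trainable (head) parameters to the optimizer.
# AdamW is an assumption; the learning rate comes from the text.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4)

# Random toy batch: 8 sequences of 12 token ids, with targets in 0..5.
input_ids = torch.randint(0, 100, (8, 12))
targets = torch.randint(0, 6, (8,)).float()

encoder_before = model.encoder.weight.clone()
head_before = model.classifier.weight.clone()

# One training step with a regression loss (MSE is an assumption).
loss = nn.functional.mse_loss(model(input_ids), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the encoder weights are unchanged while the head weights have moved, which is the intended effect of freezing.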

Training Details:

Model: Snowflake-arctic-embed with a classification head
Dataset: 450,000 samples from Llama3 annotations
Epochs: 20
Learning Rate: 3e-4
Evaluation Metric: F1 score
Final F1 Score on validation set: 82%
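To make the binary evaluation concrete, the sketch below thresholds scores at 3 and computes an F1 score in plain Python. Several details are assumptions: the text does not say whether the threshold is inclusive (the sketch uses score >= 3), nor which F1 variant was reported (the sketch computes F1 for the positive class), and the scores are made-up illustrative values, not data from the validation set.

```python
def binarize(scores, threshold=3.0):
    """Map scores to 1 (educational) or 0; the inclusive threshold is an assumption."""
    return [1 if s >= threshold else 0 for s in scores]


def f1_score(y_true, y_pred):
    """F1 of the positive class, computed from true/false positives and negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: LLM annotations vs. classifier regression outputs.
annotated_scores = [4, 1, 3, 0, 2, 5]
predicted_scores = [3.6, 0.8, 2.7, 0.2, 3.1, 4.4]

y_true = binarize(annotated_scores)
y_pred = binarize(predicted_scores)
f1 = f1_score(y_true, y_pred)
print(f"F1 at threshold 3: {f1:.2f}")
```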

Limitations

While the FineWeb-Edu classifier performs well at identifying high-quality educational content in the FineWeb dataset, it has some limitations:

Scope: The model's performance might change for other datasets, in particular for out-of-distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments.
Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.

The training and inference code is available on GitHub (to add).