---
language:
- en
license: apache-2.0
---

# FineWeb-Edu classifier

## Model summary

This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 450k annotations generated by [Llama3-70B-instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for web samples from the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset. We used this classifier to build the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.

### How to use in transformers

To load the FineWeb-Edu classifier, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits.squeeze(-1).float().numpy()
score = logits.item()
record = {
    "text": text,
    "score": score,
    "int_score": int(round(max(0, min(score, 5)))),
}
print(record)
```

## Training

The classifier was trained on 450,000 pairs of web samples and their scores from 0 to 5, generated by Llama3. The samples were annotated based on their educational quality, with 0 being not educational and 5 being highly educational.

Below is the prompt used for the Llama3 annotations:
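The `int_score` field above clamps the raw regression output to the 0–5 range and rounds it to the nearest integer. A minimal sketch of that conversion in isolation (the input scores below are illustrative values, not actual model outputs):

```python
def to_int_score(score: float) -> int:
    """Clamp a raw classifier score to [0, 5] and round to the nearest integer."""
    return int(round(max(0, min(score, 5))))

# Illustrative raw scores, including out-of-range values the clamp handles
for s in [-0.3, 1.49, 4.87, 6.1]:
    print(s, "->", to_int_score(s))
```

Clamping matters because the model is a regressor and can emit values slightly outside the 0–5 annotation scale; the integer score is what thresholding (e.g. keeping samples with `int_score >= 3`) operates on.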