Duplicate from HuggingFaceFW/fineweb-edu-classifier

Co-authored-by: Anton Lozhkov

Files changed (15)

.gitattributes +35 -0
README.md +97 -0
config.json +33 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
src/README.md +20 -0
src/run_edu_bert.py +52 -0
src/run_edu_bert.slurm +29 -0
src/train_edu_bert.py +118 -0
src/train_edu_bert.slurm +22 -0
tokenizer.json +0 -0
tokenizer_config.json +62 -0
training_args.bin +3 -0
utils/prompt.txt +14 -0
vocab.txt +0 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,97 @@

+---
+language:
+- en
+license: apache-2.0
+datasets:
+- HuggingFaceFW/fineweb-edu-llama3-annotations
+---
+# FineWeb-Edu classifier
+## Model summary
+This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 450k [annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) generated by [Llama3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for web samples from the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
+We used this classifier to build the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
+### How to use in transformers
+To load the FineWeb-Edu classifier, use the following code:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
+text = "This is a test sentence."
+inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
+outputs = model(**inputs)
+logits = outputs.logits.squeeze(-1).float().detach().numpy()
+score = logits.item()
+result = {
+    "text": text,
+    "score": score,
+    "int_score": int(round(max(0, min(score, 5)))),
+}
+print(result)
+# {'text': 'This is a test sentence.', 'score': 0.07964489609003067, 'int_score': 0}
+```
+## Training
+The classifier was trained on 450,000 pairs of web samples and their scores from 0 to 5, generated by Llama3. The samples were annotated based on their educational quality with 0 being not educational and 5 being highly educational.
+Below is the prompt used for the Llama3 annotations:
+    <div style="text-align: center; margin: 20px 0;">
+        <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
+    </div>
+We added a classification head with a single regression output to [Snowflake-arctic-embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head. The model achieved an F1 score of 82% when converted to a binary classifier using a score threshold of 3.
+**Training Details:**
+- Model: Snowflake-arctic-embed with a classification head
+- Dataset: 450,000 samples from Llama3 annotations
+- Epochs: 20
+- Learning Rate: 3e-4
+- Evaluation Metric: F1 score
+**Classification report**
+We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 46867 Llama3-annotated samples.
+```
+              precision    recall  f1-score   support
+           0       0.75      0.49      0.59      5694
+           1       0.78      0.84      0.81     26512
+           2       0.57      0.61      0.59     10322
+           3       0.56      0.50      0.53      3407
+           4       0.58      0.35      0.44       807
+           5       0.33      0.01      0.02       125
+    accuracy                           0.71     46867
+   macro avg       0.60      0.47      0.50     46867
+weighted avg       0.71      0.71      0.71     46867
+```
+**Confusion matrix**
+We verify that the predicted educational scores are indeed close to their ground truth, and that the errors are mostly attributable to noisy annotations.
+```
+        2791  2858    45     0     0     0
+         919 22343  3180    69     1     0
+y_true     3  3225  6330   757     7     0
+           1    66  1473  1694   173     0
+           0     4    98   420   283     2
+           0     0    18    85    21     1
+                    y_pred
+```
+## Limitations
+While the FineWeb-Edu classifier performs well at distinguishing high-quality educational content in the FineWeb dataset, it has some limitations:
+- Scope: The model's performance might change for other datasets, in particular for out-of-distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
+- Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for annotation. Biases in either can affect the classifier's judgments. It might overfit to academic-looking content for the higher scores, and we recommend using `int_score >= 3` as a threshold for data curation.
+- Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
+The training and inference code is available on GitHub:
+https://github.com/huggingface/cosmopedia/tree/main/classification
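
The F1 score of 82% quoted in the README for the threshold-3 binary classifier can be approximately reproduced from the confusion matrix above. The sketch below is not the authors' evaluation code. It assumes that rows are true scores 0 to 5, that columns are predicted scores 0 to 5, and that the reported number is a macro-averaged F1 over the two binary classes (the README does not say which averaging was used). The helper names `binarize` and `f1` are hypothetical.

```python
# Confusion matrix copied from the README
# (assumed layout: rows = y_true 0..5, columns = y_pred 0..5).
CM = [
    [2791, 2858, 45, 0, 0, 0],
    [919, 22343, 3180, 69, 1, 0],
    [3, 3225, 6330, 757, 7, 0],
    [1, 66, 1473, 1694, 173, 0],
    [0, 4, 98, 420, 283, 2],
    [0, 0, 18, 85, 21, 1],
]
THRESHOLD = 3  # scores >= 3 count as "educational"


def binarize(cm, threshold):
    """Collapse the 6x6 matrix into (tp, fp, fn, tn) at the given threshold."""
    n = len(cm)
    hi = range(threshold, n)
    lo = range(threshold)
    tp = sum(cm[t][p] for t in hi for p in hi)
    fn = sum(cm[t][p] for t in hi for p in lo)
    fp = sum(cm[t][p] for t in lo for p in hi)
    tn = sum(cm[t][p] for t in lo for p in lo)
    return tp, fp, fn, tn


def f1(tp, fp, fn):
    """F1 expressed directly in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


tp, fp, fn, tn = binarize(CM, THRESHOLD)
f1_pos = f1(tp, fp, fn)   # class "score >= 3"
f1_neg = f1(tn, fn, fp)   # class "score < 3": fp and fn swap roles
macro_f1 = (f1_pos + f1_neg) / 2

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"F1(pos)={f1_pos:.3f} F1(neg)={f1_neg:.3f} macro F1={macro_f1:.3f}")
```

Under these assumptions the macro-averaged F1 lands close to the reported 82%, while the F1 of the positive class alone is noticeably lower, so the choice of averaging matters when comparing against other classifiers.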

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "_name_or_path": "Snowflake/snowflake-arctic-embed-m",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": 0.0,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.0,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "regression",
+  "torch_dtype": "float32",
+  "transformers_version": "4.40.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:27fbfe65ce71c3fc50df7b71246c83fe9b6da329f9572d957a9355f7d290a1bc
+size 437955572

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

src/README.md ADDED

	@@ -0,0 +1,20 @@

+# Educational value classifier
+### 1. Finetune a model for educational value regression
+* edit `train_edu_bert.slurm`
+```bash
+--base_model_name="Snowflake/snowflake-arctic-embed-m" \  # BERT-like base model
+--dataset_name="HuggingFaceTB/LLM_juries_fineweb_430k_annotations" \  # Llama3-annotated educational value dataset
+--target_column="score"
+```
+* run the training script on a SLURM cluster:
+```bash
+sbatch train_edu_bert.slurm
+```
+### 2. Annotate a dataset with the educational scores predicted by the model
+```bash
+sbatch run_edu_bert.slurm
+```

src/run_edu_bert.py ADDED

	@@ -0,0 +1,52 @@

+import torch
+import argparse
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+from datasets import load_dataset
+def main(args):
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
+    model = AutoModelForSequenceClassification.from_pretrained(args.model_name, torch_dtype=torch.bfloat16)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model.to(device)
+    dataset = load_dataset(args.dataset_name, args.dataset_config,
+                           split="train", cache_dir="/scratch/cosmo/cache/", num_proc=12)
+    dataset = dataset.filter(lambda x, i: i % args.num_shards == args.shard, with_indices=True, num_proc=12)
+    def compute_scores(batch):
+        inputs = tokenizer(batch[args.text_column], return_tensors="pt", padding="longest", truncation=True).to(device)
+        with torch.no_grad():
+            outputs = model(**inputs)
+            logits = outputs.logits.squeeze(-1).float().cpu().numpy()
+        batch["score"] = logits.tolist()
+        batch["int_score"] = [int(round(max(0, min(score, 5)))) for score in logits]
+        return batch
+    dataset = dataset.map(compute_scores, batched=True, batch_size=512)
+    while True:
+        try:
+            config_name = f"{args.output_dataset_config}_{args.shard}"
+            dataset.push_to_hub(args.output_dataset_name, config_name=config_name, private=True, max_shard_size="4096MB")
+            break
+        except Exception as e:
+            print(e)
+            continue
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model_name", type=str, default="HuggingFaceFW/fineweb-edu-classifier")
+    parser.add_argument("--dataset_name", type=str, default="HuggingFaceFW/fineweb")
+    parser.add_argument("--dataset_config", type=str, default="default")
+    parser.add_argument("--output_dataset_name", type=str, default="HuggingFaceFW/fineweb-edu")
+    parser.add_argument("--output_dataset_config", type=str, default="default")
+    parser.add_argument("--text_column", type=str, default="text")
+    parser.add_argument("--shard", type=int, required=True)
+    parser.add_argument("--num_shards", type=int, required=True)
+    args = parser.parse_args()
+    main(args)
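
Two small pieces of logic in `run_edu_bert.py` are easy to check in isolation: the modulo-based sharding used in `dataset.filter`, and the clamp-then-round conversion from a raw regression score to `int_score`. The sketch below uses plain Python values instead of a `datasets.Dataset` or model outputs. The helper names `shard_indices` and `to_int_score` are illustrative and do not appear in the script.

```python
def shard_indices(num_rows, num_shards, shard):
    """Row indices kept by the filter `i % num_shards == shard`."""
    return [i for i in range(num_rows) if i % num_shards == shard]


def to_int_score(score):
    """Clamp a raw score to [0, 5], then round to the nearest integer."""
    return int(round(max(0, min(score, 5))))


# With 10 rows and 4 shards, each row lands in exactly one shard.
shards = [shard_indices(10, 4, s) for s in range(4)]
covered = sorted(i for rows in shards for i in rows)
print(shards)
print(covered == list(range(10)))

# Out-of-range scores are clamped before rounding.
for raw in (-0.3, 0.0796, 2.6, 7.2):
    print(raw, "->", to_int_score(raw))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a raw score of exactly 2.5 maps to 2 rather than 3. This rarely matters for real model outputs but is worth knowing when applying the recommended `int_score >= 3` cutoff.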

src/run_edu_bert.slurm ADDED

	@@ -0,0 +1,29 @@

+#!/bin/bash
+#SBATCH --job-name=run_edu_bert
+#SBATCH --partition hopper-prod
+#SBATCH --qos=normal
+#SBATCH --requeue
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=12
+#SBATCH --mem-per-cpu=20G
+#SBATCH --gpus=1
+#SBATCH -o %x_%j.out
+#SBATCH -e %x_%j.err
+#SBATCH --time=7-00:00:00
+#SBATCH --array=0-127%128
+set -x -e
+source ~/.bashrc
+source "$CONDA_PREFIX/etc/profile.d/conda.sh"
+source activate pytorch
+python run_edu_bert.py \
+    --model_name="HuggingFaceFW/fineweb-edu-classifier" \
+    --dataset_name="HuggingFaceFW/fineweb" \
+    --dataset_config="CC-MAIN-2019-04" \
+    --output_dataset_name="HuggingFaceFW/fineweb-edu-annotations" \
+    --output_dataset_config="CC-MAIN-2019-04" \
+    --text_column="text" \
+    --shard ${SLURM_ARRAY_TASK_ID} \
+    --num_shards 128

src/train_edu_bert.py ADDED

	@@ -0,0 +1,118 @@

+from transformers import (
+    AutoTokenizer,
+    DataCollatorWithPadding,
+    TrainingArguments,
+    Trainer,
+    AutoModelForSequenceClassification,
+)
+from datasets import load_dataset, ClassLabel
+import numpy as np
+import evaluate
+import argparse
+import os
+from sklearn.metrics import classification_report, confusion_matrix
+def compute_metrics(eval_pred):
+    precision_metric = evaluate.load("precision")
+    recall_metric = evaluate.load("recall")
+    f1_metric = evaluate.load("f1")
+    accuracy_metric = evaluate.load("accuracy")
+    logits, labels = eval_pred
+    preds = np.round(logits.squeeze()).clip(0, 5).astype(int)
+    labels = np.round(labels.squeeze()).astype(int)
+    precision = precision_metric.compute(
+        predictions=preds, references=labels, average="macro"
+    )["precision"]
+    recall = recall_metric.compute(
+        predictions=preds, references=labels, average="macro"
+    )["recall"]
+    f1 = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
+    accuracy = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
+    report = classification_report(labels, preds)
+    cm = confusion_matrix(labels, preds)
+    print("Validation Report:\n" + report)
+    print("Confusion Matrix:\n" + str(cm))
+    return {
+        "precision": precision,
+        "recall": recall,
+        "f1_macro": f1,
+        "accuracy": accuracy,
+    }
+def main(args):
+    dataset = load_dataset(
+        args.dataset_name, split="train", cache_dir="/scratch/cosmo/cache/", num_proc=8
+    )
+    dataset = dataset.map(
+        lambda x: {args.target_column: np.clip(int(x[args.target_column]), 0, 5)}, num_proc=8
+    )
+    dataset = dataset.cast_column(
+        args.target_column, ClassLabel(names=[str(i) for i in range(6)])
+    )
+    dataset = dataset.train_test_split(
+        train_size=0.9, seed=42, stratify_by_column=args.target_column
+    )
+    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name)
+    def preprocess(examples):
+        batch = tokenizer(examples["text"], truncation=True)
+        batch["labels"] = np.float32(examples[args.target_column])
+        return batch
+    dataset = dataset.map(preprocess, batched=True)
+    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+    model = AutoModelForSequenceClassification.from_pretrained(args.base_model_name, num_labels=1, classifier_dropout=0.0, hidden_dropout_prob=0.0)
+    for param in model.bert.embeddings.parameters():
+        param.requires_grad = False
+    for param in model.bert.encoder.parameters():
+        param.requires_grad = False
+    training_args = TrainingArguments(
+        output_dir=args.checkpoint_dir,
+        evaluation_strategy="steps",
+        save_strategy="steps",
+        eval_steps=1000,
+        save_steps=1000,
+        logging_steps=100,
+        learning_rate=3e-4,
+        num_train_epochs=20,
+        seed=0,
+        per_device_train_batch_size=256,
+        per_device_eval_batch_size=128,
+        load_best_model_at_end=True,
+        metric_for_best_model="f1_macro",
+        greater_is_better=True,
+        bf16=True,
+    )
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset["train"],
+        eval_dataset=dataset["test"],
+        tokenizer=tokenizer,
+        data_collator=data_collator,
+        compute_metrics=compute_metrics,
+    )
+    trainer.train()
+    trainer.save_model(os.path.join(args.checkpoint_dir, "final"))
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--base_model_name", type=str, default="Snowflake/snowflake-arctic-embed-m")
+    parser.add_argument("--dataset_name", type=str, default="HuggingFaceFW/fineweb-edu-llama3-annotations")
+    parser.add_argument("--target_column", type=str, default="score")
+    parser.add_argument("--checkpoint_dir", type=str, default="/fsx/anton/cosmopedia/edu_score/bert_snowflake_regression")
+    args = parser.parse_args()
+    main(args)
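
The training script freezes `model.bert.embeddings` and `model.bert.encoder` but leaves every other submodule untouched. Assuming the standard `BertForSequenceClassification` layout in transformers (a pooler with one `hidden_size` by `hidden_size` dense layer, followed by a linear classifier with `num_labels` outputs), a rough count based on the values in `config.json` suggests that the pooler stays trainable alongside the regression head. This is back-of-the-envelope arithmetic under that assumption, not output from the actual model.

```python
# Values taken from config.json in this commit.
hidden_size = 768
num_labels = 1  # single regression output


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Assumed layout: pooler dense layer (hidden -> hidden), then classifier (hidden -> 1).
pooler = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
trainable = pooler + classifier

print(f"pooler:     {pooler:,}")
print(f"classifier: {classifier:,}")
print(f"trainable:  {trainable:,}")
```

To confirm this on the real model, one could sum `p.numel()` over the parameters that still have `requires_grad=True` after the two freezing loops.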

src/train_edu_bert.slurm ADDED

	@@ -0,0 +1,22 @@

+#!/bin/bash
+#SBATCH --job-name=train_edu_bert
+#SBATCH --partition hopper-prod
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem-per-cpu=20G
+#SBATCH --gpus=1
+#SBATCH -o %x_%j.out
+#SBATCH -e %x_%j.err
+#SBATCH --time=1-00:00:00
+set -x -e
+source ~/.bashrc
+source "$CONDA_PREFIX/etc/profile.d/conda.sh"
+source activate pytorch
+python train_edu_bert.py \
+    --base_model_name="Snowflake/snowflake-arctic-embed-m" \
+    --dataset_name="HuggingFaceFW/fineweb-edu-llama3-annotations" \
+    --target_column="score" \
+    --checkpoint_dir="/fsx/anton/cosmopedia/edu_score/snowflake_regression_median_jury"

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:efc02ff3cbe38771032be7435dde1ef98e3d99cdd38f1fb9a9c2f710ef949f0b
+size 5048

utils/prompt.txt ADDED

	@@ -0,0 +1,14 @@

+Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
+- Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and promotional material.
+- Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style.
+- Award a third point if the extract is appropriate for educational use and introduces key concepts relevant to school curricula. It is coherent though it may not be comprehensive or could include some extraneous information. It may resemble an introductory section of a textbook or a basic tutorial that is suitable for learning but has notable limitations like treating concepts that are too complex for grade school students.
+- Grant a fourth point if the extract highly relevant and beneficial for educational purposes for a level not higher than grade school, exhibiting a clear and consistent writing style. It could be similar to a chapter from a textbook or a tutorial, offering substantial educational content, including exercises and solutions, with minimal irrelevant information, and the concepts aren't too advanced for grade school students. The content is coherent, focused, and valuable for structured learning.
+- Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for teaching either at primary school or grade school. It follows detailed reasoning, the writing style is easy to follow and offers profound and thorough insights into the subject matter, devoid of any non-educational or complex content.
+The extract:
+<EXAMPLE>.
+After examining the extract:
+- Briefly justify your total score, up to 100 words.
+- Conclude with the score using the format: "Educational score:  <total points>"
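
The prompt asks the annotator model to finish with `Educational score: <total points>`. A minimal sketch of extracting that score from a response follows. The function name `parse_edu_score`, the regular expression, and the sample responses are assumptions made for illustration; the authors' actual parsing code is not part of this commit.

```python
import re

# Tolerates any amount of whitespace after the colon.
SCORE_RE = re.compile(r"Educational score:\s*(\d+)")


def parse_edu_score(response):
    """Return the final 0-5 score in an annotator response, or None if absent or invalid."""
    matches = SCORE_RE.findall(response)
    if not matches:
        return None
    score = int(matches[-1])  # use the last occurrence as the conclusion
    return score if 0 <= score <= 5 else None


# Hypothetical responses, not real model output.
print(parse_edu_score("A coherent basic tutorial.\nEducational score:  3"))
print(parse_edu_score("The model forgot to give a score."))
print(parse_edu_score("Educational score: 9"))
```

Returning `None` for missing or out-of-range scores lets a caller drop unparseable annotations instead of silently assigning them a value.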

vocab.txt ADDED

The diff for this file is too large to render. See raw diff