davanstrien HF staff anton-l HF staff committed on
Commit
6d16003
0 Parent(s):

Duplicate from HuggingFaceFW/fineweb-edu-classifier

Co-authored-by: Anton Lozhkov <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ datasets:
+ - HuggingFaceFW/fineweb-edu-llama3-annotations
+ ---
+
+ # FineWeb-Edu classifier
+
+ ## Model summary
+ This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 450k [annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) generated by [Llama3-70B-instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for web samples from the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
+
+ We used this classifier to build the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
+ ### How to use in transformers
+ To load the FineWeb-Edu classifier, use the following code:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
+ model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
+
+ text = "This is a test sentence."
+ inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
+ outputs = model(**inputs)
+ logits = outputs.logits.squeeze(-1).float().detach().numpy()
+ score = logits.item()
+ result = {
+     "text": text,
+     "score": score,
+     "int_score": int(round(max(0, min(score, 5)))),
+ }
+
+ print(result)
+ # {'text': 'This is a test sentence.', 'score': 0.07964489609003067, 'int_score': 0}
+ ```
+
+ ## Training
+ The classifier was trained on 450,000 pairs of web samples and their scores from 0 to 5, generated by Llama3. The samples were annotated based on their educational quality, with 0 being not educational and 5 being highly educational.
+
+ Below is the prompt used for Llama3 annotations:
+ <div style="text-align: center; margin: 20px 0;">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
+ </div>
+
+ We added a classification head with a single regression output to [Snowflake-arctic-embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head. The model achieved an F1 score of 82% when converted to a binary classifier using a score threshold of 3.
+
+ **Training Details:**
+
+ - Model: Snowflake-arctic-embed with a classification head
+ - Dataset: 450,000 samples from Llama3 annotations
+ - Epochs: 20
+ - Learning Rate: 3e-4
+ - Evaluation Metric: F1 score
+
+ **Classification report**
+
+ We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 46867 Llama3-annotated samples.
+ ```
+               precision    recall  f1-score   support
+
+            0       0.75      0.49      0.59      5694
+            1       0.78      0.84      0.81     26512
+            2       0.57      0.61      0.59     10322
+            3       0.56      0.50      0.53      3407
+            4       0.58      0.35      0.44       807
+            5       0.33      0.01      0.02       125
+
+     accuracy                           0.71     46867
+    macro avg       0.60      0.47      0.50     46867
+ weighted avg       0.71      0.71      0.71     46867
+ ```
+
+ **Confusion matrix**
+
+ We verify that the predicted educational scores are indeed close to their ground truth, and are mostly impacted by the noisy annotation.
+ ```
+            2791   2858     45      0      0      0
+             919  22343   3180     69      1      0
+ y_true        3   3225   6330    757      7      0
+               1     66   1473   1694    173      0
+               0      4     98    420    283      2
+               0      0     18     85     21      1
+                              y_pred
+ ```
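
As a sanity check on the binary F1 figure quoted in the Training section, the confusion matrix can be collapsed at a score threshold of 3. The sketch below rests on two assumptions that the README does not state: rows are true labels and columns are predictions (this matches the per-class support in the classification report), and the reported figure is a macro average over the two resulting classes. The helper name `binary_f1_scores` is hypothetical.

```python
# Confusion matrix copied from the README; rows are assumed to be y_true, columns y_pred.
CONFUSION = [
    [2791, 2858, 45, 0, 0, 0],
    [919, 22343, 3180, 69, 1, 0],
    [3, 3225, 6330, 757, 7, 0],
    [1, 66, 1473, 1694, 173, 0],
    [0, 4, 98, 420, 283, 2],
    [0, 0, 18, 85, 21, 1],
]


def binary_f1_scores(matrix, threshold=3):
    """Collapse a multi-class confusion matrix at `threshold`.

    Returns (f1_positive, f1_negative, macro_f1), where "positive" means label >= threshold.
    """
    tp = fp = fn = tn = 0
    for true_label, row in enumerate(matrix):
        for pred_label, count in enumerate(row):
            true_pos = true_label >= threshold
            pred_pos = pred_label >= threshold
            if true_pos and pred_pos:
                tp += count
            elif pred_pos:
                fp += count
            elif true_pos:
                fn += count
            else:
                tn += count
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    # For the negative class, the roles of fp and fn swap, so their sum is unchanged.
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    return f1_pos, f1_neg, (f1_pos + f1_neg) / 2


f1_pos, f1_neg, macro_f1 = binary_f1_scores(CONFUSION)
print(f"F1 (score >= 3): {f1_pos:.2f}, F1 (score < 3): {f1_neg:.2f}, macro F1: {macro_f1:.2f}")
```

Under these assumptions the macro F1 comes out at roughly 0.83, in line with the reported 82%, while the F1 on the positive class alone is about 0.68.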
+
+
+ ## Limitations
+ While the FineWeb-Edu classifier performs well in distinguishing high-quality educational content for the FineWeb dataset, there are some limitations:
+
+ - Scope: The model's performance might change for other datasets, in particular for out-of-distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
+ - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic-looking content for the higher scores, and we recommend using int_score >= 3 as a threshold for data curation.
+ - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
+
+ The training and inference code is available on GitHub:
+ https://github.com/huggingface/cosmopedia/tree/main/classification
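
The recommended curation rule (`int_score >= 3`) can be sketched in a few lines. This is a minimal illustration, not the project's pipeline: `to_int_score` mirrors the clamping and rounding expression from the usage example, while `keep_for_curation` and the sample scores are hypothetical.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round it to an integer."""
    # Python's round() rounds exact halves to the nearest even integer, so 2.5 -> 2.
    return int(round(max(0, min(score, 5))))


def keep_for_curation(score: float, threshold: int = 3) -> bool:
    """Return True when the discretized score meets the recommended threshold."""
    return to_int_score(score) >= threshold


# Hypothetical raw scores, for illustration only.
raw_scores = [-0.4, 0.08, 2.49, 2.5, 2.51, 3.7, 6.2]
print([to_int_score(s) for s in raw_scores])
print([s for s in raw_scores if keep_for_curation(s)])
```

Because of the rounding step, a sample passes the filter once its raw score rises above 2.5, not only when it reaches 3.0.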
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_name_or_path": "Snowflake/snowflake-arctic-embed-m",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": 0.0,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "regression",
+   "torch_dtype": "float32",
+   "transformers_version": "4.40.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:27fbfe65ce71c3fc50df7b71246c83fe9b6da329f9572d957a9355f7d290a1bc
+ size 437955572
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
src/README.md ADDED
@@ -0,0 +1,20 @@
+ # Educational value classifier
+
+ ### 1. Fine-tune a model for educational value regression
+
+ * edit `train_edu_bert.slurm`
+ ```bash
+ --base_model_name="Snowflake/snowflake-arctic-embed-m" \ # BERT-like base model
+ --dataset_name="HuggingFaceTB/LLM_juries_fineweb_430k_annotations" \ # Llama3-annotated educational value dataset
+ --target_column="score"
+ ```
+ * run the training script on a SLURM cluster:
+ ```bash
+ sbatch train_edu_bert.slurm
+ ```
+
+ ### 2. Annotate a dataset with the educational scores predicted by the model
+
+ ```bash
+ sbatch run_edu_bert.slurm
+ ```
src/run_edu_bert.py ADDED
@@ -0,0 +1,52 @@
+ import torch
+ import argparse
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ from datasets import load_dataset
+
+
+ def main(args):
+     tokenizer = AutoTokenizer.from_pretrained(args.model_name)
+     model = AutoModelForSequenceClassification.from_pretrained(args.model_name, torch_dtype=torch.bfloat16)
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model.to(device)
+
+     dataset = load_dataset(args.dataset_name, args.dataset_config,
+                            split="train", cache_dir="/scratch/cosmo/cache/", num_proc=12)
+     dataset = dataset.filter(lambda x, i: i % args.num_shards == args.shard, with_indices=True, num_proc=12)  # keep only this shard's rows
+
+     def compute_scores(batch):
+         inputs = tokenizer(batch[args.text_column], return_tensors="pt", padding="longest", truncation=True).to(device)
+         with torch.no_grad():
+             outputs = model(**inputs)
+             logits = outputs.logits.squeeze(-1).float().cpu().numpy()
+
+         batch["score"] = logits.tolist()
+         batch["int_score"] = [int(round(max(0, min(score, 5)))) for score in logits]  # clamp to [0, 5], then round
+         return batch
+
+     dataset = dataset.map(compute_scores, batched=True, batch_size=512)
+
+     while True:  # retry the upload until it succeeds
+         try:
+             config_name = f"{args.output_dataset_config}_{args.shard}"
+             dataset.push_to_hub(args.output_dataset_name, config_name=config_name, private=True, max_shard_size="4096MB")
+             break
+         except Exception as e:
+             print(e)
+             continue
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument("--model_name", type=str, default="HuggingFaceFW/fineweb-edu-classifier")
+     parser.add_argument("--dataset_name", type=str, default="HuggingFaceFW/fineweb")
+     parser.add_argument("--dataset_config", type=str, default="default")
+     parser.add_argument("--output_dataset_name", type=str, default="HuggingFaceFW/fineweb-edu")
+     parser.add_argument("--output_dataset_config", type=str, default="default")
+     parser.add_argument("--text_column", type=str, default="text")
+     parser.add_argument("--shard", type=int, required=True)
+     parser.add_argument("--num_shards", type=int, required=True)
+
+     args = parser.parse_args()
+     main(args)
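
The modulo filter in `main` assigns each row to exactly one shard. A minimal pure-Python sketch of that partitioning follows; `shard_indices` is a hypothetical helper and the sizes are made up for illustration (the script applies the same test through `dataset.filter` with `with_indices=True`).

```python
def shard_indices(num_rows: int, num_shards: int, shard: int) -> list[int]:
    """Return the row indices kept by the test `i % num_shards == shard`."""
    return [i for i in range(num_rows) if i % num_shards == shard]


num_rows, num_shards = 10, 4  # hypothetical sizes, for illustration only
shards = [shard_indices(num_rows, num_shards, s) for s in range(num_shards)]
for s, rows in enumerate(shards):
    print(s, rows)

# Every row lands in exactly one shard, so shards can be scored independently.
assert sorted(i for rows in shards for i in rows) == list(range(num_rows))
```

With the SLURM array `0-127` and `--num_shards 128`, each array task would handle one such interleaved slice of the dataset.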
src/run_edu_bert.slurm ADDED
@@ -0,0 +1,29 @@
+ #!/bin/bash
+ #SBATCH --job-name=run_edu_bert
+ #SBATCH --partition hopper-prod
+ #SBATCH --qos=normal
+ #SBATCH --requeue
+ #SBATCH --nodes=1
+ #SBATCH --ntasks-per-node=1
+ #SBATCH --cpus-per-task=12
+ #SBATCH --mem-per-cpu=20G
+ #SBATCH --gpus=1
+ #SBATCH -o %x_%j.out
+ #SBATCH -e %x_%j.err
+ #SBATCH --time=7-00:00:00
+ #SBATCH --array=0-127%128
+
+ set -x -e
+ source ~/.bashrc
+ source "$CONDA_PREFIX/etc/profile.d/conda.sh"
+ source activate pytorch
+
+ python run_edu_bert.py \
+     --model_name="HuggingFaceFW/fineweb-edu-classifier" \
+     --dataset_name="HuggingFaceFW/fineweb" \
+     --dataset_config="CC-MAIN-2019-04" \
+     --output_dataset_name="HuggingFaceFW/fineweb-edu-annotations" \
+     --output_dataset_config="CC-MAIN-2019-04" \
+     --text_column="text" \
+     --shard ${SLURM_ARRAY_TASK_ID} \
+     --num_shards 128
src/train_edu_bert.py ADDED
@@ -0,0 +1,118 @@
+ from transformers import (
+     AutoTokenizer,
+     DataCollatorWithPadding,
+     TrainingArguments,
+     Trainer,
+     AutoModelForSequenceClassification,
+ )
+ from datasets import load_dataset, ClassLabel
+ import numpy as np
+ import evaluate
+ import argparse
+ import os
+ from sklearn.metrics import classification_report, confusion_matrix
+
+
+ def compute_metrics(eval_pred):
+     precision_metric = evaluate.load("precision")
+     recall_metric = evaluate.load("recall")
+     f1_metric = evaluate.load("f1")
+     accuracy_metric = evaluate.load("accuracy")
+
+     logits, labels = eval_pred
+     preds = np.round(logits.squeeze()).clip(0, 5).astype(int)  # discretize regression outputs to classes 0..5
+     labels = np.round(labels.squeeze()).astype(int)
+     precision = precision_metric.compute(
+         predictions=preds, references=labels, average="macro"
+     )["precision"]
+     recall = recall_metric.compute(
+         predictions=preds, references=labels, average="macro"
+     )["recall"]
+     f1 = f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"]
+     accuracy = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
+
+     report = classification_report(labels, preds)
+     cm = confusion_matrix(labels, preds)
+     print("Validation Report:\n" + report)
+     print("Confusion Matrix:\n" + str(cm))
+
+     return {
+         "precision": precision,
+         "recall": recall,
+         "f1_macro": f1,
+         "accuracy": accuracy,
+     }
+
+
+ def main(args):
+     dataset = load_dataset(
+         args.dataset_name, split="train", cache_dir="/scratch/cosmo/cache/", num_proc=8
+     )
+     dataset = dataset.map(
+         lambda x: {args.target_column: np.clip(int(x[args.target_column]), 0, 5)}, num_proc=8
+     )
+
+     dataset = dataset.cast_column(
+         args.target_column, ClassLabel(names=[str(i) for i in range(6)])
+     )
+     dataset = dataset.train_test_split(
+         train_size=0.9, seed=42, stratify_by_column=args.target_column
+     )
+
+     tokenizer = AutoTokenizer.from_pretrained(args.base_model_name)
+
+     def preprocess(examples):
+         batch = tokenizer(examples["text"], truncation=True)
+         batch["labels"] = np.float32(examples[args.target_column])  # float labels for regression
+         return batch
+
+     dataset = dataset.map(preprocess, batched=True)
+     data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+     model = AutoModelForSequenceClassification.from_pretrained(args.base_model_name, num_labels=1, classifier_dropout=0.0, hidden_dropout_prob=0.0)
+
+     for param in model.bert.embeddings.parameters():  # freeze the embedding layers
+         param.requires_grad = False
+     for param in model.bert.encoder.parameters():  # freeze the encoder layers
+         param.requires_grad = False
+
+     training_args = TrainingArguments(
+         output_dir=args.checkpoint_dir,
+         evaluation_strategy="steps",
+         save_strategy="steps",
+         eval_steps=1000,
+         save_steps=1000,
+         logging_steps=100,
+         learning_rate=3e-4,
+         num_train_epochs=20,
+         seed=0,
+         per_device_train_batch_size=256,
+         per_device_eval_batch_size=128,
+         load_best_model_at_end=True,
+         metric_for_best_model="f1_macro",
+         greater_is_better=True,
+         bf16=True,
+     )
+
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset["train"],
+         eval_dataset=dataset["test"],
+         tokenizer=tokenizer,
+         data_collator=data_collator,
+         compute_metrics=compute_metrics,
+     )
+
+     trainer.train()
+     trainer.save_model(os.path.join(args.checkpoint_dir, "final"))
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--base_model_name", type=str, default="Snowflake/snowflake-arctic-embed-m")
+     parser.add_argument("--dataset_name", type=str, default="HuggingFaceFW/fineweb-edu-llama3-annotations")
+     parser.add_argument("--target_column", type=str, default="score")
+     parser.add_argument("--checkpoint_dir", type=str, default="/fsx/anton/cosmopedia/edu_score/bert_snowflake_regression")
+     args = parser.parse_args()
+
+     main(args)
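
Freezing `model.bert.embeddings` and `model.bert.encoder` leaves only a small number of weights trainable. The back-of-the-envelope count below assumes the standard `BertForSequenceClassification` layout (a pooler dense layer followed by a linear classifier), which this commit does not spell out; `hidden_size` comes from `config.json` and the single output from `num_labels=1`.

```python
hidden_size = 768  # "hidden_size" in config.json
num_labels = 1     # single regression output (num_labels=1 in the training script)

# Assumed layout: pooler = Linear(hidden, hidden), classifier = Linear(hidden, num_labels).
pooler_params = hidden_size * hidden_size + hidden_size      # weights + bias
classifier_params = hidden_size * num_labels + num_labels    # weights + bias
trainable_params = pooler_params + classifier_params

print(f"pooler: {pooler_params:,}")
print(f"classifier: {classifier_params:,}")
print(f"trainable total: {trainable_params:,}")
```

If that layout holds, the pooler stays trainable alongside the regression head, because the two loops freeze only the embeddings and the encoder.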
src/train_edu_bert.slurm ADDED
@@ -0,0 +1,22 @@
+ #!/bin/bash
+ #SBATCH --job-name=train_edu_bert
+ #SBATCH --partition hopper-prod
+ #SBATCH --nodes=1
+ #SBATCH --ntasks-per-node=1
+ #SBATCH --cpus-per-task=16
+ #SBATCH --mem-per-cpu=20G
+ #SBATCH --gpus=1
+ #SBATCH -o %x_%j.out
+ #SBATCH -e %x_%j.err
+ #SBATCH --time=1-00:00:00
+
+ set -x -e
+ source ~/.bashrc
+ source "$CONDA_PREFIX/etc/profile.d/conda.sh"
+ source activate pytorch
+
+ python train_edu_bert.py \
+     --base_model_name="Snowflake/snowflake-arctic-embed-m" \
+     --dataset_name="HuggingFaceFW/fineweb-edu-llama3-annotations" \
+     --target_column="score" \
+     --checkpoint_dir="/fsx/anton/cosmopedia/edu_score/snowflake_regression_median_jury"
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:efc02ff3cbe38771032be7435dde1ef98e3d99cdd38f1fb9a9c2f710ef949f0b
+ size 5048
utils/prompt.txt ADDED
@@ -0,0 +1,14 @@
+ Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
+
+ - Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and promotional material.
+ - Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style.
+ - Award a third point if the extract is appropriate for educational use and introduces key concepts relevant to school curricula. It is coherent though it may not be comprehensive or could include some extraneous information. It may resemble an introductory section of a textbook or a basic tutorial that is suitable for learning but has notable limitations like treating concepts that are too complex for grade school students.
+ - Grant a fourth point if the extract is highly relevant and beneficial for educational purposes for a level not higher than grade school, exhibiting a clear and consistent writing style. It could be similar to a chapter from a textbook or a tutorial, offering substantial educational content, including exercises and solutions, with minimal irrelevant information, and the concepts aren't too advanced for grade school students. The content is coherent, focused, and valuable for structured learning.
+ - Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for teaching either at primary school or grade school. It follows detailed reasoning, the writing style is easy to follow and offers profound and thorough insights into the subject matter, devoid of any non-educational or complex content.
+
+ The extract:
+ <EXAMPLE>.
+
+ After examining the extract:
+ - Briefly justify your total score, up to 100 words.
+ - Conclude with the score using the format: "Educational score: <total points>"
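
Since the prompt asks the model to end with a fixed phrase, the numeric label can be pulled out of a response with a regular expression. The sketch below is an assumption about how such parsing might look, not the code used to build the annotations; `parse_educational_score` and the sample responses are hypothetical.

```python
import re
from typing import Optional

# Matches the closing phrase requested by the prompt, e.g. "Educational score: 4".
SCORE_PATTERN = re.compile(r"Educational score:\s*(\d+)")


def parse_educational_score(response: str) -> Optional[int]:
    """Return the last 'Educational score: N' value, or None if absent or outside 0..5."""
    matches = SCORE_PATTERN.findall(response)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 0 <= score <= 5 else None


# Hypothetical model responses, for illustration only.
sample = "The extract explains photosynthesis clearly for young readers. Educational score: 4"
print(parse_educational_score(sample))
print(parse_educational_score("No score given."))
```

Taking the last match guards against responses that mention the phrase in the justification before stating the final score.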
vocab.txt ADDED
The diff for this file is too large to render. See raw diff