
training_nen

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-english-only-examples dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training used the run_clm.py script from the transformers library and was distributed across two NVIDIA Quadro RTX 6000 GPUs:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
--nproc_per_node=2 run_clm.py --output_dir="./training_nen" \
--model_type="gpt2" \
--config_name="./training" \
--tokenizer_name="./training" \
--dataset_name="RaiBP/openwebtext2-first-30-chunks-english-only-examples" \
--do_train \
--per_device_train_batch_size 8 \
--block_size="1024" \
--learning_rate="5e-3" --warmup_steps="1000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="1" \
--logging_steps="500" \
--save_steps="5000" --preprocessing_num_workers="16" \
--gradient_accumulation_steps="4" --report_to="tensorboard" \
--logging_dir="./log_nen"  > command_nen_log.log 2>&1 &
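To illustrate what the learning-rate flags in the command imply, here is a minimal sketch of a linear schedule with warmup, matching the linear scheduler type listed under the training hyperparameters: the rate rises linearly from 0 to the peak of 5e-3 over the first 1000 steps, then decays linearly to 0. The function name `linear_lr` and the `total_steps` value are hypothetical. The actual number of optimizer steps depends on the dataset size and is not stated in this card. The sketch mirrors the schedule's shape and does not call into transformers.

```python
def linear_lr(step: int,
              peak_lr: float = 5e-3,      # --learning_rate
              warmup_steps: int = 1000,   # --warmup_steps
              total_steps: int = 10_000   # hypothetical, not given in this card
              ) -> float:
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 500, 1000, 5500, 10_000):
        print(s, linear_lr(s))
```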

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.005
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 1.0
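
The total batch sizes in the list follow from the per-device values, the number of GPUs, and the gradient accumulation steps. The short check below reproduces that arithmetic. The tokens-per-step figure is an added illustration that assumes every training sequence is a full block of 1024 tokens (the `--block_size` value from the command).

```python
# Values taken from the hyperparameter list and the training command.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4
block_size = 1024  # tokens per sequence, assuming full blocks

# Sequences contributing to one optimizer step.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# Evaluation does not accumulate gradients, so only the devices multiply.
total_eval_batch_size = per_device_eval_batch_size * num_devices

# Tokens seen per optimizer step under the full-block assumption.
tokens_per_optimizer_step = total_train_batch_size * block_size

print(total_train_batch_size, total_eval_batch_size, tokens_per_optimizer_step)
```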

Training results

Evaluation results

Perplexity was measured on 2,000 randomly sampled examples from the target language's Wikipedia dataset, using the code provided in the perplexity docs with a stride of 512 tokens. The baseline is the result of evaluating OpenAI's GPT-2 on the same examples.

Target language   PPL                   Baseline PPL
en                42.175106048583984    26.562532424926758
de                225.5620574951172     56.907039642333984
es                184.9262237548828     55.592445373535156
fr                170.0771026611328     49.69472885131836
it                238.36192321777344    75.95120239257812
pt                203.595947265625      -
nl                225.9720001220703     -
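
One way to read these numbers is as a ratio of this model's perplexity to the baseline's for each language. The sketch below computes that ratio from the reported values only. The names `results` and `ppl_ratio` are illustrative, and languages without a reported baseline are left as `None` rather than filled in.

```python
# (model PPL, baseline PPL) as reported above; None means no baseline was given.
results = {
    "en": (42.175106048583984, 26.562532424926758),
    "de": (225.5620574951172, 56.907039642333984),
    "es": (184.9262237548828, 55.592445373535156),
    "fr": (170.0771026611328, 49.69472885131836),
    "it": (238.36192321777344, 75.95120239257812),
    "pt": (203.595947265625, None),
    "nl": (225.9720001220703, None),
}


def ppl_ratio(ppl, baseline):
    """Model PPL divided by baseline PPL, or None when no baseline exists."""
    return None if baseline is None else ppl / baseline


ratios = {lang: ppl_ratio(ppl, base) for lang, (ppl, base) in results.items()}

for lang, ratio in ratios.items():
    shown = "n/a" if ratio is None else f"{ratio:.2f}x"
    print(f"{lang}: {shown}")
```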

The following script was used for evaluation:

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random

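# Set the seeds for reproducibility. The example indices below are drawn with
# NumPy, so NumPy's generator has to be seeded as well as Python's.
random.seed(42)
np.random.seed(42)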

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-non-english"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de" # change here for other languages

dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
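
To make the windowing in the loop above easier to follow, here is a small standalone sketch that reproduces only the index arithmetic, without loading a model. The helper name `sliding_windows` is hypothetical, and the default `max_length` of 1024 is an assumption based on the `--block_size` used in training (the script reads the real value from `model.config.n_positions`). Each tuple is `(begin_loc, end_loc, trg_len)`: the window spans `begin_loc:end_loc`, and only its last `trg_len` tokens are kept as labels.

```python
def sliding_windows(seq_len: int, max_length: int = 1024, stride: int = 512):
    """Return (begin_loc, end_loc, trg_len) for each evaluation window."""
    windows = []
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        # Tokens not covered by an earlier window; the rest is context only.
        trg_len = end_loc - prev_end_loc
        windows.append((begin_loc, end_loc, trg_len))
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return windows


if __name__ == "__main__":
    for window in sliding_windows(2500):
        print(window)
```

Because the `trg_len` values sum to the sequence length, every token is a label in exactly one window. Note that the script then averages the per-window losses with equal weight, so a shorter final window counts as much as a full one.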

Framework versions

  • Transformers 4.37.0.dev0
  • Pytorch 1.13.0
  • Datasets 2.16.0
  • Tokenizers 0.15.0

Model size: 124M params (Safetensors, tensor type F32)