# This notebook by Zack DeSario is a remix of several sources, as all good code is.
The code is mainly from [@DigitalSreeni](https://youtu.be/DxygPxcfW_I). Their code cites the [official Hugging Face tutorial](https://huggingface.co/transformers/v2.2.0/pretrained_models.html).



In [None]:
!pip install transformers
!pip install torch
!pip install transformers[torch]

In [1]:
import os
import re
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments


In [2]:
from huggingface_hub import notebook_login
notebook_login()


Helper functions to read text from the files in a directory. The first source also handles pdf and docx files through `read_pdf` and `read_word`, but those helpers are not defined in this notebook, so only txt files are read here.

In [3]:
### THIS CODE IS 100% WRITTEN BY THE FIRST SOURCE.  VERY HELPFUL FUNCTIONS, TY.
# Functions to read different file types
def read_txt(file_path):
    with open(file_path, "r") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        # NOTE: the source also had .pdf and .docx branches that call read_pdf and
        # read_word. Those helpers are not defined here and would raise a NameError,
        # so only .txt files are read. Define the helpers to restore those branches.
        if filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text


# ANOTHER HELPER FUNCTION
def generate_response(model, tokenizer, prompt, max_length=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


## Now load the base model and test whether it already does what we need.

In [4]:
# Set up the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")  # also try gpt2, gpt2-large, or gpt2-xl
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # use the same checkpoint name as the tokenizer

In [5]:
prompt = 'Write a script for the TV show Futurama about Fry getting stuck in a hole.'
response = generate_response(model, tokenizer, prompt, max_length=200)
print(response)

Write a script for the TV show Futurama about Fry getting stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.

Fry: I'm stuck in a hole.




In [16]:
prompt = 'Who is Fry TV show Futurama?  Describe them in detail.'
response = generate_response(model, tokenizer, prompt, max_length=200)
print(response)

Who is Fry TV show Futurama?  Describe them in detail.

Futurama is a comedy show that is based on the comic strip "Futurama" by David Lynch. It is a parody of the popular television series "Futurama" and is produced by Fox Television Studios. The show is produced by Fox Television Studios and is produced by Fox Television Studios, a division of 21st Century Fox.

Futurama is a comedy show that is based on the comic strip "Futurama" by David Lynch. It is a parody of the popular television series "Futurama" and is produced by Fox Television Studios. The show is produced by Fox Television Studios and is produced by Fox Television Studios, a division of 21st Century Fox. What is the name of the show?

Futurama is a comedy show that is based on the comic strip "Futurama" by David Lynch. It
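
The looping text in the two outputs above is a common symptom of greedy decoding, which `generate` falls back to when no sampling options are passed. A minimal sketch of a sampling variant of `generate_response` follows. The function name `generate_sampled_response` and the specific values for `temperature`, `top_k`, `top_p`, and `no_repeat_ngram_size` are illustrative assumptions, not settings from the original sources.

```python
def generate_sampled_response(model, tokenizer, prompt, max_length=200,
                              temperature=0.8, top_k=50, top_p=0.95):
    """Generate text with sampling instead of greedy decoding (illustrative values)."""
    # Calling the tokenizer directly returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")

    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        do_sample=True,            # sample from the distribution instead of taking the argmax
        temperature=temperature,   # values below 1.0 make sampling more conservative
        top_k=top_k,               # keep only the top_k most likely next tokens
        top_p=top_p,               # nucleus sampling: keep the smallest set with mass >= top_p
        no_repeat_ngram_size=3,    # forbid repeating any 3-token sequence
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Sampling may reduce the loops, but it cannot teach the base model who Fry is. That still takes fine-tuning.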


# Mmkay, it clearly does not know who Fry is or how to write a TV script.

### Let's train it to write a TV script for Futurama.


## Adding your data
1. Open the side panel, click on the folder icon, create a new folder called `my_data`, and drag and drop your data into that side panel. I will demonstrate during class.

2. Also, create a new folder called `my_trained_models`. That is where we will temporarily store our trained model.

Load your data
* You can download the data I used here:  UPLOAD LINK SOON.

In [6]:
directory = "/content/my_data/"  # Replace with the path to your directory containing the files
model_output_path = "/content/my_trained_models/"
train_fraction = 0.8  # fraction of the text used for training
# Read documents from the directory
combined_text = read_documents_from_directory(directory)
combined_text = re.sub(r'\n+', '\n', combined_text).strip()  # Remove excess newline characters

# Split the text into training and validation sets
split_index = int(train_fraction * len(combined_text))
train_text = combined_text[:split_index]
val_text = combined_text[split_index:]

# Save the training and validation data as text files
with open("train.txt", "w") as f:
    f.write(train_text)
with open("val.txt", "w") as f:
    f.write(val_text)
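
The split above cuts at a character index, so the last training line and the first validation line can each be half a line of dialogue. A minimal sketch of one alternative is to move the cut forward to the next newline. The helper name `split_on_line_boundary` is hypothetical and is not part of the original sources.

```python
def split_on_line_boundary(text, train_fraction=0.8):
    """Split text near train_fraction, but only at a newline."""
    split_index = int(train_fraction * len(text))

    # Move the cut forward to the next newline so no line is split in two.
    newline_index = text.find("\n", split_index)
    if newline_index == -1:
        # No newline after the cut: everything goes to training, validation is empty.
        return text, ""

    train_text = text[: newline_index + 1]
    val_text = text[newline_index + 1 :]
    return train_text, val_text
```

It would be called as `train_text, val_text = split_on_line_boundary(combined_text, train_fraction)`. With the cell's own settings the two pieces still add up to the full text, only the cut moves by at most one line.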


In [10]:
len(train_text)
print(train_text[:1000])

FUTURAMA
                                       Episode 108 
                                "A BIG PIECE OF GARBAGE"
                                           By
                                      Lewis Morton
                         Transcribed by Dave, The Neutral Planet
               
               [Planet Express: Meeting Room. The crew are sat around the table.]
 
               
               
                                     FARNSWORTH
                         Good news, everyone. Tomorrow you'll 
                         be making a delivery to Ebola 9, the 
                         virus planet.
 
               
                                     HERMES
                         Why can't they go today?
               
                                     FARNSWORTH
                         Because tonight's a special night and 
                         I want all of you to be alive. It's 
                         the Academy of Inventors annual symposium.
 
   

The next cells use the combined text data to fine-tune the GPT-2 model with the provided training arguments. The trained model and tokenizer are then saved to the specified output directory.
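
The dataset cell below passes `block_size=128` to `TextDataset`. As a rough sketch of what that setting means, assume the dataset tokenizes the whole file and then cuts the token stream into consecutive blocks of exactly `block_size` tokens, dropping the leftover tail. That is an assumption about the library's behavior, and the helper `chunk_token_ids` is hypothetical, written only to illustrate it.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Cut a flat list of token ids into consecutive full-length blocks."""
    blocks = []
    # Stop early enough that every block has exactly block_size tokens;
    # any shorter tail at the end is dropped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks
```

For example, `chunk_token_ids(list(range(10)), block_size=4)` gives two blocks of four ids and drops the last two. Each block becomes one training example, so a larger `block_size` gives the model more context per example but fewer examples.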

In [7]:

# Prepare the dataset
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
val_dataset = TextDataset(tokenizer=tokenizer, file_path="val.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set up the training arguments
training_args = TrainingArguments(
    output_dir=model_output_path,
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=33,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
)

# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
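
The `Trainer` above receives `val_dataset`, but the training arguments shown do not set an evaluation schedule, so the validation set is only used if evaluation is requested explicitly, for example with `trainer.evaluate()`. A common way to read the resulting loss is as perplexity, the exponential of the mean cross-entropy loss. The helper below is a minimal sketch, and the name `perplexity_from_loss` is hypothetical.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(eval_loss)

# Hypothetical usage once training has finished:
# metrics = trainer.evaluate()              # returns a dict that includes "eval_loss"
# print(perplexity_from_loss(metrics["eval_loss"]))
```

A validation perplexity that is much higher than the training perplexity would suggest the model is memorizing the training scripts.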




In [8]:
## THIS TAKES 30 MINS FOR JUST 10 EPOCHS SO I AM NOT GOING TO DO THAT DURING CLASS....
## AND ~2HRS FOR 33 EPOCHS
trainer.train()

Step,Training Loss
500,0.7237
1000,0.5069
1500,0.3828
2000,0.2888
2500,0.229
3000,0.1998
3500,0.1765
4000,0.1588
4500,0.1478
5000,0.1379


TrainOutput(global_step=12936, training_loss=0.1753516689670919, metrics={'train_runtime': 6288.4681, 'train_samples_per_second': 8.218, 'train_steps_per_second': 2.057, 'total_flos': 1.1998348622954496e+16, 'train_loss': 0.1753516689670919, 'epoch': 33.0})
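
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so one step is one batch of 4 blocks. Neither assumption is stated in the output itself.

```python
import math

# Values copied from the TrainOutput above and from the training arguments.
global_step = 12936
num_epochs = 33
batch_size = 4
block_size = 128
train_runtime_s = 6288.4681
samples_per_second = 8.218

# Optimizer steps in one pass over the training set.
steps_per_epoch = global_step // num_epochs

# Total samples processed, divided by epochs, estimates the dataset size in blocks.
approx_examples = round(train_runtime_s * samples_per_second / num_epochs)

# Rough number of training tokens, at block_size tokens per block.
approx_tokens = approx_examples * block_size

runtime_hours = train_runtime_s / 3600

print(steps_per_epoch, approx_examples, approx_tokens, round(runtime_hours, 2))
```

Under those assumptions the estimated dataset size divided by the batch size, rounded up, matches the steps per epoch, which is a quick sanity check that the run covered the whole training file each epoch.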

In [9]:
# Save the model
trainer.save_model(model_output_path)

# Save the tokenizer
tokenizer.save_pretrained(model_output_path)


('/content/my_trained_models/tokenizer_config.json',
 '/content/my_trained_models/special_tokens_map.json',
 '/content/my_trained_models/vocab.json',
 '/content/my_trained_models/merges.txt',
 '/content/my_trained_models/added_tokens.json')

In [10]:
print("SAVED MODELS LOCALLY YO!!!!!!")

SAVED MODELS LOCALLY YO!!!!!!


In [11]:
directory = "/content/my_data/"  # Replace with the path to your directory containing the files
model_output_path = "/content/my_trained_models/"

model = GPT2LMHeadModel.from_pretrained(model_output_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)


In [14]:

# Test the chatbot
prompt = "Write a TV show script for the TV show Futurama about Fry getting stuck in a hole."  # Replace with your desired prompt
# prompt = "What is bulk metallic glass?"  # Replace with your desired prompt

response = generate_response(model, tokenizer, prompt, max_length=1000)
print("Generated response:", response)

Generated response: Write a TV show script for the TV show Futurama about Fry getting stuck in a hole.
 
                         
               
                                     FRY
                          Yeah, that'd be a timesaver.
                
                [Cut to: Planet Express: Fry's Bedroom. The crew are asleep.]
                
                                           BENDER
                          Hey, you don't wanna hear about that 
                       trouble with the TV!
 
            
                                  BENDER
                     Let's just say it'll put a smile on your 
                   face.
           
                                  BENDER
                     OK, OK, keep your space pants on. I'll 
                    take care of this.
 
            
           [Cut to: Planet Express: Fry's Bedroom. The crew are asleep.]
            
                                      BENDER
                          Alright, OK, I'll 
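
As the output above shows, the decoded text starts with the prompt itself, because `generate` returns the prompt tokens followed by the new ones. If only the generated script is wanted, a small helper can remove the echoed prompt. This is a sketch, and `strip_prompt` is a hypothetical name that does not appear in the original sources.

```python
def strip_prompt(response, prompt):
    """Return the response without the echoed prompt at the start, if present."""
    if response.startswith(prompt):
        # Keep everything after the prompt, including the script's own indentation.
        return response[len(prompt):]
    # The prompt was not echoed verbatim, so leave the response untouched.
    return response
```

It would be used as `print(strip_prompt(response, prompt))` right after the call to `generate_response`.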

In [15]:
## PUSH THE MODEL AND TOKENIZER TO YOUR HUGGING FACE ACCOUNT.

model.push_to_hub(repo_id='KingZack/future-futurama-maker')
tokenizer.push_to_hub('KingZack/future-futurama-maker')





CommitInfo(commit_url='https://huggingface.co/KingZack/future-futurama-maker/commit/4f79dd3471035e7a73e4c8a509d1377f085f5dd4', commit_message='Upload tokenizer', commit_description='', oid='4f79dd3471035e7a73e4c8a509d1377f085f5dd4', pr_url=None, pr_revision=None, pr_num=None)

### Check out the model you made on the official Hub.
--> https://huggingface.co/KingZack/future-futurama-maker

## Now load it from the hub and test it out.

In [16]:
# Use a pipeline as a high-level helper
# from transformers import pipeline

# pipe = pipeline("text-generation", model="KingZack/future-futurama-maker")


# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KingZack/future-futurama-maker")
model = AutoModelForCausalLM.from_pretrained("KingZack/future-futurama-maker")


In [17]:
# Test the chatbot
prompt = 'Write a script for the TV show Futurama about Fry getting stuck in a hole.'

response = generate_response(model, tokenizer, prompt, max_length=1000)
print("Generated response:", response)

Generated response: Write a script for the TV show Futurama about Fry getting stuck in a hole.
 
                         
               
                                     FRY
                          Maybe I should take Fry on the Luna 
                           Rover ride. You get to wear a space 
                           suit and drive around on the surface. 
                         
               
              [The ride takes Fry and Bender by rocket ship. They open the door 
            and look around.]
 
            
                                   LEELA
                     OK, guys, let's get to work! I'll be 
 
                    out in a second.
 
           
         [Cut to: Outside Luna Park. The crew walk across the surface, reading a map.]
 
          
                                    FARNSWORTH
 
                       OK, guys, let's get to work! I'll be 
                    out in a second.
            
            [Cut to: Outside Luna Park. The cr