NiceDanger4U committed
Commit c0520e2
1 Parent(s): 18f6382

Upload 5 files

Casual Language Modeling_Student.ipynb ADDED
# Causal Language Modeling

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.

Before you begin, make sure you have the necessary library installed:

```bash
pip install transformers
```

```python
!pip install transformers
```

```python
import torch
import transformers
```

The BLOOM model and its various versions were proposed through the BigScience Workshop. BigScience is inspired by other open-science initiatives in which researchers have pooled their time and resources to collectively achieve a higher impact. The architecture of BLOOM is essentially similar to GPT-3 (an auto-regressive model for next-token prediction), but it has been trained on 46 natural languages and 13 programming languages.

Load the tokenizer and the model:

```python
from transformers import BloomForCausalLM, BloomTokenizerFast

model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
```

Define an input sentence, tokenize it, and predict the next tokens in the sentence:

```python
prompt = "I really wish I could"           # the text string that the LLM will continue
result_length = len(prompt.split()) + 15   # rough generation budget: the prompt plus about 15 more tokens
inputs = tokenizer(prompt, return_tensors="pt")
```

Three different strategies for generating text with causal language modeling:

```python
# Greedy Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length)[0]))
```

Output:
```
I really wish I could have a better idea of what I am doing. I am not sure what
```

```python
# Beam Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,             # number of beams for beam search
                                      no_repeat_ngram_size=2,  # size of n-grams that may not repeat
                                      early_stopping=True      # stop early if the model predicts an end-of-sequence token
                                      )[0]))
```

Output:
```
I really wish I could do it myself, but I can't. I don't have the time or the
```

```python
# Sampling with Top-k + Top-p
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      do_sample=True,
                                      top_k=50,   # consider only the 50 most likely tokens
                                      top_p=0.9   # keep tokens within a cumulative probability of 0.9
                                      )[0]))
```

Output:
```
I really wish I could have helped you out. But if you need a reminder or if you have
```

Does our model contain bias? Almost certainly. Let's see what happens when we prompt with "man" vs. "woman" in terms of career text prediction.

```python
prompt = "the man works as a"
result_length = len(prompt.split()) + 3
inputs = tokenizer(prompt, return_tensors="pt")

# Beam Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,
                                      no_repeat_ngram_size=2,
                                      early_stopping=True)[0]))
```

```python
prompt = "the woman works as a"
result_length = len(prompt.split()) + 3
inputs = tokenizer(prompt, return_tensors="pt")

print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,
                                      no_repeat_ngram_size=2,
                                      early_stopping=True)[0]))
```

Output:
```
the woman works as a housekeeper
```

Bias confirmed. This could be problematic when deploying a chatbot to a production environment.
Causal Language Modeling_Student.ipynb ADDED
# Causal Language Modeling

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.

Before you begin, make sure you have the necessary library installed:

```bash
pip install transformers
```

```python
!pip install transformers
```

```python
import torch
import transformers
```

The BLOOM model and its various versions were proposed through the BigScience Workshop. BigScience is inspired by other open-science initiatives in which researchers have pooled their time and resources to collectively achieve a higher impact. The architecture of BLOOM is essentially similar to GPT-3 (an auto-regressive model for next-token prediction), but it has been trained on 46 natural languages and 13 programming languages.

Load the tokenizer and the model:

```python
from transformers import BloomForCausalLM, BloomTokenizerFast

model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
```

Define an input sentence, tokenize it, and predict the next tokens in the sentence:

```python
prompt = "I really wish I could"           # the text string that the LLM will continue
result_length = len(prompt.split()) + 15   # rough generation budget: the prompt plus about 15 more tokens
inputs = tokenizer(prompt, return_tensors="pt")
```

Three different strategies for generating text with causal language modeling:

```python
# Greedy Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length)[0]))
```

```python
# Beam Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,             # number of beams for beam search
                                      no_repeat_ngram_size=2,  # size of n-grams that may not repeat
                                      early_stopping=True      # stop early if the model predicts an end-of-sequence token
                                      )[0]))
```

```python
# Sampling with Top-k + Top-p
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      do_sample=True,
                                      top_k=50,   # consider only the 50 most likely tokens
                                      top_p=0.9   # keep tokens within a cumulative probability of 0.9
                                      )[0]))
```

Does our model contain bias? Almost certainly. Let's see what happens when we prompt with "man" vs. "woman" in terms of career text prediction.

```python
prompt = "the man works as a"
result_length = len(prompt.split()) + 3
inputs = tokenizer(prompt, return_tensors="pt")

# Beam Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,
                                      no_repeat_ngram_size=2,
                                      early_stopping=True)[0]))
```

```python
prompt = "the woman works as a"
result_length = len(prompt.split()) + 3
inputs = tokenizer(prompt, return_tensors="pt")

print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,
                                      no_repeat_ngram_size=2,
                                      early_stopping=True)[0]))
```

Bias confirmed. This could be problematic when deploying a chatbot to a production environment.
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "</s>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
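This file tells `transformers` which strings play each special-token role (beginning/end of sequence, mask, padding, and so on). After a tokenizer is loaded from the repository, the entries surface as attributes; a quick check, where `user/repo` is a hypothetical placeholder for this repository's id:

```python
from transformers import AutoTokenizer

# "user/repo" is a hypothetical placeholder for this repository's id.
tok = AutoTokenizer.from_pretrained("user/repo")
print(tok.bos_token, tok.eos_token, tok.mask_token, tok.unk_token)
print(tok.mask_token_id)  # the string is resolved to an integer id via the vocabulary
```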
tokenizer.json ADDED
The diff for this file is too large to render.
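tokenizer.json is the single-file serialization used by the fast (Rust-backed) tokenizers; it bundles the vocabulary, merge rules, normalizer, and post-processor, which is why the diff is too large to render. If a local copy of the file is available, it can be inspected directly with the `tokenizers` library; a brief sketch under that assumption:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # path to a local copy of the file
print(tok.get_vocab_size())
print(tok.encode("Hello world").tokens)      # subword pieces produced by the tokenizer's model
```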
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "name_or_path": "distilroberta-base",
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "special_tokens_map_file": null,
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
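Two fields here do the most work: `tokenizer_class` tells `AutoTokenizer` which Python class to instantiate, and `model_max_length` is the cap applied when truncation is requested. A sketch of the latter, again using a hypothetical repo id:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("user/repo")    # hypothetical repo id
enc = tok("word " * 1000, truncation=True)          # truncated to tok.model_max_length
print(tok.model_max_length, len(enc["input_ids"]))  # 512, and at most 512
```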