File size: 7,307 Bytes
c0520e2 |
1 |
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[{"file_id":"1HFDC1DdA836J9Sb2jiftwgcBr8lU1fuR","timestamp":1695365565765}],"gpuType":"T4"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["# Causal Language Modeling"],"metadata":{"id":"WWwTKhKrSu-Y"}},{"cell_type":"markdown","source":["Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on\n","the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model."],"metadata":{"id":"5Q1NdBUvxSpb"}},{"cell_type":"markdown","source":["Before you begin, make sure you have the necessary library installed:\n","\n","```bash\n","pip install transformers\n","```"],"metadata":{"id":"3QC3EQyzuwTa"}},{"cell_type":"code","source":["pip install transformers"],"metadata":{"id":"rAnhbPHDPcon"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":2,"metadata":{"id":"nwATTgkGPbkb","executionInfo":{"status":"ok","timestamp":1697445539730,"user_tz":-480,"elapsed":3627,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"outputs":[],"source":["import torch"]},{"cell_type":"code","source":["import transformers"],"metadata":{"id":"xf4Syb3IPjfX","executionInfo":{"status":"ok","timestamp":1697446708040,"user_tz":-480,"elapsed":2265,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":["The BLOOM model has been proposed with its various versions through the BigScience Workshop. BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact. The architecture of BLOOM is essentially similar to GPT3 (auto-regressive model for next token prediction), but has been trained on 46 different languages and 13 programming languages."],"metadata":{"id":"v-9wILDcvfED"}},{"cell_type":"code","source":["from transformers import BloomForCausalLM"],"metadata":{"id":"bOheix9FP9A4","executionInfo":{"status":"ok","timestamp":1697446730069,"user_tz":-480,"elapsed":1050,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":4,"outputs":[]},{"cell_type":"code","source":["from transformers import BloomTokenizerFast"],"metadata":{"id":"oz3M9SVZQAfY","executionInfo":{"status":"ok","timestamp":1697446764948,"user_tz":-480,"elapsed":559,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":5,"outputs":[]},{"cell_type":"markdown","source":["load the tokenizer and configuration objects"],"metadata":{"id":"ML22rhjUwpG6"}},{"cell_type":"code","source":["model = BloomForCausalLM.from_pretrained(\"bigscience/bloom-560m\")"],"metadata":{"id":"F0CrLrOTQC-v"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["tokenizer = BloomTokenizerFast.from_pretrained(\"bigscience/bloom-560m\")"],"metadata":{"id":"4-JHUxnxQKzq"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Define an input sentence, tokenize it and predict the next tokens int eh senetence"],"metadata":{"id":"2bD8ZoWHxBKb"}},{"cell_type":"code","source":["prompt = \"I really wish I could\" #This is the text string that the LLM will finish\n","result_length = len(prompt.split()) +15 #This determines the number of words to follow the prompt\n","inputs = tokenizer(prompt, return_tensors=\"pt\")"],"metadata":{"id":"s_nr12LEQt2T","executionInfo":{"status":"ok","timestamp":1697447076692,"user_tz":-480,"elapsed":516,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":8,"outputs":[]},{"cell_type":"markdown","source":["Three different strategies for generating text using Casual language Modeling"],"metadata":{"id":"0VOX2BaL2Myk"}},{"cell_type":"code","source":["# Greedy Search\n","print(tokenizer.decode(model.generate(inputs[\"input_ids\"],\n"," max_length=result_length\n"," )[0]))"],"metadata":{"id":"hbYCbSOfR9EN"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Beam Search\n","print(tokenizer.decode(model.generate(inputs[\"input_ids\"],\n"," max_length=result_length,\n"," num_beams=2, # number of beams of bream search\n"," no_repeat_ngram_size=2, # size of n-grams to avoid repoeating\n"," early_stopping=True # whether to stop the generation process early of the model predicts an end of sequence token\n"," )[0]))"],"metadata":{"id":"rIfYc0n7SGLW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Sampling Top-k + Top-p\n","print(tokenizer.decode(model.generate(inputs[\"input_ids\"],\n"," max_length=result_length,\n"," do_sample=True,\n"," top_k=50, # consider top 50 most likely token\n"," top_p=0.9 # consider tokens with a curmulative probabilities\n"," )[0]))"],"metadata":{"id":"logIR--1STwi"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Does our model contain bias? Almost certainly. Let's see what happens when we prompt with \"man\" vs. \"woman\" in terms of career text prediction."],"metadata":{"id":"ln8wIfxXyXY_"}},{"cell_type":"code","source":["prompt = \"the man works as a\"\n","result_length = len(prompt.split()) +3 #This determines the number of words to follow the prompt\n","inputs = tokenizer(prompt, return_tensors=\"pt\")"],"metadata":{"id":"PfgCdmP-UGq8","executionInfo":{"status":"ok","timestamp":1697447173953,"user_tz":-480,"elapsed":565,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":12,"outputs":[]},{"cell_type":"code","source":["# Beam Search\n","print(tokenizer.decode(model.generate(inputs[\"input_ids\"],\n"," max_length=result_length,\n"," num_beams=2,\n"," no_repeat_ngram_size=2,\n"," early_stopping=True\n"," )[0]))"],"metadata":{"id":"o9r3ziUtYOBW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["prompt = \"the woman works as a\"\n","result_length = len(prompt.split()) +3 #This determines the number of words to follow the prompt\n","inputs = tokenizer(prompt, return_tensors=\"pt\")"],"metadata":{"id":"yACTv9-WYfNz","executionInfo":{"status":"ok","timestamp":1697447200275,"user_tz":-480,"elapsed":1074,"user":{"displayName":"Jane Zhang","userId":"16942398962227920032"}}},"execution_count":14,"outputs":[]},{"cell_type":"code","source":["print(tokenizer.decode(model.generate(inputs[\"input_ids\"],\n"," max_length=result_length,\n"," num_beams=2,\n"," no_repeat_ngram_size=2,\n"," early_stopping=True\n"," )[0]))"],"metadata":{"id":"BdiNc0mjYpUn"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Bias confirmed. This could be problematic in terms of deploying a chatbot to a production environment."],"metadata":{"id":"JeNel07xyigr"}}]} |