{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Airline Sentiment Prediction using BERT\n", "\n", "### Approach\n", "First I analysed the data and I found that there was a huge imbalance in the dataset, to resolve this I used Textattack for augumentation of data.\n", "Before the augumenting the dataset I used the following techniques to clean the data & reduce the noise:\n", "- Removed the @usernames\n", "- Removed the URLs\n", "- Removed hashtags\n", "- Replacement of emojis with their meaning\n", "\n", "After cleaning the data I used EasyDataAugment of Textattack to augment the data, augmenting the data helped me to increase the accuracy of the model by more than 3%. I also tried using Clare(It replaces the words with their synonyms) but that was very resource intensive & it was taking very long to get output.\n", "\n", "### Model\n", "Since, this was a binary classification task I used BERT for training the model. I used the pretrained BERT model from Huggingface transformers library. I used the BERT model with the following parameters:\n", "- BERT-base-uncased\n", "- Max length of the input sequence: 128\n", "- Learning rate: 3e-5\n", "- Batch size: 32\n", "\n", "### Results\n", "The dataset was split into 80:20 ratio for training & validation.\n", "I got the following results after training the model:\n", "Training loss: 0.0137\n", "Validation loss: 0.1209\n", "Training accuracy: 0.9955\n", "Validation accuracy: 0.9794\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "========================================================================================================================================" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Install the required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nwuC7017BwyD" }, "outputs": [], "source": [ "%pip install transformers\n", "%pip install emoji\n", "%pip install numpy pandas\n", "%pip install scikit-learn\n", "%pip install textattack" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Importing the libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iM2I9UEjm_pE" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pprint import pprint" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Reading the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 676 }, "id": "mrnzcvkzm_pF", "outputId": "61550835-27fc-4049-f3a3-f9ab9e1ba1bb" }, "outputs": [], "source": [ "df = pd.read_csv(\"airline_sentiment_analysis.csv\")\n", "df.head(20)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Assigning 1 to positive sentiment and 0 to negative sentiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 676 }, "id": "Jbl-wjpWm_pG", "outputId": "19c8b0ca-4506-4960-d588-505feecf678e" }, "outputs": [], "source": [ "for label in df['airline_sentiment']:\n", " if label == 'positive':\n", " df['airline_sentiment'].replace(label, 1, inplace=True)\n", " elif label == 'negative':\n", " df['airline_sentiment'].replace(label, 0, inplace=True)\n", "df.head(20)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Remove the @usernames, URLs, hashtags & Replace the emojis with their meaning" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 676 }, "id": "ApZHkGw2m_pG", "outputId": "330f44ff-3314-4392-eef4-69f2994b1cae" }, "outputs": [], "source": [ "\n", "import emoji\n", "for i,r in df.iterrows():\n", " \n", " df.loc[i,\"text\"] = emoji.demojize(df.loc[i,\"text\"])\n", " df.loc[i,\"text\"] = df.loc[i,\"text\"].replace(\":\",\" \")\n", " df.loc[i,\"text\"] = ' '.join(df.loc[i,\"text\"].split())\n", "\n", "df['text'] = df['text'].str.replace(\"@[A-Za-z0-9]+\", \"\",regex=True)\n", "df['text'] = df['text'].str.replace(\"#\", \"\",regex=True)\n", "df['text'] = df['text'].str.replace(\"https?://[A-Za-z0-9./]+\", \"\",regex=True)\n", "df['text'] = df['text'].str.replace(\"[^a-zA-Z.!?']\", \" \",regex=True)\n", "\n", "\n", "df.head(20)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Augumenting Positive Sentiment using EasyDataAugment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lNr4OWDaGYww" }, "outputs": [], "source": [ "positive_feedback = (df.loc[df[\"airline_sentiment\"] == 1])[\"text\"]\n", "positive_feedback = positive_feedback.tolist()\n", "# pprint(positive_feedback)\n", "\n", "from textattack.augmentation import EasyDataAugmenter\n", "esy_aug = EasyDataAugmenter()\n", "aug_list = []\n", "for sen in positive_feedback:\n", " aug_list.append(esy_aug.augment(sen))\n", "serial_list = []\n", "for l in aug_list:\n", " for sen in l:\n", " serial_list.append(sen)\n", "df = df.drop(df.columns[[0]],axis=1)\n", "\n", "df2 = pd.DataFrame(list(zip([1]*len(serial_list),serial_list)),columns=[\"airline_sentiment\",\"text\"])\n", "\n", "df = pd.concat([df,df2])\n", "\n", "df.to_csv(\"modified.csv\") #save the modified dataset\n", "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Split dataset into train & validation in 80:20 ratio" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MrynUQ9Xm_pG" }, "outputs": [], "source": [ "# split the data into train and test\n", "from sklearn.model_selection import train_test_split\n", "\n", "train, test = train_test_split(df, test_size=0.2, random_state=42)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Initalise the BERT model & tokenizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MambfTNXm_pG", "outputId": "f0e11223-8e74-445a-8cdc-cc8492f26b14" }, "outputs": [], "source": [ "from transformers import BertTokenizer, TFBertForSequenceClassification\n", "from transformers import InputExample, InputFeatures\n", "import tensorflow as tf\n", "\n", "model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Utility function to convert the data into the format required by BERT" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uxgZ7GsEm_pH" }, "outputs": [], "source": [ "def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): \n", " train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case\n", " text_a = x[DATA_COLUMN], \n", " text_b = None,\n", " label = x[LABEL_COLUMN]), axis = 1)\n", "\n", " validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case\n", " text_a = x[DATA_COLUMN], \n", " text_b = None,\n", " label = x[LABEL_COLUMN]), axis = 1)\n", " \n", " return train_InputExamples, validation_InputExamples\n", "\n", " \n", "def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):\n", " features = [] # -> will hold InputFeatures to be converted later\n", "\n", " for e in examples:\n", " # Documentation is really strong for this method, so please take a look at it\n", " input_dict = tokenizer.encode_plus(\n", " e.text_a,\n", " add_special_tokens=True,\n", " max_length=max_length, # truncates if len(s) > max_length\n", " return_token_type_ids=True,\n", " return_attention_mask=True,\n", " pad_to_max_length=True, # pads to the right by default # CHECK THIS for pad_to_max_length\n", " truncation=True\n", " )\n", "\n", " input_ids, token_type_ids, attention_mask = (input_dict[\"input_ids\"],\n", " input_dict[\"token_type_ids\"], input_dict['attention_mask'])\n", "\n", " features.append(\n", " InputFeatures(\n", " input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label\n", " )\n", " )\n", "\n", " def gen():\n", " for f in features:\n", " yield (\n", " {\n", " \"input_ids\": f.input_ids,\n", " \"attention_mask\": f.attention_mask,\n", " \"token_type_ids\": f.token_type_ids,\n", " },\n", " f.label,\n", " )\n", "\n", " return tf.data.Dataset.from_generator(\n", " gen,\n", " ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n", " (\n", " {\n", " \"input_ids\": tf.TensorShape([None]),\n", " \"attention_mask\": tf.TensorShape([None]),\n", " \"token_type_ids\": tf.TensorShape([None]),\n", " },\n", " tf.TensorShape([]),\n", " ),\n", " )\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "BERT model for training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tPsHpWhJm_pH", "outputId": "a9f7b2b8-d0bb-474b-d91a-25f0c8a40905" }, "outputs": [], "source": [ "DATA_COLUMN = 'text'\n", "LABEL_COLUMN = 'airline_sentiment'\n", "\n", "\n", "train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)\n", "\n", "train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)\n", "train_data = train_data.shuffle(100).batch(32).repeat(2)\n", "\n", "validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)\n", "validation_data = validation_data.batch(32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GDcgmUOCm_pI", "outputId": "2f262b78-65f4-4cfc-deb3-a51d2b499eab" }, "outputs": [], "source": [ "model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), \n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), \n", " metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])\n", "\n", "model.fit(train_data, epochs=2, validation_data=validation_data)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Saving the trained weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "K3pzOJS8R1dx" }, "outputs": [], "source": [ "model.save_weights(\"weights.h5\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Inference: Predicting the sentiment of the tweet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "BLcz_yKOr38C", "outputId": "a56245b6-395a-4ff1-aeba-ab30c98cedb1" }, "outputs": [], "source": [ "pred_data = [\"@abc The flight was great\", \"@abc ☚ī¸\",\"🎊 it was bad experience\"]\n", "pred_data = pd.DataFrame(pred_data)\n", "\n", "\n", "for i,r in pred_data.iterrows():\n", " pred_data.loc[i,0] = emoji.demojize(r[0])\n", " pred_data.loc[i,0] = r[0].replace(\":\",\" \")\n", " pred_data.loc[i,0] = ' '.join(r[0].split())\n", "\n", "\n", "pred_data[0] = pred_data[0].str.replace(\"@[A-Za-z0-9]+\", \"\",regex=True)\n", "pred_data[0] = pred_data[0].str.replace(\"#\", \"\",regex=True)\n", "pred_data[0] = pred_data[0].str.replace(\"https?://[A-Za-z0-9./]+\", \"\",regex=True)\n", "pred_data[0] = pred_data[0].str.replace(\"[^a-zA-Z.!?']\", \" \",regex=True)\n", "\n", "pred_data.head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lrTfXZzLsKd6", "outputId": "0f2c6e24-bc62-45a6-bc71-9b6918ab961e" }, "outputs": [], "source": [ "pred_data = pred_data[0].values.tolist()\n", "print(pred_data)\n", "tf_batch = tokenizer(pred_data, max_length=128, padding=True, truncation=True, return_tensors='tf')\n", "tf_outputs = model(tf_batch)\n", "tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)\n", "labels = ['Negative','Positive']\n", "label = tf.argmax(tf_predictions, axis=1)\n", "label = label.numpy()\n", "for i in range(len(pred_data)):\n", " print(pred_data[i], \": \\n\", labels[label[i]])" ] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "497fb9213e55408ad8b1ca9a37e341ac93888d86a532670599a03e0c8054f45a" } } }, "nbformat": 4, "nbformat_minor": 0 }