{ "cells": [ { "cell_type": "markdown", "id": "c043d99f-80e3-4553-abec-f47d668d0811", "metadata": {}, "source": [ "# Economic News Identification Using an LSTM Neural Network Approach" ] }, { "cell_type": "markdown", "id": "0ab92f40-2ad3-4c69-8594-93bcc5780201", "metadata": {}, "source": [ "### Jan Maciejowski, Fabian Perez, Ali Rammal, Louis Golding, Abdullah Ghosheh, Ayah El Barq" ] }, { "cell_type": "markdown", "id": "28a4bf63", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "id": "5021c11b", "metadata": {}, "source": [ "\n", "The main objective of this project is to build a machine learning model that automatically classifies a given piece of text as either an economic news article or a non-economic one with a high degree of accuracy. The model is trained on a labeled dataset in which each text has already been marked as relevant or non-relevant. It learns to recognize patterns and features tied to high-context keywords, which lets it generalize and make accurate predictions on unseen data.\n", "\n", "This capability is useful for applications such as content filtering, media analysis, and information retrieval, where economic news must be distinguished from other types of text. 
By automating this process, the model aims to help manage and categorize large volumes of textual data efficiently, make digital content management systems more effective, and provide insight into the nature and distribution of information across media platforms.\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "3bduh-46OHFK", "metadata": { "id": "3bduh-46OHFK" }, "source": [ "# Preprocess" ] }, { "cell_type": "markdown", "id": "f1bc12d3", "metadata": {}, "source": [ "## Primary Steps\n", "### Packages & NLTK Data Downloads\n", "Aside from the essential packages for data handling and visualization (pandas, numpy, matplotlib), we use the TensorFlow Keras library, in particular `Tokenizer` and `pad_sequences`. The NLTK library handles the text-cleaning phase that prepares the text for vectorization. Three NLTK resources are needed: Punkt, which divides a text into a list of sentences using an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences; Stopwords, a list of words that appear frequently in a language and carry little topical meaning, which we remove; and WordNet, a lexical database of English that captures conceptual relationships between words such as hypernyms, hyponyms, synonyms, and antonyms.\n", "\n", "### Variable Selection and Cleaning\n", "As the dataset originally contains 15 variables in total, the most viable decision is to focus on the string variables, which provide the context needed to detect economic news articles. 
Furthermore, the variables ‘text’ and ‘headline’ (the only ones used for model training) provide the context needed for a model that generalizes well, while the remaining variables contribute little to prediction because they are metrics derived from a small number of survey judgments and may be biased.\n", "\t\n", "The ‘relevance’ variable (originally a string) was transformed into a binary variable where 1 = economic news article and 0 = any other. The dataset was then balanced to a 50:50 ratio of relevant and non-relevant articles by combining all relevant articles with an equally sized random sample of non-relevant ones. Headline and full text were concatenated into a single string. The text was cleaned by removing extra symbols, then split into words so that stop words could be removed and the remaining words lemmatized. From this point on, the text data is treated as clean.\n", "\n", "### Tokenization & Padding\n", "\n", "This step, also called vectorization, uses the `Tokenizer` class from TensorFlow to turn each text into a sequence of integers, so that the model can pick out articles with a higher frequency of words related to the field of interest. By default, all punctuation is removed, turning the texts into space-separated sequences of words. These sequences are then split into lists of tokens, and each token is replaced by its integer index. After this step, the data is ready for padding.\n", " \n", "Padding the vectorized sequences is required because the model expects inputs of equal length, while the articles differ in length. The padding process identifies the longest sequence and extends every other sequence to that length. Any unused positions are filled with zeros. 
\n", "\t\n", "From this point, our data was split into training, testing, and validation sets for the maximum assurance of our model’s generalizability towards new data. All these steps are seen below.\n" ] }, { "cell_type": "markdown", "id": "422b1866-5ebc-42cb-b8f3-13e50355d481", "metadata": {}, "source": [ "First, we must read in all necessary packages." ] }, { "cell_type": "code", "execution_count": 85, "id": "iUFDKHaVOHFN", "metadata": { "id": "iUFDKHaVOHFN" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tensorflow_addons\\utils\\tfa_eol_msg.py:23: UserWarning: \n", "\n", "TensorFlow Addons (TFA) has ended development and introduction of new features.\n", "TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.\n", "Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). \n", "\n", "For more information see: https://github.com/tensorflow/addons/issues/2807 \n", "\n", " warnings.warn(\n", "C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tensorflow_addons\\utils\\ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.12.0 and strictly below 2.15.0 (nightly versions are not supported). \n", " The versions of TensorFlow you are currently using is 2.15.0 and is not supported. \n", "Some things might work, some things might not.\n", "If you were to encounter a bug, do not file an issue.\n", "If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
\n", "You can find the compatibility matrix in TensorFlow Addon's readme:\n", "https://github.com/tensorflow/addons\n", " warnings.warn(\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import re\n", "import matplotlib.pyplot as plt\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.tokenize import word_tokenize\n", "from nltk.stem import WordNetLemmatizer\n", "\n", "import tensorflow as tf\n", "from tensorflow.keras.preprocessing.text import Tokenizer\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional\n", "from tensorflow.keras.callbacks import EarlyStopping\n", "\n", "from sklearn.metrics import confusion_matrix\n", "\n", "from tensorflow.keras.optimizers import Adamax\n", "from tensorflow.keras.optimizers import Nadam\n", "import tensorflow_addons as tfa" ] }, { "cell_type": "markdown", "id": "29f0182f-e319-4777-a048-ee20ff91846b", "metadata": {}, "source": [ "In addition, we need several NLTK resources: the Punkt tokenizer models, the English stop-word list used to remove stop words, and the WordNet data used for lemmatization." 
] }, { "cell_type": "code", "execution_count": 2, "id": "69a7642b-c79a-4752-8864-7d2ffa7edc14", "metadata": {}, "outputs": [], "source": [ "# Ensure you have downloaded the necessary NLTK data\n", "# nltk.download('punkt')\n", "# nltk.download('stopwords')\n", "# nltk.download('wordnet')" ] }, { "cell_type": "markdown", "id": "6f88343b-9426-4b38-b585-8a301e490bd5", "metadata": {}, "source": [ "Reading the data" ] }, { "cell_type": "code", "execution_count": 3, "id": "beLBhgRLOHFP", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "beLBhgRLOHFP", "outputId": "fe017141-8c90-4f18-d260-b91583d3e971" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 8000 entries, 0 to 7999\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 _unit_id 8000 non-null int64 \n", " 1 _golden 8000 non-null bool \n", " 2 _unit_state 8000 non-null object \n", " 3 _trusted_judgments 8000 non-null int64 \n", " 4 _last_judgment_at 8000 non-null object \n", " 5 positivity 1420 non-null float64\n", " 6 positivity:confidence 3775 non-null float64\n", " 7 relevance 8000 non-null object \n", " 8 relevance:confidence 8000 non-null float64\n", " 9 articleid 8000 non-null object \n", " 10 date 8000 non-null object \n", " 11 headline 8000 non-null object \n", " 12 positivity_gold 0 non-null float64\n", " 13 relevance_gold 0 non-null float64\n", " 14 text 8000 non-null object \n", "dtypes: bool(1), float64(5), int64(2), object(7)\n", "memory usage: 882.9+ KB\n" ] } ], "source": [ "df = pd.read_csv(\"./US-Economic-News.csv\", delimiter=',', encoding= 'ISO-8859-1')\n", "\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 4, "id": "2pZ91v4YOHFQ", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 573 }, "id": "2pZ91v4YOHFQ", "outputId": "2ecb5bf5-5337-4d44-bf14-f45d0725524e" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>_unit_id</th><th>_golden</th><th>_unit_state</th><th>_trusted_judgments</th><th>_last_judgment_at</th><th>positivity</th><th>positivity:confidence</th><th>relevance</th><th>relevance:confidence</th><th>articleid</th><th>date</th><th>headline</th><th>positivity_gold</th><th>relevance_gold</th><th>text</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>842613455</td><td>False</td><td>finalized</td><td>3</td><td>12/5/15 17:48</td><td>3.0</td><td>0.6400</td><td>yes</td><td>0.640</td><td>wsj_398217788</td><td>8/14/91</td><td>Yields on CDs Fell in the Latest Week</td><td>NaN</td><td>NaN</td><td>NEW YORK -- Yields on most certificates of dep...</td></tr>\n",
" <tr><th>1</th><td>842613456</td><td>False</td><td>finalized</td><td>3</td><td>12/5/15 16:54</td><td>NaN</td><td>NaN</td><td>no</td><td>1.000</td><td>wsj_399019502</td><td>8/21/07</td><td>The Morning Brief: White House Seeks to Limit ...</td><td>NaN</td><td>NaN</td><td>The Wall Street Journal Online&lt;/br&gt;&lt;/br&gt;The Mo...</td></tr>\n",
" <tr><th>2</th><td>842613457</td><td>False</td><td>finalized</td><td>3</td><td>12/5/15 1:59</td><td>NaN</td><td>NaN</td><td>no</td><td>1.000</td><td>wsj_398284048</td><td>11/14/91</td><td>Banking Bill Negotiators Set Compromise --- Pl...</td><td>NaN</td><td>NaN</td><td>WASHINGTON -- In an effort to achieve banking ...</td></tr>\n",
" <tr><th>3</th><td>842613458</td><td>False</td><td>finalized</td><td>3</td><td>12/5/15 2:19</td><td>NaN</td><td>0.0000</td><td>no</td><td>0.675</td><td>wsj_397959018</td><td>6/16/86</td><td>Manager's Journal: Sniffing Out Drug Abusers I...</td><td>NaN</td><td>NaN</td><td>The statistics on the enormous costs of employ...</td></tr>\n",
" <tr><th>4</th><td>842613459</td><td>False</td><td>finalized</td><td>3</td><td>12/5/15 17:48</td><td>3.0</td><td>0.3257</td><td>yes</td><td>0.640</td><td>wsj_398838054</td><td>10/4/02</td><td>Currency Trading: Dollar Remains in Tight Rang...</td><td>NaN</td><td>NaN</td><td>NEW YORK -- Indecision marked the dollar's ton...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " _unit_id _golden _unit_state _trusted_judgments _last_judgment_at \\\n", "0 842613455 False finalized 3 12/5/15 17:48 \n", "1 842613456 False finalized 3 12/5/15 16:54 \n", "2 842613457 False finalized 3 12/5/15 1:59 \n", "3 842613458 False finalized 3 12/5/15 2:19 \n", "4 842613459 False finalized 3 12/5/15 17:48 \n", "\n", " positivity positivity:confidence relevance relevance:confidence \\\n", "0 3.0 0.6400 yes 0.640 \n", "1 NaN NaN no 1.000 \n", "2 NaN NaN no 1.000 \n", "3 NaN 0.0000 no 0.675 \n", "4 3.0 0.3257 yes 0.640 \n", "\n", " articleid date headline \\\n", "0 wsj_398217788 8/14/91 Yields on CDs Fell in the Latest Week \n", "1 wsj_399019502 8/21/07 The Morning Brief: White House Seeks to Limit ... \n", "2 wsj_398284048 11/14/91 Banking Bill Negotiators Set Compromise --- Pl... \n", "3 wsj_397959018 6/16/86 Manager's Journal: Sniffing Out Drug Abusers I... \n", "4 wsj_398838054 10/4/02 Currency Trading: Dollar Remains in Tight Rang... \n", "\n", " positivity_gold relevance_gold \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " text \n", "0 NEW YORK -- Yields on most certificates of dep... \n", "1 The Wall Street Journal Online</br></br>The Mo... \n", "2 WASHINGTON -- In an effort to achieve banking ... \n", "3 The statistics on the enormous costs of employ... \n", "4 NEW YORK -- Indecision marked the dollar's ton... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "id": "411e6c10-3f9a-4d6d-b1ec-da281bc6109f", "metadata": {}, "source": [ "Removing unnecessary columns." ] }, { "cell_type": "code", "execution_count": 6, "id": "6dfFHenMOHFQ", "metadata": { "id": "6dfFHenMOHFQ" }, "outputs": [], "source": [ "df = df[['headline', 'text', 'relevance']]\n", "\n", "# We drop all other features and keep only headline and text (plus the relevance label) for two reasons:\n", "# 1. The other features seem either irrelevant or lack documentation.\n", "# 2. With headline and text only, our final model will be more generalizable. We could in theory apply it to any article." ] }, { "cell_type": "markdown", "id": "50e304c5-f9ac-4ad2-b1b0-5ceac6fb6429", "metadata": {}, "source": [ "Balancing the dataset to 50% relevant and 50% not relevant." 
] }, { "cell_type": "code", "execution_count": 7, "id": "609b72fc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "relevance\n", "yes 1420\n", "no 1420\n", "Name: count, dtype: int64\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df_yes = df[df['relevance'] == 'yes']\n", "df_no = df[df['relevance'] == 'no']\n", "\n", "df_no_sampled = df_no.sample(n=len(df_yes), random_state=42)\n", "\n", "# Concatenate the sampled 'no' rows with all 'yes' rows\n", "df_balanced = pd.concat([df_yes, df_no_sampled])\n", "\n", "print(df_balanced['relevance'].value_counts())" ] }, { "cell_type": "code", "execution_count": 8, "id": "2dd88fd1", "metadata": {}, "outputs": [], "source": [ "df = df_balanced" ] }, { "cell_type": "markdown", "id": "jK-dXnQAOHFR", "metadata": { "id": "jK-dXnQAOHFR" }, "source": [ "## Cleaning Strings" ] }, { "cell_type": "markdown", "id": "ea70a0cf-f409-4c35-a215-c5585b06953f", "metadata": {}, "source": [ "Here we merge the headline and the full text into one string, which we then process as a whole." ] }, { "cell_type": "code", "execution_count": 9, "id": "hwsu0fkMOHFS", "metadata": { "id": "hwsu0fkMOHFS" }, "outputs": [], "source": [ "df['whole_txt'] = df['headline']+ ' ' + df['text']" ] }, { "cell_type": "code", "execution_count": 10, "id": "tGK5t8RyOHFT", "metadata": { "id": "tGK5t8RyOHFT" }, "outputs": [], "source": [ "wtxt_train = np.array(df['whole_txt'])" ] }, { "cell_type": "markdown", "id": "d43863c2-084d-47c6-ae75-62d2f9d46655", "metadata": {}, "source": [ "Removing special characters, numbers, etc." ] }, { "cell_type": "code", "execution_count": 11, "id": "8EX1KU3FOHFT", "metadata": { "id": "8EX1KU3FOHFT" }, "outputs": [], "source": [ "for i in range(len(wtxt_train)):\n", " # Remove the '</br>' tags from the 'whole_txt' column\n", " wtxt_train[i] = re.sub(r'</br>', ' ', wtxt_train[i])\n", " # Remove every character outside the Latin alphabet, including digits\n", " wtxt_train[i] = re.sub(r'[^a-zA-Z]', ' ', wtxt_train[i])\n", " # Remove single-letter words such as 'a'\n", " wtxt_train[i] = re.sub(r\"\\s+[a-zA-Z]\\s+\", ' ', wtxt_train[i])\n", " # Collapse repeated whitespace into a single space\n", " wtxt_train[i] = re.sub(r'\\s+', ' ', wtxt_train[i])\n", " # Convert to lower case\n", " wtxt_train[i] = wtxt_train[i].lower()" ] }, { "cell_type": "markdown", "id": "OGok8jBiOHFT", "metadata": { "id": "OGok8jBiOHFT" }, "source": [ "## Split the words.\n", "We split each string into a list of individual words (tokens)." ] }, { "cell_type": "code", "execution_count": 12, "id": "YLC_hbBPOHFT", "metadata": { "id": "YLC_hbBPOHFT" }, "outputs": [], "source": [ "for i in range(len(wtxt_train)):\n", " wtxt_train[i] = word_tokenize(wtxt_train[i])" ] }, { "cell_type": "markdown", "id": "nJcjnqLHOHFT", "metadata": { "id": "nJcjnqLHOHFT" }, "source": [ "## Removing stop words. \n", "We remove stop words such as: the, they, them, for. These words add nothing to an article's topic; they serve only a grammatical function in the language. Removing them also reduces the size of the data and therefore the computation required." 
] }, { "cell_type": "code", "execution_count": 13, "id": "7uoaH6INOHFT", "metadata": { "id": "7uoaH6INOHFT" }, "outputs": [], "source": [ "stop_words = set(stopwords.words('english'))\n", "\n", "for i in range(len(wtxt_train)):\n", " wtxt_train[i] = [word for word in wtxt_train[i] if word not in stop_words]" ] }, { "cell_type": "code", "execution_count": 14, "id": "74Ii4vxnOHFU", "metadata": { "id": "74Ii4vxnOHFU" }, "outputs": [ { "data": { "text/plain": [ "['yields',\n", " 'cds',\n", " 'fell',\n", " 'latest',\n", " 'week',\n", " 'new',\n", " 'york',\n", " 'yields',\n", " 'certificates',\n", " 'deposit',\n", " 'offered',\n", " 'major',\n", " 'banks',\n", " 'dropped',\n", " 'tenth',\n", " 'percentage',\n", " 'point',\n", " 'latest',\n", " 'week',\n", " 'reflecting',\n", " 'overall',\n", " 'decline',\n", " 'short',\n", " 'term',\n", " 'interest',\n", " 'rates',\n", " 'small',\n", " 'denomination',\n", " 'consumer',\n", " 'cds',\n", " 'sold',\n", " 'directly',\n", " 'banks',\n", " 'average',\n", " 'yield',\n", " 'six',\n", " 'month',\n", " 'deposits',\n", " 'fell',\n", " 'week',\n", " 'ended',\n", " 'yesterday',\n", " 'according',\n", " 'bank',\n", " 'survey',\n", " 'banxquote',\n", " 'money',\n", " 'markets',\n", " 'wilmington',\n", " 'del',\n", " 'information',\n", " 'service',\n", " 'three',\n", " 'month',\n", " 'consumer',\n", " 'deposits',\n", " 'average',\n", " 'yield',\n", " 'sank',\n", " 'week',\n", " 'according',\n", " 'banxquote',\n", " 'two',\n", " 'banks',\n", " 'banxquote',\n", " 'survey',\n", " 'citibank',\n", " 'new',\n", " 'york',\n", " 'corestates',\n", " 'pennsylvania',\n", " 'paying',\n", " 'less',\n", " 'threemonth',\n", " 'small',\n", " 'denomination',\n", " 'cds',\n", " 'declines',\n", " 'somewhat',\n", " 'smaller',\n", " 'five',\n", " 'year',\n", " 'consumer',\n", " 'cds',\n", " 'eased',\n", " 'banxquote',\n", " 'said',\n", " 'yields',\n", " 'three',\n", " 'month',\n", " 'six',\n", " 'month',\n", " 'treasury',\n", " 'bills',\n", " 'sold',\n", " 
'monday',\n", " 'auction',\n", " 'plummeted',\n", " 'fifth',\n", " 'percentage',\n", " 'point',\n", " 'previous',\n", " 'week',\n", " 'respectively']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wtxt_train[0]\n", "# stop_words" ] }, { "cell_type": "markdown", "id": "0Ih9nJCDOHFU", "metadata": { "id": "0Ih9nJCDOHFU" }, "source": [ "## Lemmatization\n", "Lemmatization reduces words with different inflectional endings to their base dictionary form (lemma); for example, 'yields' becomes 'yield'. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed, so verb forms such as 'fell' are left unchanged here." ] }, { "cell_type": "code", "execution_count": 15, "id": "tZq0soQFOHFU", "metadata": { "id": "tZq0soQFOHFU" }, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "for i in range(len(wtxt_train)):\n", " wtxt_train[i] = [lemmatizer.lemmatize(word) for word in wtxt_train[i]]" ] }, { "cell_type": "code", "execution_count": 16, "id": "yh7EcIK7OHFU", "metadata": { "id": "yh7EcIK7OHFU" }, "outputs": [], "source": [ "df['whole_txt'] = wtxt_train\n", "df = df.drop(['headline', 'text'], axis = 1)" ] }, { "cell_type": "code", "execution_count": 17, "id": "T1TLLZZhOHFU", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "T1TLLZZhOHFU", "outputId": "127d4d92-3e49-4449-8bff-a5ab1fa375cc" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>relevance</th><th>whole_txt</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>yes</td><td>[yield, cd, fell, latest, week, new, york, yie...</td></tr>\n",
" <tr><th>4</th><td>yes</td><td>[currency, trading, dollar, remains, tight, ra...</td></tr>\n",
" <tr><th>5</th><td>yes</td><td>[stock, fall, bofa, alcoa, slide, stock, decli...</td></tr>\n",
" <tr><th>9</th><td>yes</td><td>[u, dollar, fall, currency, decline, softened,...</td></tr>\n",
" <tr><th>12</th><td>yes</td><td>[defending, deflation, author, james, stewart,...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " relevance whole_txt\n", "0 yes [yield, cd, fell, latest, week, new, york, yie...\n", "4 yes [currency, trading, dollar, remains, tight, ra...\n", "5 yes [stock, fall, bofa, alcoa, slide, stock, decli...\n", "9 yes [u, dollar, fall, currency, decline, softened,...\n", "12 yes [defending, deflation, author, james, stewart,..." ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "id": "2a9ecc91", "metadata": { "id": "2a9ecc91" }, "source": [ "### Data preparation\n", "* Initial data processing: our first step is to encode the relevance label as 1 (relevant) or 0 (non-relevant). Later, we convert it into an np.array to feed into the model.\n", "* Then, we convert the cleaned text into padded sequences." ] }, { "cell_type": "code", "execution_count": 18, "id": "8d046d1e", "metadata": {}, "outputs": [], "source": [ "df.update(df[\"relevance\"].apply(lambda x: 0 if x == \"no\" else 1))" ] }, { "cell_type": "code", "execution_count": 19, "id": "c64fe00d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>relevance</th><th>whole_txt</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>[yield, cd, fell, latest, week, new, york, yie...</td></tr>\n",
" <tr><th>4</th><td>1</td><td>[currency, trading, dollar, remains, tight, ra...</td></tr>\n",
" <tr><th>5</th><td>1</td><td>[stock, fall, bofa, alcoa, slide, stock, decli...</td></tr>\n",
" <tr><th>9</th><td>1</td><td>[u, dollar, fall, currency, decline, softened,...</td></tr>\n",
" <tr><th>12</th><td>1</td><td>[defending, deflation, author, james, stewart,...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " relevance whole_txt\n", "0 1 [yield, cd, fell, latest, week, new, york, yie...\n", "4 1 [currency, trading, dollar, remains, tight, ra...\n", "5 1 [stock, fall, bofa, alcoa, slide, stock, decli...\n", "9 1 [u, dollar, fall, currency, decline, softened,...\n", "12 1 [defending, deflation, author, james, stewart,..." ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "id": "bc686094", "metadata": { "id": "bc686094" }, "source": [ "### Tokenization\n", "First, we need to \"tokenize\" our sentences, i.e., convert them to sequences of numbers. For this task, we are going to use the `Tokenizer` from Tensorflow (documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer))" ] }, { "cell_type": "code", "execution_count": 20, "id": "cf9eaddf", "metadata": { "id": "cf9eaddf" }, "outputs": [], "source": [ "tokenizer = Tokenizer()\n", "tokenizer.fit_on_texts(wtxt_train) # fit our tokenizer on the dataset (i.e., assign a number to each word and keep a\n", " # dictionary with the correspondence of each word to a number)\n", "\n", "# see the language dictionary and the total number of words (please note that number 0 is reserved for the padding task)\n", "word_index = tokenizer.word_index\n", "total_words = len(word_index) + 1" ] }, { "cell_type": "code", "execution_count": 21, "id": "1bb124ce", "metadata": { "id": "1bb124ce" }, "outputs": [ { "data": { "text/plain": [ "{'year': 1,\n", " 'rate': 2,\n", " 'market': 3,\n", " 'said': 4,\n", " 'stock': 5,\n", " 'price': 6,\n", " 'new': 7,\n", " 'economy': 8,\n", " 'economic': 9,\n", " 'month': 10,\n", " 'federal': 11,\n", " 'would': 12,\n", " 'interest': 13,\n", " 'percent': 14,\n", " 'last': 15,\n", " 'week': 16,\n", " 'inflation': 17,\n", " 'bank': 18,\n", " 'billion': 19,\n", " 'fed': 20,\n", " 'dollar': 21,\n", " 'bond': 22,\n", " 'point': 23,\n", " 'growth': 24,\n", " 'investor': 
25,\n", " 'one': 26,\n", " 'company': 27,\n", " 'million': 28,\n", " 'index': 29,\n", " 'since': 30,\n", " 'york': 31,\n", " 'quarter': 32,\n", " 'average': 33,\n", " 'first': 34,\n", " 'time': 35,\n", " 'tax': 36,\n", " 'increase': 37,\n", " 'reserve': 38,\n", " 'may': 39,\n", " 'government': 40,\n", " 'president': 41,\n", " 'report': 42,\n", " 'business': 43,\n", " 'day': 44,\n", " 'rose': 45,\n", " 'say': 46,\n", " 'consumer': 47,\n", " 'also': 48,\n", " 'yesterday': 49,\n", " 'two': 50,\n", " 'economist': 51,\n", " 'dow': 52,\n", " 'sale': 53,\n", " 'many': 54,\n", " 'job': 55,\n", " 'fund': 56,\n", " 'share': 57,\n", " 'could': 58,\n", " 'high': 59,\n", " 'gain': 60,\n", " 'higher': 61,\n", " 'trading': 62,\n", " 'cut': 63,\n", " 'deficit': 64,\n", " 'state': 65,\n", " 'analyst': 66,\n", " 'decline': 67,\n", " 'money': 68,\n", " 'spending': 69,\n", " 'fell': 70,\n", " 'mr': 71,\n", " 'recession': 72,\n", " 'financial': 73,\n", " 'term': 74,\n", " 'treasury': 75,\n", " 'policy': 76,\n", " 'even': 77,\n", " 'rise': 78,\n", " 'good': 79,\n", " 'industrial': 80,\n", " 'level': 81,\n", " 'unemployment': 82,\n", " 'much': 83,\n", " 'american': 84,\n", " 'department': 85,\n", " 'today': 86,\n", " 'trade': 87,\n", " 'low': 88,\n", " 'cent': 89,\n", " 'lower': 90,\n", " 'expected': 91,\n", " 'washington': 92,\n", " 'cost': 93,\n", " 'still': 94,\n", " 'plan': 95,\n", " 'budget': 96,\n", " 'jones': 97,\n", " 'three': 98,\n", " 'nation': 99,\n", " 'official': 100,\n", " 'house': 101,\n", " 'labor': 102,\n", " 'recent': 103,\n", " 'security': 104,\n", " 'home': 105,\n", " 'people': 106,\n", " 'next': 107,\n", " 'long': 108,\n", " 'exchange': 109,\n", " 'oil': 110,\n", " 'recovery': 111,\n", " 'make': 112,\n", " 'record': 113,\n", " 'news': 114,\n", " 'industry': 115,\n", " 'end': 116,\n", " 'investment': 117,\n", " 'back': 118,\n", " 'credit': 119,\n", " 'number': 120,\n", " 'second': 121,\n", " 'income': 122,\n", " 'profit': 123,\n", " 'past': 124,\n", " 'issue': 
125,\n", " 'chairman': 126,\n", " 'data': 127,\n", " 'strong': 128,\n", " 'board': 129,\n", " 'short': 130,\n", " 'big': 131,\n", " 'according': 132,\n", " 'le': 133,\n", " 'chief': 134,\n", " 'service': 135,\n", " 'friday': 136,\n", " 'administration': 137,\n", " 'major': 138,\n", " 'late': 139,\n", " 'like': 140,\n", " 'world': 141,\n", " 'reported': 142,\n", " 'yield': 143,\n", " 'drop': 144,\n", " 'currency': 145,\n", " 'well': 146,\n", " 'mortgage': 147,\n", " 'loan': 148,\n", " 'earlier': 149,\n", " 'group': 150,\n", " 'firm': 151,\n", " 'fall': 152,\n", " 'inc': 153,\n", " 'earnings': 154,\n", " 'street': 155,\n", " 'rising': 156,\n", " 'per': 157,\n", " 'loss': 158,\n", " 'country': 159,\n", " 'worker': 160,\n", " 'way': 161,\n", " 'work': 162,\n", " 'central': 163,\n", " 'another': 164,\n", " 'co': 165,\n", " 'third': 166,\n", " 'get': 167,\n", " 'move': 168,\n", " 'future': 169,\n", " 'early': 170,\n", " 'wall': 171,\n", " 'sign': 172,\n", " 'national': 173,\n", " 'bill': 174,\n", " 'program': 175,\n", " 'january': 176,\n", " 'figure': 177,\n", " 'congress': 178,\n", " 'bush': 179,\n", " 'part': 180,\n", " 'annual': 181,\n", " 'capital': 182,\n", " 'take': 183,\n", " 'see': 184,\n", " 'trader': 185,\n", " 'close': 186,\n", " 'likely': 187,\n", " 'change': 188,\n", " 'made': 189,\n", " 'june': 190,\n", " 'demand': 191,\n", " 'among': 192,\n", " 'housing': 193,\n", " 'half': 194,\n", " 'july': 195,\n", " 'pay': 196,\n", " 'nearly': 197,\n", " 'nasdaq': 198,\n", " 'little': 199,\n", " 'march': 200,\n", " 'system': 201,\n", " 'result': 202,\n", " 'debt': 203,\n", " 'come': 204,\n", " 'raise': 205,\n", " 'problem': 206,\n", " 'concern': 207,\n", " 'rally': 208,\n", " 'show': 209,\n", " 'four': 210,\n", " 'standard': 211,\n", " 'committee': 212,\n", " 'product': 213,\n", " 'small': 214,\n", " 'corp': 215,\n", " 'keep': 216,\n", " 'foreign': 217,\n", " 'forecast': 218,\n", " 'corporate': 219,\n", " 'ago': 220,\n", " 'continued': 221,\n", " 'u': 222,\n", " 
'april': 223,\n", " 'city': 224,\n", " 'yen': 225,\n", " 'real': 226,\n", " 'global': 227,\n", " 'greenspan': 228,\n", " 'meeting': 229,\n", " 'help': 230,\n", " 'however': 231,\n", " 'employment': 232,\n", " 'sector': 233,\n", " 'fear': 234,\n", " 'far': 235,\n", " 'value': 236,\n", " 'based': 237,\n", " 'despite': 238,\n", " 'ahead': 239,\n", " 'including': 240,\n", " 'poor': 241,\n", " 'hit': 242,\n", " 'executive': 243,\n", " 'several': 244,\n", " 'go': 245,\n", " 'might': 246,\n", " 'survey': 247,\n", " 'volume': 248,\n", " 'period': 249,\n", " 'current': 250,\n", " 'session': 251,\n", " 'benefit': 252,\n", " 'measure': 253,\n", " 'put': 254,\n", " 'order': 255,\n", " 'large': 256,\n", " 'pace': 257,\n", " 'public': 258,\n", " 'set': 259,\n", " 'added': 260,\n", " 'area': 261,\n", " 'japan': 262,\n", " 'compared': 263,\n", " 'think': 264,\n", " 'increased': 265,\n", " 'start': 266,\n", " 'going': 267,\n", " 'commerce': 268,\n", " 'biggest': 269,\n", " 'tuesday': 270,\n", " 'five': 271,\n", " 'better': 272,\n", " 'need': 273,\n", " 'fiscal': 274,\n", " 'pressure': 275,\n", " 'international': 276,\n", " 'continue': 277,\n", " 'white': 278,\n", " 'finance': 279,\n", " 'united': 280,\n", " 'thursday': 281,\n", " 'energy': 282,\n", " 'largest': 283,\n", " 'least': 284,\n", " 'six': 285,\n", " 'percentage': 286,\n", " 'revenue': 287,\n", " 'though': 288,\n", " 'general': 289,\n", " 'mark': 290,\n", " 'growing': 291,\n", " 'february': 292,\n", " 'sell': 293,\n", " 'sharply': 294,\n", " 'crisis': 295,\n", " 'august': 296,\n", " 'latest': 297,\n", " 'ended': 298,\n", " 'dropped': 299,\n", " 'monday': 300,\n", " 'buy': 301,\n", " 'came': 302,\n", " 'deal': 303,\n", " 'october': 304,\n", " 'best': 305,\n", " 'wednesday': 306,\n", " 'member': 307,\n", " 'already': 308,\n", " 'office': 309,\n", " 'senate': 310,\n", " 'previous': 311,\n", " 'slightly': 312,\n", " 'december': 313,\n", " 'enough': 314,\n", " 'november': 315,\n", " 'export': 316,\n", " 'technology': 317,\n", " 
'around': 318,\n", " 'risk': 319,\n", " 'manager': 320,\n", " 'began': 321,\n", " 'euro': 322,\n", " 'yet': 323,\n", " 'supply': 324,\n", " 'composite': 325,\n", " 'lost': 326,\n", " 'health': 327,\n", " 'whether': 328,\n", " 'weak': 329,\n", " 'top': 330,\n", " 'buying': 331,\n", " 'september': 332,\n", " 'boost': 333,\n", " 'clinton': 334,\n", " 'wage': 335,\n", " 'face': 336,\n", " 'outlook': 337,\n", " 'thing': 338,\n", " 'making': 339,\n", " 'right': 340,\n", " 'showed': 341,\n", " 'look': 342,\n", " 'falling': 343,\n", " 'monetary': 344,\n", " 'reagan': 345,\n", " 'estimate': 346,\n", " 'war': 347,\n", " 'closed': 348,\n", " 'activity': 349,\n", " 'force': 350,\n", " 'america': 351,\n", " 'lowest': 352,\n", " 'action': 353,\n", " 'expect': 354,\n", " 'declined': 355,\n", " 'although': 356,\n", " 'hour': 357,\n", " 'highest': 358,\n", " 'return': 359,\n", " 'gold': 360,\n", " 'effort': 361,\n", " 'private': 362,\n", " 'fourth': 363,\n", " 'management': 364,\n", " 'banking': 365,\n", " 'republican': 366,\n", " 'run': 367,\n", " 'total': 368,\n", " 'mean': 369,\n", " 'food': 370,\n", " 'effect': 371,\n", " 'maker': 372,\n", " 'import': 373,\n", " 'advance': 374,\n", " 'retail': 375,\n", " 'almost': 376,\n", " 'worry': 377,\n", " 'post': 378,\n", " 'indicator': 379,\n", " 'amount': 380,\n", " 'sharp': 381,\n", " 'key': 382,\n", " 'selling': 383,\n", " 'political': 384,\n", " 'expectation': 385,\n", " 'production': 386,\n", " 'without': 387,\n", " 'question': 388,\n", " 'want': 389,\n", " 'coming': 390,\n", " 'manufacturing': 391,\n", " 'hope': 392,\n", " 'released': 393,\n", " 'domestic': 394,\n", " 'note': 395,\n", " 'become': 396,\n", " 'confidence': 397,\n", " 'law': 398,\n", " 'every': 399,\n", " 'overall': 400,\n", " 'chip': 401,\n", " 'director': 402,\n", " 'computer': 403,\n", " 'auto': 404,\n", " 'soon': 405,\n", " 'vice': 406,\n", " 'slow': 407,\n", " 'call': 408,\n", " 'strength': 409,\n", " 'gained': 410,\n", " 'employee': 411,\n", " 'blue': 412,\n", " 
'support': 413,\n", " 'decade': 414,\n", " 'leader': 415,\n", " 'jobless': 416,\n", " 'cash': 417,\n", " 'give': 418,\n", " 'led': 419,\n", " 'view': 420,\n", " 'taking': 421,\n", " 'offer': 422,\n", " 'research': 423,\n", " 'university': 424,\n", " 'democrat': 425,\n", " 'open': 426,\n", " 'full': 427,\n", " 'old': 428,\n", " 'car': 429,\n", " 'surge': 430,\n", " 'called': 431,\n", " 'seen': 432,\n", " 'proposal': 433,\n", " 'insurance': 434,\n", " 'industrials': 435,\n", " 'line': 436,\n", " 'account': 437,\n", " 'power': 438,\n", " 'net': 439,\n", " 'amid': 440,\n", " 'p': 441,\n", " 'adjusted': 442,\n", " 'care': 443,\n", " 'reason': 444,\n", " 'turn': 445,\n", " 'near': 446,\n", " 'trend': 447,\n", " 'head': 448,\n", " 'retailer': 449,\n", " 'contract': 450,\n", " 'later': 451,\n", " 'europe': 452,\n", " 'asset': 453,\n", " 'lead': 454,\n", " 'monthly': 455,\n", " 'japanese': 456,\n", " 'longer': 457,\n", " 'remain': 458,\n", " 'senior': 459,\n", " 'slowdown': 460,\n", " 'lot': 461,\n", " 'saving': 462,\n", " 'control': 463,\n", " 'find': 464,\n", " 'hold': 465,\n", " 'option': 466,\n", " 'social': 467,\n", " 'purchase': 468,\n", " 'school': 469,\n", " 'expansion': 470,\n", " 'european': 471,\n", " 'county': 472,\n", " 'toward': 473,\n", " 'association': 474,\n", " 'decision': 475,\n", " 'left': 476,\n", " 'told': 477,\n", " 'mixed': 478,\n", " 'reduce': 479,\n", " 'china': 480,\n", " 'believe': 481,\n", " 'election': 482,\n", " 'slowing': 483,\n", " 'target': 484,\n", " 'output': 485,\n", " 'performance': 486,\n", " 'secretary': 487,\n", " 'bad': 488,\n", " 'evidence': 489,\n", " 'announced': 490,\n", " 'store': 491,\n", " 'gross': 492,\n", " 'unit': 493,\n", " 'important': 494,\n", " 'case': 495,\n", " 'producer': 496,\n", " 'must': 497,\n", " 'recently': 498,\n", " 'productivity': 499,\n", " 'john': 500,\n", " 'leading': 501,\n", " 'took': 502,\n", " 'helped': 503,\n", " 'claim': 504,\n", " 'held': 505,\n", " 'life': 506,\n", " 'family': 507,\n", " 
'council': 508,\n", " 'region': 509,\n", " 'following': 510,\n", " 'statement': 511,\n", " 'hand': 512,\n", " 'condition': 513,\n", " 'summer': 514,\n", " 'push': 515,\n", " 'campaign': 516,\n", " 'away': 517,\n", " 'congressional': 518,\n", " 'talk': 519,\n", " 'obama': 520,\n", " 'robert': 521,\n", " 'lending': 522,\n", " 'remains': 523,\n", " 'within': 524,\n", " 'meanwhile': 525,\n", " 'alan': 526,\n", " 'broad': 527,\n", " 'place': 528,\n", " 'begin': 529,\n", " 'looking': 530,\n", " 'agency': 531,\n", " 'grew': 532,\n", " 'jumped': 533,\n", " 'union': 534,\n", " 'impact': 535,\n", " 'district': 536,\n", " 'raising': 537,\n", " 'institution': 538,\n", " 'due': 539,\n", " 'payroll': 540,\n", " 'payment': 541,\n", " 'reduction': 542,\n", " 'banker': 543,\n", " 'former': 544,\n", " 'operation': 545,\n", " 'personal': 546,\n", " 'seven': 547,\n", " 'great': 548,\n", " 'factor': 549,\n", " 'individual': 550,\n", " 'local': 551,\n", " 'development': 552,\n", " 'steel': 553,\n", " 'posted': 554,\n", " 'mutual': 555,\n", " 'holding': 556,\n", " 'know': 557,\n", " 'study': 558,\n", " 'rule': 559,\n", " 'statistic': 560,\n", " 'rather': 561,\n", " 'generally': 562,\n", " 'democratic': 563,\n", " 'charge': 564,\n", " 'party': 565,\n", " 'signal': 566,\n", " 'prospect': 567,\n", " 'gdp': 568,\n", " 'conference': 569,\n", " 'estate': 570,\n", " 'climbed': 571,\n", " 'buyer': 572,\n", " 'found': 573,\n", " 'steady': 574,\n", " 'morgan': 575,\n", " 'hard': 576,\n", " 'fact': 577,\n", " 'reached': 578,\n", " 'modest': 579,\n", " 'across': 580,\n", " 'card': 581,\n", " 'others': 582,\n", " 'equity': 583,\n", " 'stimulus': 584,\n", " 'fixed': 585,\n", " 'black': 586,\n", " 'adviser': 587,\n", " 'getting': 588,\n", " 'course': 589,\n", " 'used': 590,\n", " 'mid': 591,\n", " 'followed': 592,\n", " 'raised': 593,\n", " 'instead': 594,\n", " 'revised': 595,\n", " 'often': 596,\n", " 'history': 597,\n", " 'especially': 598,\n", " 'jump': 599,\n", " 'beginning': 600,\n", " 
'manufacturer': 601,\n", " 'final': 602,\n", " 'probably': 603,\n", " 'predicted': 604,\n", " 'example': 605,\n", " 'chicago': 606,\n", " 'customer': 607,\n", " 'significant': 608,\n", " 'sold': 609,\n", " 'minute': 610,\n", " 'seems': 611,\n", " 'closing': 612,\n", " 'given': 613,\n", " 'traded': 614,\n", " 'step': 615,\n", " 'center': 616,\n", " 'building': 617,\n", " 'benchmark': 618,\n", " 'commodity': 619,\n", " 'german': 620,\n", " 'package': 621,\n", " 'gap': 622,\n", " 'improvement': 623,\n", " 'shift': 624,\n", " 'position': 625,\n", " 'construction': 626,\n", " 'maryland': 627,\n", " 'virginia': 628,\n", " 'range': 629,\n", " 'advanced': 630,\n", " 'weakness': 631,\n", " 'cutting': 632,\n", " 'meet': 633,\n", " 'turned': 634,\n", " 'use': 635,\n", " 'started': 636,\n", " 'factory': 637,\n", " 'defense': 638,\n", " 'comment': 639,\n", " 'rebound': 640,\n", " 'single': 641,\n", " 'airline': 642,\n", " 'agreement': 643,\n", " 'rest': 644,\n", " 'act': 645,\n", " 'straight': 646,\n", " 'agreed': 647,\n", " 'attack': 648,\n", " 'sept': 649,\n", " 'list': 650,\n", " 'noted': 651,\n", " 'west': 652,\n", " 'got': 653,\n", " 'quickly': 654,\n", " 'heavy': 655,\n", " 'trillion': 656,\n", " 'side': 657,\n", " 'bear': 658,\n", " 'borrowing': 659,\n", " 'inventory': 660,\n", " 'moving': 661,\n", " 'possible': 662,\n", " 'idea': 663,\n", " 'try': 664,\n", " 'grow': 665,\n", " 'taken': 666,\n", " 'slump': 667,\n", " 'crash': 668,\n", " 'ford': 669,\n", " 'unchanged': 670,\n", " 'press': 671,\n", " 'strategist': 672,\n", " 'provide': 673,\n", " 'really': 674,\n", " 'clear': 675,\n", " 'uncertainty': 676,\n", " 'something': 677,\n", " 'proposed': 678,\n", " 'cause': 679,\n", " 'crude': 680,\n", " 'journal': 681,\n", " 'known': 682,\n", " 'continuing': 683,\n", " 'fee': 684,\n", " 'middle': 685,\n", " 'afternoon': 686,\n", " 'holiday': 687,\n", " 'initial': 688,\n", " 'bernanke': 689,\n", " 'finished': 690,\n", " 'hurt': 691,\n", " 'financing': 692,\n", " 'morning': 
693,\n", " 'strategy': 694,\n", " 'largely': 695,\n", " 'ever': 696,\n", " 'officer': 697,\n", " 'worst': 698,\n", " 'pushed': 699,\n", " 'moved': 700,\n", " 'household': 701,\n", " 'light': 702,\n", " 'along': 703,\n", " 'plant': 704,\n", " 'c': 705,\n", " 'commission': 706,\n", " 'oct': 707,\n", " 'huge': 708,\n", " 'moderate': 709,\n", " 'motor': 710,\n", " 'utility': 711,\n", " 'carter': 712,\n", " 'stronger': 713,\n", " 'slower': 714,\n", " 'warned': 715,\n", " 'living': 716,\n", " 'employer': 717,\n", " 'whose': 718,\n", " 'declining': 719,\n", " 'discount': 720,\n", " 'aid': 721,\n", " 'behind': 722,\n", " 'increasing': 723,\n", " 'downturn': 724,\n", " 'break': 725,\n", " 'showing': 726,\n", " 'project': 727,\n", " 'bit': 728,\n", " 'relatively': 729,\n", " 'commercial': 730,\n", " 'expects': 731,\n", " 'drug': 732,\n", " 'suggests': 733,\n", " 'upward': 734,\n", " 'continues': 735,\n", " 'boom': 736,\n", " 'working': 737,\n", " 'dividend': 738,\n", " 'author': 739,\n", " 'easing': 740,\n", " 'fuel': 741,\n", " 'remained': 742,\n", " 'information': 743,\n", " 'changed': 744,\n", " 'eight': 745,\n", " 'balance': 746,\n", " 'positive': 747,\n", " 'sent': 748,\n", " 'saying': 749,\n", " 'available': 750,\n", " 'particularly': 751,\n", " 'certain': 752,\n", " 'tech': 753,\n", " 'basis': 754,\n", " 'seasonally': 755,\n", " 'saw': 756,\n", " 'class': 757,\n", " 'dealer': 758,\n", " 'portfolio': 759,\n", " 'different': 760,\n", " 'faster': 761,\n", " 'reading': 762,\n", " 'book': 763,\n", " 'focus': 764,\n", " 'response': 765,\n", " 'potential': 766,\n", " 'add': 767,\n", " 'equipment': 768,\n", " 'greater': 769,\n", " 'governor': 770,\n", " 'vote': 771,\n", " 'smaller': 772,\n", " 'issued': 773,\n", " 'additional': 774,\n", " 'event': 775,\n", " 'peak': 776,\n", " 'thought': 777,\n", " 'regulator': 778,\n", " 'jan': 779,\n", " 'slide': 780,\n", " 'matter': 781,\n", " 'paul': 782,\n", " 'seem': 783,\n", " 'never': 784,\n", " 'lender': 785,\n", " 'fallen': 786,\n", 
" 'david': 787,\n", " 'chance': 788,\n", " 'weekly': 789,\n", " 'legislation': 790,\n", " 'free': 791,\n", " 'climb': 792,\n", " 'needed': 793,\n", " 'retirement': 794,\n", " 'california': 795,\n", " 'trust': 796,\n", " 'related': 797,\n", " 'estimated': 798,\n", " 'analysis': 799,\n", " 'rail': 800,\n", " 'reform': 801,\n", " 'appears': 802,\n", " 'larger': 803,\n", " 'bureau': 804,\n", " 'active': 805,\n", " 'economics': 806,\n", " 'drive': 807,\n", " 'college': 808,\n", " 'kind': 809,\n", " 'worse': 810,\n", " 'corporation': 811,\n", " 'speech': 812,\n", " 'suggest': 813,\n", " 'offering': 814,\n", " 'ground': 815,\n", " 'deposit': 816,\n", " 'bring': 817,\n", " 'consecutive': 818,\n", " 'running': 819,\n", " 'surplus': 820,\n", " 'george': 821,\n", " 'serious': 822,\n", " 'core': 823,\n", " 'limit': 824,\n", " 'bet': 825,\n", " 'soared': 826,\n", " 'barrel': 827,\n", " 'went': 828,\n", " 'improved': 829,\n", " 'ease': 830,\n", " 'sluggish': 831,\n", " 'bull': 832,\n", " 'regional': 833,\n", " 'source': 834,\n", " 'debate': 835,\n", " 'trying': 836,\n", " 'inflationary': 837,\n", " 'direction': 838,\n", " 'broader': 839,\n", " 'mostly': 840,\n", " 'attention': 841,\n", " 'prime': 842,\n", " 'able': 843,\n", " 'william': 844,\n", " 'expert': 845,\n", " 'caused': 846,\n", " 'process': 847,\n", " 'weekend': 848,\n", " 'paid': 849,\n", " 'indicated': 850,\n", " 'review': 851,\n", " 'hiring': 852,\n", " 'pushing': 853,\n", " 'wholesale': 854,\n", " 'gas': 855,\n", " 'aug': 856,\n", " 'flat': 857,\n", " 'stay': 858,\n", " 'double': 859,\n", " 'similar': 860,\n", " 'attempt': 861,\n", " 'paper': 862,\n", " 'common': 863,\n", " 'risen': 864,\n", " 'presidential': 865,\n", " 'whole': 866,\n", " 'partner': 867,\n", " 'associated': 868,\n", " 'community': 869,\n", " 'let': 870,\n", " 'sen': 871,\n", " 'release': 872,\n", " 'difficult': 873,\n", " 'indeed': 874,\n", " 'worth': 875,\n", " 'received': 876,\n", " 'broker': 877,\n", " 'word': 878,\n", " 'widely': 879,\n", " 
'anticipated': 880,\n", " 'robust': 881,\n", " 'machine': 882,\n", " 'answer': 883,\n", " 'spring': 884,\n", " 'warning': 885,\n", " 'temporary': 886,\n", " 'organization': 887,\n", " 'asia': 888,\n", " 'suggested': 889,\n", " 'negative': 890,\n", " 'germany': 891,\n", " 'gave': 892,\n", " 'main': 893,\n", " 'season': 894,\n", " 'track': 895,\n", " 'opportunity': 896,\n", " 'student': 897,\n", " 'conservative': 898,\n", " 'plunge': 899,\n", " 'game': 900,\n", " 'name': 901,\n", " 'ending': 902,\n", " 'include': 903,\n", " 'seemed': 904,\n", " 'brokerage': 905,\n", " 'sure': 906,\n", " 'bottom': 907,\n", " 'page': 908,\n", " 'tokyo': 909,\n", " 'asked': 910,\n", " 'dec': 911,\n", " 'finally': 912,\n", " 'adding': 913,\n", " 'th': 914,\n", " 'correction': 915,\n", " 'trouble': 916,\n", " 'weaker': 917,\n", " 'failed': 918,\n", " 'feel': 919,\n", " 'fight': 920,\n", " 'series': 921,\n", " 'cap': 922,\n", " 'sense': 923,\n", " 'surged': 924,\n", " 'seek': 925,\n", " 'closely': 926,\n", " 'nine': 927,\n", " 'always': 928,\n", " 'check': 929,\n", " 'special': 930,\n", " 'ap': 931,\n", " 'offered': 932,\n", " 'doubt': 933,\n", " 'fast': 934,\n", " 'canada': 935,\n", " 'done': 936,\n", " 'south': 937,\n", " 'situation': 938,\n", " 'wide': 939,\n", " 'men': 940,\n", " 'appeared': 941,\n", " 'keeping': 942,\n", " 'rapidly': 943,\n", " 'tell': 944,\n", " 'difference': 945,\n", " 'texas': 946,\n", " 'imf': 947,\n", " 'quarterly': 948,\n", " 'layoff': 949,\n", " 'picture': 950,\n", " 'san': 951,\n", " 'easy': 952,\n", " 'bigger': 953,\n", " 'brother': 954,\n", " 'taxpayer': 955,\n", " 'woman': 956,\n", " 'either': 957,\n", " 'actually': 958,\n", " 'beyond': 959,\n", " 'using': 960,\n", " 'spokesman': 961,\n", " 'giving': 962,\n", " 'consider': 963,\n", " 'addition': 964,\n", " 'forecaster': 965,\n", " 'volatility': 966,\n", " 'losing': 967,\n", " 'item': 968,\n", " 'bid': 969,\n", " 'nothing': 970,\n", " 'kept': 971,\n", " 'increasingly': 972,\n", " 'boston': 973,\n", " 
'client': 974,\n", " 'slowed': 975,\n", " 'passed': 976,\n", " 'projected': 977,\n", " 'optimism': 978,\n", " 'role': 979,\n", " 'edged': 980,\n", " 'boosted': 981,\n", " 'usually': 982,\n", " 'consensus': 983,\n", " 'richard': 984,\n", " 'volcker': 985,\n", " 'interview': 986,\n", " 'voter': 987,\n", " 'military': 988,\n", " 'competition': 989,\n", " 'spend': 990,\n", " 'indicate': 991,\n", " 'possibility': 992,\n", " 'gasoline': 993,\n", " 'tumbled': 994,\n", " 'appear': 995,\n", " 'finish': 996,\n", " 'institute': 997,\n", " 'approved': 998,\n", " 'accounting': 999,\n", " 'asian': 1000,\n", " ...}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_index" ] }, { "cell_type": "code", "execution_count": 22, "id": "d8bc0b15", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d8bc0b15", "outputId": "bc0c8ab2-9e8f-470d-a17e-89b96cd00afd" }, "outputs": [ { "data": { "text/plain": [ "20936" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "total_words" ] }, { "cell_type": "markdown", "id": "016c30d7", "metadata": { "id": "016c30d7" }, "source": [ "### Padding Sequences\n", "Sentences and sequences tend to have different lengths, but our model expects equally sized observations.\n", "Here we convert our texts to integer sequences and make them all the same length (by default, the length of the longest sequence). We use `pad_sequences` from TensorFlow (documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences)), which by default prepends zeros to each tokenized text until all sequences reach the same length."
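, "\n",
"\n",
"A minimal sketch of this behaviour, assuming a recent TensorFlow 2.x in which `pad_sequences` can be imported from `tensorflow.keras.utils`; the toy sequences are made up for illustration and are not taken from our dataset:\n",
"\n",
"```python\n",
"from tensorflow.keras.utils import pad_sequences\n",
"\n",
"# Hypothetical tokenized texts of different lengths\n",
"toy_sequences = [[5, 8], [3], [7, 2, 9, 4]]\n",
"\n",
"# Default: zeros are added at the front ('pre') up to the longest sequence\n",
"print(pad_sequences(toy_sequences))\n",
"# [[0 0 5 8]\n",
"#  [0 0 0 3]\n",
"#  [7 2 9 4]]\n",
"\n",
"# padding='post' appends the zeros instead\n",
"print(pad_sequences(toy_sequences, padding='post'))\n",
"\n",
"# maxlen caps the length; longer sequences lose their first tokens by default\n",
"print(pad_sequences(toy_sequences, maxlen=3))\n",
"```\n",
"\n",
"We keep the defaults in this notebook, which is why every row of `padded_sequences` starts with zeros and has the length of the longest text."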
] }, { "cell_type": "code", "execution_count": 23, "id": "ea518a6d", "metadata": { "id": "ea518a6d" }, "outputs": [], "source": [ "sequences = tokenizer.texts_to_sequences(wtxt_train)\n", "padded_sequences = pad_sequences(sequences)" ] }, { "cell_type": "code", "execution_count": 24, "id": "6a3a19b1", "metadata": { "id": "6a3a19b1" }, "outputs": [ { "data": { "text/plain": [ "[143,\n", " 2582,\n", " 70,\n", " 297,\n", " 16,\n", " 7,\n", " 31,\n", " 143,\n", " 2405,\n", " 816,\n", " 932,\n", " 138,\n", " 18,\n", " 299,\n", " 2041,\n", " 286,\n", " 23,\n", " 297,\n", " 16,\n", " 1055,\n", " 400,\n", " 67,\n", " 130,\n", " 74,\n", " 13,\n", " 2,\n", " 214,\n", " 7581,\n", " 47,\n", " 2582,\n", " 609,\n", " 1698,\n", " 18,\n", " 33,\n", " 143,\n", " 285,\n", " 10,\n", " 816,\n", " 70,\n", " 16,\n", " 298,\n", " 49,\n", " 132,\n", " 18,\n", " 247,\n", " 6514,\n", " 68,\n", " 3,\n", " 12016,\n", " 3663,\n", " 743,\n", " 135,\n", " 98,\n", " 10,\n", " 47,\n", " 816,\n", " 33,\n", " 143,\n", " 1832,\n", " 16,\n", " 132,\n", " 6514,\n", " 50,\n", " 18,\n", " 6514,\n", " 247,\n", " 3307,\n", " 7,\n", " 31,\n", " 12017,\n", " 3308,\n", " 1042,\n", " 133,\n", " 12018,\n", " 214,\n", " 7581,\n", " 2582,\n", " 67,\n", " 1020,\n", " 772,\n", " 271,\n", " 1,\n", " 47,\n", " 2582,\n", " 1165,\n", " 6514,\n", " 4,\n", " 143,\n", " 98,\n", " 10,\n", " 285,\n", " 10,\n", " 75,\n", " 174,\n", " 609,\n", " 300,\n", " 1056,\n", " 2042,\n", " 1110,\n", " 286,\n", " 23,\n", " 311,\n", " 16,\n", " 2406]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequences[0]" ] }, { "cell_type": "code", "execution_count": 25, "id": "bf5a3374", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bf5a3374", "outputId": "99253544-bbca-48ca-e1d5-b86689138957" }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 0, 0, ..., 311, 16, 2406],\n", " [ 0, 0, 0, ..., 239, 232, 42],\n", " [ 0, 0, 0, ..., 325, 326, 23],\n", " ...,\n", " [ 0, 0, 
0, ..., 203, 4375, 59],\n", " [ 0, 0, 0, ..., 159, 9, 169],\n", " [ 0, 0, 0, ..., 12015, 7108, 7444]])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "padded_sequences" ] }, { "cell_type": "code", "execution_count": 26, "id": "89dd0de2", "metadata": {}, "outputs": [], "source": [ "df['pad_seq'] = padded_sequences.tolist()" ] }, { "cell_type": "code", "execution_count": 27, "id": "4f444fcc", "metadata": {}, "outputs": [ { "data": {
"text/plain": [ " relevance pad_seq\n", "0 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "4 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "5 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "9 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "12 1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "... ... ...\n", "7810 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "677 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "4794 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "5869 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "2977 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n", "\n", "[2840 rows x 2 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(['whole_txt'], axis = 1)" ] }, { "cell_type": "markdown", "id": "f45a3397-5ac9-411b-ad79-1142ae01e578", "metadata": {}, "source": [ "We end up with the padded sequences and the binary-encoded relevance label." ] }, { "cell_type": "code", "execution_count": 28, "id": "561a6b0d", "metadata": {}, "outputs": [ { "data": {
"text/plain": [ " relevance whole_txt \\\n", "0 1 [yield, cd, fell, latest, week, new, york, yie... \n", "4 1 [currency, trading, dollar, remains, tight, ra... \n", "5 1 [stock, fall, bofa, alcoa, slide, stock, decli... \n", "9 1 [u, dollar, fall, currency, decline, softened,... \n", "12 1 [defending, deflation, author, james, stewart,... \n", "\n", " pad_seq \n", "0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "5 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "9 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "12 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "id": "83826301", "metadata": {}, "source": [ "### Train-Test Split" ] }, { "cell_type": "markdown", "id": "6ed341c3-6983-4f48-ae74-fa16944a8e93", "metadata": {}, "source": [ "Here we perform the train-test split, using the padded sequences as X and the relevance label as y. The data is split 80% / 20%. The split shuffles the rows with a fixed random state, so that it is reproducible and the relevant and non-relevant cases end up mixed roughly equally in each part. The training portion is then split again, 80% / 20%, into the actual training set and a validation set. We finish with three sets: train, validation and test. The sizes of each array are given below. The y arrays are converted to NumPy arrays and cast to integers, because the `relevance` column is stored with dtype `object`, which TensorFlow generally cannot convert into a tensor for the binary cross-entropy loss." ] }, { "cell_type": "code", "execution_count": 29, "id": "f00e30da", "metadata": {}, "outputs": [], "source": [ "X = padded_sequences\n", "y = df['relevance']" ] }, { "cell_type": "code", "execution_count": 30, "id": "fB_vM2GQkf-Y", "metadata": { "id": "fB_vM2GQkf-Y" }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 31, "id": "4e0850ba", "metadata": {}, "outputs": [], "source": [ "X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 32, "id": "19ab3a73", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 0, 0, ..., 61, 362, 66],\n", " [ 0, 0, 0, ..., 4569, 6281, 1252],\n", " [ 0, 0, 0, ..., 3415, 95, 4],\n", " ...,\n", " [ 0, 0, 0, ..., 479, 96, 64],\n", " [ 0, 0, 0, ..., 27, 246, 609],\n", " [ 0, 0, 0, ..., 10, 391, 24]])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train" ] }, { "cell_type": "code", "execution_count": 33, "id": "d51f8bcb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1817, 404)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 34, "id": "5f8059d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6965 1\n", "2156 0\n", "1103 0\n", "7486 1\n", "5865 1\n", " ..\n", "2245 1\n", "1956 1\n", "3711 0\n", "506 1\n", "3821 1\n", "Name: relevance, Length: 1817, dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "code", "execution_count": 35, "id": "128bd1f8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1817,)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "y_train.shape" ] }, { "cell_type": "code", "execution_count": 36, "id": "adde4bc3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(455,)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_val.shape" ] }, { "cell_type": "code", "execution_count": 37, "id": "aa0a8eba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(568,)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "616e3170-d789-4ac1-a12d-fea71a5dcab4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 38, "id": "0f0eab78", "metadata": {}, "outputs": [], "source": [ "y_train = np.array(y_train)\n", "y_val = np.array(y_val)\n", "y_test = np.array(y_test)" ] }, { "cell_type": "code", "execution_count": 39, "id": "35c078ca", "metadata": {}, "outputs": [], "source": [ "y_train = y_train.astype('int')\n", "y_val = y_val.astype('int')\n", "y_test = y_test.astype('int')" ] }, { "cell_type": "markdown", "id": "0fe8e8b2", "metadata": { "id": "0fe8e8b2" }, "source": [ "### Building the model\n", "\n", "We are going to build multiple models that include:\n", "- An `Embedding` layer that represents each word as a vector of dimension 100, 200 or 300\n", "- An `LSTM` layer with a hidden state of 100 units, though this number varies from model to model\n", "- A `Dense` output layer that maps the output of the LSTM to a single value. It uses a sigmoid activation, which yields a value between 0 and 1 that can be read as the probability that the article is relevant.\n", "- `Dropout`, a layer that randomly sets a given fraction of its inputs to zero during training, a good option for limiting overfitting.\n", "- `Bidirectional`, a wrapper that runs the LSTM over the sequence in both directions, so that it captures both past and future context."
] }, { "cell_type": "markdown", "id": "f36d316c-3966-4939-b735-79dd2d1ff936", "metadata": {}, "source": [ "### Early Stopping\n", "Early stopping lets us halt training once the validation loss stops improving, in order to avoid overfitting. Training stops when the validation loss has failed to improve for 3 consecutive epochs (`patience=3`), and the weights from the epoch with the best validation loss are then restored (`restore_best_weights=True`)." ] }, { "cell_type": "code", "execution_count": 40, "id": "02feafd7-481d-4930-8680-4a873789e53d", "metadata": {}, "outputs": [], "source": [ "early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)" ] }, { "cell_type": "markdown", "id": "1101ccfc", "metadata": { "id": "1101ccfc" }, "source": [ "# Training the models" ] }, { "cell_type": "markdown", "id": "f64f5dc1", "metadata": {}, "source": [ "## Model Building\n" ] }, { "cell_type": "markdown", "id": "41b902f6", "metadata": {}, "source": [ "Regarding model building, we first tried to create a solid model through in-depth hyperparameter tuning, but our results were always roughly the same: only slightly better than a trivial baseline. We were obtaining a training accuracy of 1.0 and a test accuracy of at most 0.83. These results may seem satisfying at first, but our dataset was imbalanced: 82% of the articles in our y variable were “economically irrelevant” and only 18% were relevant, so a model that always predicts “irrelevant” would already reach an accuracy of about 0.82. Our first intuition was to use 100, 200 or 300 words for the final model in training and to keep a spreadsheet tracking the model parameters, training parameters and results. \n", "\n", "We then decided to employ regularization methods to reduce overfitting. We also plotted the loss at each epoch by storing the training history in a variable and plotting it once training was done. These steps gave us the insight we needed to identify the deeper problem: the imbalance of the y variable in our original dataset. 
Thus, we chose to balance our dataset by keeping all of the relevant articles (only 18% of the original dataset) and using an equal number of irrelevant ones. Our final dataset therefore consists of 50% irrelevant articles and 50% relevant ones. We then re-ran all our code and models on this new dataset and our results improved drastically, as you can see below.\n", "\n", "Other models were run in separate files using different configurations; the approach described in this document worked best for our task. The configurations include:\n", "* The original 80 - 20 % Full Text modeling (Worst Performing)\n", "* The 50 - 50 % Full Text modeling (Best Performing, DESCRIBED HERE)\n", "* The 50 - 50 % Headlines Only modeling (Worse Performance by around ~ 10%)\n", "* The 50 - 50 % Headlines + First N Words of Text modeling (Slightly Worse Performance, Highly Dependent on the Value of N)\n", "* The 50 - 50 % Full Text 'word2vec' modeling (Depending on model, slightly better or worse)\n", "\n", "We decided to leave the word2vec variant out because of computational complications: its results were not much better and often slightly worse, training took longer, and overall it greatly increased the model complexity, which we wanted to avoid.\n", "\n", "All files can be found on GitHub; some of them are undocumented and in a 'dirtier' format.\n", "\n", "https://github.com/Majon911/EconNewsMLIdent" ] }, { "cell_type": "markdown", "id": "55429b96", "metadata": {}, "source": [ "### MODEL 1 (The base model)" ] }, { "cell_type": "markdown", "id": "92e8cb02", "metadata": {}, "source": [ "* Our base model uses an LSTM as its foundation, since LSTMs are well suited to sequence data such as text.\n", "* On the output side, we take only the final output of the LSTM (`return_sequences=False`) and use the dense layer to reduce it to a single prediction; hence the output dimension of '1'. 
Finally, the model was compiled with the Adam optimizer; as training progressed, the model tended to overfit.\n", "\n", "- With a validation accuracy of about 56% against a training accuracy of about 99%, this model overfits heavily." ] }, { "cell_type": "code", "execution_count": 41, "id": "ed54e838-0fc1-4c04-bc93-08c58aa4804c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\keras\\src\\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.\n", "\n", "WARNING:tensorflow:From C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\keras\\src\\optimizers\\__init__.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.\n", "\n" ] } ], "source": [ "# We are going to build our model with the Sequential API\n", "model = Sequential()\n", "model.add(Embedding(total_words, # vocabulary size (number of distinct tokens)\n", " 100, # dimension of each word vector\n", " input_length=len(padded_sequences[0]))) # padded length of each observation\n", "model.add(LSTM(100, return_sequences=False))\n", "model.add(Dense(1, activation='sigmoid')) # single sigmoid unit for binary classification\n", "\n", "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 42, "id": "db0965d6-b675-4106-82bc-e28b1f13e9d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model(padded_sequences)" ] }, { "cell_type": "code", "execution_count": 43, "id": "d96f15a2-949e-4342-9550-6454ea7429a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", 
"=================================================================\n", " embedding (Embedding) (None, 404, 100) 2093600 \n", " \n", " lstm (LSTM) (None, 100) 80400 \n", " \n", " dense (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2174101 (8.29 MB)\n", "Trainable params: 2174101 (8.29 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model.summary()" ] }, { "cell_type": "code", "execution_count": 44, "id": "826df8ad", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "826df8ad", "outputId": "954999b9-e6f3-43a2-dc81-a5c10a8e6ef1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/10\n", "WARNING:tensorflow:From C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\keras\\src\\utils\\tf_utils.py:492: The name tf.ragged.RaggedTensorValue is deprecated. Please use tf.compat.v1.ragged.RaggedTensorValue instead.\n", "\n", "WARNING:tensorflow:From C:\\Users\\majon\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\keras\\src\\engine\\base_layer_utils.py:384: The name tf.executing_eagerly_outside_functions is deprecated. 
Please use tf.compat.v1.executing_eagerly_outside_functions instead.\n", "\n", "57/57 [==============================] - 13s 195ms/step - loss: 0.6880 - accuracy: 0.5498 - val_loss: 0.6759 - val_accuracy: 0.6484\n", "Epoch 2/10\n", "57/57 [==============================] - 9s 162ms/step - loss: 0.5424 - accuracy: 0.7870 - val_loss: 0.6902 - val_accuracy: 0.5978\n", "Epoch 3/10\n", "57/57 [==============================] - 9s 163ms/step - loss: 0.2517 - accuracy: 0.9086 - val_loss: 0.9096 - val_accuracy: 0.6066\n", "Epoch 4/10\n", "57/57 [==============================] - 10s 172ms/step - loss: 0.0673 - accuracy: 0.9818 - val_loss: 1.2419 - val_accuracy: 0.5824\n", "Epoch 5/10\n", "57/57 [==============================] - 10s 172ms/step - loss: 0.0200 - accuracy: 0.9961 - val_loss: 1.4963 - val_accuracy: 0.5670\n", "Epoch 6/10\n", "57/57 [==============================] - 9s 152ms/step - loss: 0.0251 - accuracy: 0.9950 - val_loss: 1.3650 - val_accuracy: 0.5604\n", "Epoch 7/10\n", "57/57 [==============================] - 9s 159ms/step - loss: 0.0163 - accuracy: 0.9950 - val_loss: 1.8727 - val_accuracy: 0.5868\n", "Epoch 8/10\n", "57/57 [==============================] - 9s 158ms/step - loss: 0.0078 - accuracy: 0.9978 - val_loss: 1.6632 - val_accuracy: 0.5736\n", "Epoch 9/10\n", "57/57 [==============================] - 10s 169ms/step - loss: 0.0045 - accuracy: 0.9989 - val_loss: 2.0763 - val_accuracy: 0.5978\n", "Epoch 10/10\n", "57/57 [==============================] - 10s 178ms/step - loss: 0.0041 - accuracy: 0.9989 - val_loss: 1.6814 - val_accuracy: 0.5692\n" ] } ], "source": [ "hist = model.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val))" ] }, { "cell_type": "code", "execution_count": 46, "id": "286d9981", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'loss': [0.6879979968070984,\n", " 0.5423904061317444,\n", " 0.25165560841560364,\n", " 0.0672721192240715,\n", " 0.01998290978372097,\n", " 
0.02507256343960762,\n", " 0.01631668023765087,\n", " 0.007824108004570007,\n", " 0.004535376559942961,\n", " 0.004056017845869064],\n", " 'accuracy': [0.5498073697090149,\n", " 0.7870115637779236,\n", " 0.9086406230926514,\n", " 0.9818381667137146,\n", " 0.9961475133895874,\n", " 0.9950467944145203,\n", " 0.9950467944145203,\n", " 0.9977985620498657,\n", " 0.9988992810249329,\n", " 0.9988992810249329],\n", " 'val_loss': [0.675902247428894,\n", " 0.6902332901954651,\n", " 0.9096015691757202,\n", " 1.2418750524520874,\n", " 1.4963206052780151,\n", " 1.3650264739990234,\n", " 1.8727422952651978,\n", " 1.663246750831604,\n", " 2.076326608657837,\n", " 1.6813592910766602],\n", " 'val_accuracy': [0.6483516693115234,\n", " 0.5978022217750549,\n", " 0.6065934300422668,\n", " 0.5824176073074341,\n", " 0.5670329928398132,\n", " 0.5604395866394043,\n", " 0.58681321144104,\n", " 0.5736263990402222,\n", " 0.5978022217750549,\n", " 0.5692307949066162]}" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hist.history" ] }, { "cell_type": "code", "execution_count": 47, "id": "b2af0489", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist.history['loss'])\n", "plt.plot(hist.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 111, "id": "6ea199d8-f2c3-4288-a9e3-23d0b7ca3ec9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 58ms/step - loss: 1.6814 - accuracy: 0.5692\n" ] } ], "source": [ "loss, accuracy = model.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "2bfff715-714f-40a8-9a43-4b7bba724575", "metadata": {}, "source": [ "As we can see the model using Adam optimizer, vastly overfits as we can see on the graph, the loss line drops dramatically to nealy 0 just after the first 3 epochs. That had to be changed because with such an agressive rate, we will often overfit and the scores on the validation set were not much better." ] }, { "cell_type": "markdown", "id": "781be31e", "metadata": {}, "source": [ "### Model 1 Testing" ] }, { "cell_type": "markdown", "id": "d772ed0c-9a19-44c8-b920-c64118782de0", "metadata": {}, "source": [ "For comparision reasons, we decided to run a test using this model, the test set accuracy is given below and does not stand away from the validation accuracy. The threshold is set to 0.5, that means >0.5 means a positive case, below means a negative case." 
] }, { "cell_type": "code", "execution_count": 267, "id": "76c6df0e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/18 [==============================] - 1s 64ms/step - loss: 1.6087 - accuracy: 0.5757\n" ] } ], "source": [ "loss, accuracy = model.evaluate(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 268, "id": "f3526a9c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/18 [==============================] - 1s 56ms/step\n" ] } ], "source": [ "#Prection and Confusion Matrix\n", "y_pred = model.predict(X_test)\n", "bin_y_pred = (y_pred > 0.5).astype(int)" ] }, { "cell_type": "code", "execution_count": 269, "id": "062c9c2c", "metadata": {}, "outputs": [], "source": [ "bin_y_pred = np.squeeze(bin_y_pred)" ] }, { "cell_type": "code", "execution_count": 270, "id": "23edf0b7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,\n", " 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,\n", " 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,\n", " 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,\n", " 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,\n", " 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1,\n", " 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,\n", " 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,\n", " 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,\n", " 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,\n", " 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,\n", " 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,\n", " 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,\n", " 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,\n", " 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 
1, 0, 1, 0, 1, 0, 1,\n", " 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,\n", " 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,\n", " 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0,\n", " 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,\n", " 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,\n", " 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,\n", " 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0,\n", " 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,\n", " 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,\n", " 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0,\n", " 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1])" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bin_y_pred" ] }, { "cell_type": "code", "execution_count": 271, "id": "e345456a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Predicted No Predicted Yes \n", "Actual No 152 130 \n", "Actual Yes 111 175 \n", "\n", "Precision: 0.5738\n", "Recall: 0.6119\n", "Accuracy: 0.5757\n" ] } ], "source": [ "y_true = y_test\n", "y_pred = bin_y_pred\n", "\n", "cm = confusion_matrix(y_true, y_pred)\n", "\n", "TN, FP, FN, TP = cm.ravel()\n", "\n", "print(f\"{'':<20}{'Predicted No':<20}{'Predicted Yes':<20}\")\n", "print(f\"{'Actual No':<20}{TN:<20}{FP:<20}\")\n", "print(f\"{'Actual Yes':<20}{FN:<20}{TP:<20}\")\n", "\n", "print(\"\\nPrecision:\", round(TP/(TP + FP), 4))\n", "print(\"Recall:\", round(TP/(TP + FN), 4))\n", "print(\"Accuracy:\", round((TP+TN)/(TP + TN + FP + FN), 4))" ] }, { "cell_type": "markdown", "id": "04a80b4c-9237-499c-b40a-7fdfb4af6aba", "metadata": {}, "source": [ "The base model reaches a precision of 0.57 and a recall of 0.61 on the test set; the confusion matrix above gives the full breakdown." 
] }, { "cell_type": "markdown", "id": "f5437a9f", "metadata": {}, "source": [ "### MODEL 2" ] }, { "cell_type": "markdown", "id": "cef564db", "metadata": {}, "source": [ "* This model adds a Dropout layer to reduce memorization of the training data and therefore overfitting. A rate of 0.2 randomly zeroes about 20% of the LSTM layer's outputs during training before they reach the Dense layer. This model showed a clear improvement, with about 66% validation accuracy." ] }, { "cell_type": "code", "execution_count": 124, "id": "e790349c", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model2 = Sequential()\n", "\n", "model2.add(Embedding(total_words, # number of words to process as input\n", " 50, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model2.add(LSTM(50, return_sequences=False))\n", "\n", "model2.add(Dropout(0.2))\n", "\n", "model2.add(Dense(1, activation='sigmoid')) \n", "\n", "model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 125, "id": "a4fded32", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_12\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_12 (Embedding) (None, 404, 50) 1046800 \n", " \n", " lstm_13 (LSTM) (None, 50) 20200 \n", " \n", " dropout_8 (Dropout) (None, 50) 0 \n", " \n", " dense_12 (Dense) (None, 1) 51 \n", " \n", "=================================================================\n", "Total params: 1067051 (4.07 MB)\n", "Trainable params: 1067051 (4.07 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ 
"model2.summary()" ] }, { "cell_type": "code", "execution_count": 126, "id": "2cc28f7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/10\n", "57/57 [==============================] - 11s 151ms/step - loss: 0.6919 - accuracy: 0.5311 - val_loss: 0.6797 - val_accuracy: 0.6659\n", "Epoch 2/10\n", "57/57 [==============================] - 7s 123ms/step - loss: 0.5895 - accuracy: 0.7369 - val_loss: 0.6224 - val_accuracy: 0.6593\n", "Epoch 3/10\n", "57/57 [==============================] - 7s 125ms/step - loss: 0.3292 - accuracy: 0.8690 - val_loss: 0.8046 - val_accuracy: 0.5978\n", "Epoch 4/10\n", "57/57 [==============================] - 7s 127ms/step - loss: 0.1286 - accuracy: 0.9576 - val_loss: 1.1220 - val_accuracy: 0.5824\n", "Epoch 5/10\n", "57/57 [==============================] - 7s 126ms/step - loss: 0.0530 - accuracy: 0.9862 - val_loss: 1.2956 - val_accuracy: 0.5670\n" ] } ], "source": [ "hist2 = model2.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 127, "id": "4efbb897-0571-4ccc-a81f-c9e34aff2f71", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist2.history['loss'])\n", "plt.plot(hist2.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 128, "id": "a59b099a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 42ms/step - loss: 0.6224 - accuracy: 0.6593\n" ] } ], "source": [ "loss, accuracy = model2.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "81711f2f-c8ac-47ed-9e98-3e0085a94d40", "metadata": {}, "source": [ "Thanks to the dropout layer, we reduced the effects of overfitting, and increased our accuracy by a good 5% on validation data. In here we also reduced the number of nodes in the LSTM layer to 50 to try to reduce the effects of overfitting." ] }, { "cell_type": "markdown", "id": "83c8e04a", "metadata": {}, "source": [ "### MODEL 3" ] }, { "cell_type": "markdown", "id": "875b245c", "metadata": {}, "source": [ "* Model 3, along with the following models begins to explore the optimization of parameters, given the increase of output representation and a reduction of nodes at the dropout layer. 
Accuracy: 62%" ] }, { "cell_type": "code", "execution_count": 113, "id": "cce897ae", "metadata": {}, "outputs": [], "source": [ "model3 = Sequential()\n", "\n", "model3.add(Embedding(total_words, # number of words to process as input\n", " 200, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model3.add(LSTM(200, return_sequences=False))\n", "\n", "model3.add(Dropout(0.2))\n", "\n", "model3.add(Dense(1, activation='sigmoid')) \n", "\n", "model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 114, "id": "8a14b1ec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_11\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_11 (Embedding) (None, 404, 200) 4187200 \n", " \n", " lstm_12 (LSTM) (None, 200) 320800 \n", " \n", " dropout_7 (Dropout) (None, 200) 0 \n", " \n", " dense_11 (Dense) (None, 1) 201 \n", " \n", "=================================================================\n", "Total params: 4508201 (17.20 MB)\n", "Trainable params: 4508201 (17.20 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model3.summary()" ] }, { "cell_type": "code", "execution_count": 115, "id": "6963b196", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/10\n", "57/57 [==============================] - 34s 566ms/step - loss: 0.6986 - accuracy: 0.5608 - val_loss: 0.6696 - val_accuracy: 0.6220\n", "Epoch 2/10\n", "57/57 [==============================] - 36s 629ms/step - loss: 0.5054 - accuracy: 0.7947 - val_loss: 0.6990 - val_accuracy: 0.5978\n", "Epoch 3/10\n", "57/57 [==============================] - 34s 606ms/step - 
loss: 0.1986 - accuracy: 0.9290 - val_loss: 0.9651 - val_accuracy: 0.6110\n", "Epoch 4/10\n", "57/57 [==============================] - 36s 635ms/step - loss: 0.0652 - accuracy: 0.9829 - val_loss: 1.4469 - val_accuracy: 0.6022\n" ] } ], "source": [ "hist3 = model3.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 116, "id": "a49a0850", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist3.history['loss'])\n", "plt.plot(hist3.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 117, "id": "e94caaa5-f907-4bf3-afe9-a45be3bebeef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 2s 152ms/step - loss: 0.6696 - accuracy: 0.6220\n" ] } ], "source": [ "loss, accuracy = model3.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "d7a321ac-1bfd-4742-8f43-03ffbf2200ae", "metadata": {}, "source": [ "The increase of the LSTM nodes and the output representation to 200, proved to be a step in the wrong direction, with a 2-3% lower accuracy then model 2. " ] }, { "cell_type": "markdown", "id": "1c8c7755", "metadata": {}, "source": [ "### MODEL 4" ] }, { "cell_type": "markdown", "id": "d765a669", "metadata": {}, "source": [ "* Model's 4 change of model compile modifies the type of optimizer to 'sgd' since adam might have given a high learning rate to the model. Accuracy: 49%. The stochastic gradient descent proved to be not a good fit for our model." 
] }, { "cell_type": "code", "execution_count": 73, "id": "38ed8871", "metadata": {}, "outputs": [], "source": [ "model4 = Sequential()\n", "\n", "model4.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model4.add(LSTM(100, return_sequences=False))\n", "\n", "model4.add(Dense(1, activation='sigmoid')) \n", "\n", "model4.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 74, "id": "d97c51d9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_5\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_5 (Embedding) (None, 404, 100) 2093600 \n", " \n", " lstm_5 (LSTM) (None, 100) 80400 \n", " \n", " dense_5 (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2174101 (8.29 MB)\n", "Trainable params: 2174101 (8.29 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model4.summary()" ] }, { "cell_type": "code", "execution_count": 75, "id": "7fba69cf", "metadata": { "id": "7fba69cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "57/57 [==============================] - 10s 148ms/step - loss: 0.6932 - accuracy: 0.5003 - val_loss: 0.6930 - val_accuracy: 0.4967\n", "Epoch 2/5\n", "57/57 [==============================] - 8s 139ms/step - loss: 0.6932 - accuracy: 0.5019 - val_loss: 0.6930 - val_accuracy: 0.4879\n", "Epoch 3/5\n", "57/57 [==============================] - 8s 137ms/step - loss: 0.6931 - accuracy: 0.5047 - val_loss: 0.6930 - val_accuracy: 0.4901\n", "Epoch 4/5\n", 
"57/57 [==============================] - 8s 134ms/step - loss: 0.6931 - accuracy: 0.5030 - val_loss: 0.6930 - val_accuracy: 0.4857\n" ] } ], "source": [ "hist4 = model4.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 76, "id": "50cd2cc6-4408-4033-bc1d-60b13e15b535", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist4.history['loss'])\n", "plt.plot(hist4.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 118, "id": "d00f224d-205d-4f66-a2e7-27d82a0a9748", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 72ms/step - loss: 0.6930 - accuracy: 0.4967\n" ] } ], "source": [ "loss, accuracy = model4.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "fcf52a58-f735-4808-887c-09cad201429e", "metadata": {}, "source": [ "### MODEL 5 (Top Performer)" ] }, { "cell_type": "markdown", "id": "a90301e0", "metadata": {}, "source": [ " * Model 5, deemed to as the Steroid Model, is significantly the best model with a validation `accuracy of 71.9%`. This model is very particular due to its transfer of output dimensions that are back checked from interaction between multiple layers in the model. There are 2 bidirectoinal layers that is filtered through a dropout layer and returned for further refinement. Overall, the data outputs traverse the dropout layer 3 times, which give room to find the keywords perhaps more efficiently than other models.\n", " * Different values of the output representation, LSTM layer nodes, learning rates and dropout values have been tested, using different optimizers. Adamax with the default learning rate of 0.001, has been found the best performing.\n", " * 100 in the LSTM layer has been found to be the most efficent.\n", " * This method also dramatically overfits reaching the train accuracy of 90 at 4th or 5th epoch, but has been found to still capture the most information, out of all tested models.\n", " * The dropout rate has been set to 0.2, since more strickter dropout rates did not improve the performance.\n", " * Bidirectional Layers proved to be better performing then one way ones." 
] }, { "cell_type": "code", "execution_count": 78, "id": "278f2233-97c0-4479-b19a-5ce5fe5f3398", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 79, "id": "20646d43-0589-42c9-a753-c66d1252d1a9", "metadata": {}, "outputs": [], "source": [ "model5 = Sequential()\n", "\n", "model5.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " mask_zero = True,\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model5.add(Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)))\n", "\n", "model5.add(Dropout(0.2)) \n", "\n", "model5.add(Bidirectional(tf.keras.layers.LSTM(100, return_sequences=False)))\n", "#model5.add(LSTM(100, return_sequences=False))\n", "\n", "model5.add(Dropout(0.2)) \n", "\n", "model5.add(Dense(1, activation='sigmoid')) \n", "\n", "model5.compile(optimizer=adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 80, "id": "3ff04abe-608a-4139-8dc7-c2605feac14b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_6\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_6 (Embedding) (None, 404, 100) 2093600 \n", " \n", " bidirectional (Bidirection (None, 404, 200) 160800 \n", " al) \n", " \n", " dropout_1 (Dropout) (None, 404, 200) 0 \n", " \n", " bidirectional_1 (Bidirecti (None, 200) 240800 \n", " onal) \n", " \n", " dropout_2 (Dropout) (None, 200) 0 \n", " \n", " dense_6 (Dense) (None, 1) 201 \n", " \n", "=================================================================\n", "Total params: 2495401 (9.52 MB)\n", "Trainable params: 2495401 (9.52 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", 
"_________________________________________________________________\n" ] } ], "source": [ "model5.summary()" ] }, { "cell_type": "code", "execution_count": 81, "id": "ac57617c-6963-4d24-8134-b48f7a849880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/10\n", "57/57 [==============================] - 47s 638ms/step - loss: 0.6855 - accuracy: 0.5394 - val_loss: 0.6210 - val_accuracy: 0.6901\n", "Epoch 2/10\n", "57/57 [==============================] - 35s 609ms/step - loss: 0.5429 - accuracy: 0.7386 - val_loss: 0.5775 - val_accuracy: 0.7187\n", "Epoch 3/10\n", "57/57 [==============================] - 35s 611ms/step - loss: 0.3732 - accuracy: 0.8465 - val_loss: 0.7202 - val_accuracy: 0.6967\n", "Epoch 4/10\n", "57/57 [==============================] - 34s 592ms/step - loss: 0.1863 - accuracy: 0.9329 - val_loss: 0.7666 - val_accuracy: 0.6571\n", "Epoch 5/10\n", "57/57 [==============================] - 33s 574ms/step - loss: 0.0759 - accuracy: 0.9758 - val_loss: 1.0978 - val_accuracy: 0.6615\n" ] } ], "source": [ "hist5 = model5.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 82, "id": "2baeb4c7-428f-4c70-8130-cd73eb71441f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist5.history['loss'])\n", "plt.plot(hist5.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 119, "id": "5d004f3b-e42c-478f-a440-c4500e03ce01", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 3s 206ms/step - loss: 0.5775 - accuracy: 0.7187\n" ] } ], "source": [ "loss, accuracy = model5.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "b1f521ed-bd72-48a6-826b-7aa8878b3aef", "metadata": {}, "source": [ "We can see that the loss graph is way more streched out, reducing the previous effects we had using the normal adam optimizer and no dropout. We have less of an overfitting situation. " ] }, { "cell_type": "markdown", "id": "8b222fa6-1e58-4032-a518-f7918a7f982d", "metadata": {}, "source": [ "### MODEL 6" ] }, { "cell_type": "markdown", "id": "c75866f3", "metadata": {}, "source": [ "* Model 6 eplores the modification to improve the hyperparameters of models with output dmension of 50, and and regularized learning rate of adamW optimizer at 0.001.\n", "* Accuracy: 60%\n", "* AdamW is a weight decay technique that penalizes large weigths, in order to prevent the effects of overfitting.\n", "* The 50 nodes in LSTM and output representation have been restored in this case, since they performed the best." 
] }, { "cell_type": "code", "execution_count": 86, "id": "99ea7547-edbd-4c32-9f3e-bc26c4c213c4", "metadata": {}, "outputs": [], "source": [ "adamw_optimizer = tfa.optimizers.AdamW(learning_rate=1e-2, weight_decay=1e-4)" ] }, { "cell_type": "code", "execution_count": 87, "id": "c7263348-b682-482e-aac9-e53812a070cc", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model6 = Sequential()\n", "\n", "model6.add(Embedding(total_words, # number of words to process as input\n", " 50, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model6.add(LSTM(50, return_sequences=False))\n", "\n", "model6.add(Dropout(0.2))\n", "\n", "model6.add(Dense(1, activation='sigmoid')) \n", "\n", "model6.compile(optimizer= adamw_optimizer, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 88, "id": "ab7aa008-9d0a-458c-be27-aef409a2fdf9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_7\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_7 (Embedding) (None, 404, 50) 1046800 \n", " \n", " lstm_8 (LSTM) (None, 50) 20200 \n", " \n", " dropout_3 (Dropout) (None, 50) 0 \n", " \n", " dense_7 (Dense) (None, 1) 51 \n", " \n", "=================================================================\n", "Total params: 1067051 (4.07 MB)\n", "Trainable params: 1067051 (4.07 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model6.summary()" ] }, { "cell_type": "code", "execution_count": 89, "id": "8b373df0-6b5a-409c-a7a2-265c5dc23c79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/10\n", "57/57 
[==============================] - 9s 124ms/step - loss: 0.6816 - accuracy: 0.5740 - val_loss: 0.6954 - val_accuracy: 0.6044\n", "Epoch 2/10\n", "57/57 [==============================] - 6s 114ms/step - loss: 0.3534 - accuracy: 0.8558 - val_loss: 0.9911 - val_accuracy: 0.5495\n", "Epoch 3/10\n", "57/57 [==============================] - 7s 115ms/step - loss: 0.1327 - accuracy: 0.9466 - val_loss: 1.2453 - val_accuracy: 0.5758\n", "Epoch 4/10\n", "57/57 [==============================] - 7s 117ms/step - loss: 0.0496 - accuracy: 0.9851 - val_loss: 1.3434 - val_accuracy: 0.5604\n" ] } ], "source": [ "hist6 = model6.fit(X_train, y_train, epochs=10, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 90, "id": "ecddc65e-96ac-4fed-85e6-d061ffe486a4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist6.history['loss'])\n", "plt.plot(hist6.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 120, "id": "6dae0a71-8ed3-4136-a030-6f434c8adef8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 47ms/step - loss: 0.6954 - accuracy: 0.6044\n" ] } ], "source": [ "loss, accuracy = model6.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "1251849b-40b8-4adf-8512-84cfe33a8f09", "metadata": {}, "source": [ "The effects of overfitting are not as well handled as in model 5, as seen in the graph above." ] }, { "cell_type": "markdown", "id": "6805c04a-2abc-4b8f-b73c-af4bbb1d3f72", "metadata": {}, "source": [ "### MODEL 7" ] }, { "cell_type": "markdown", "id": "d362d5f2", "metadata": {}, "source": [ "* This model once again explores the further regularization of the adamax optimizer at 0.001\n", "* Accuracy: 67%\n", "* The thought is simple, we want to see how the Adamax optimizer performs on a simpler model.\n", "* The 100 in LSTM and output representation / embedding layers proved to work better then 50 for this optimizer.\n", "* Adamax is a much less agressive optimizer with adaptive learning rates, an adress made to the flaws of the original Adam optimizer. 
" ] }, { "cell_type": "code", "execution_count": 92, "id": "be5753f6-c52e-4de6-98b6-9e3228169e5b", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 93, "id": "cf7ebd27-9ff3-4a27-b5fd-96a33f79208f", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model7 = Sequential()\n", "\n", "model7.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model7.add(LSTM(100, return_sequences=False))\n", "#model7.add(Bidirectional(LSTM(100, return_sequences=False)))\n", "\n", "model7.add(Dropout(0.2))\n", "\n", "model7.add(Dense(1, activation='sigmoid')) \n", "\n", "model7.compile(optimizer= adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 94, "id": "96ec9753-bc8c-4a28-9210-d09f4159c992", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_8\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_8 (Embedding) (None, 404, 100) 2093600 \n", " \n", " lstm_9 (LSTM) (None, 100) 80400 \n", " \n", " dropout_4 (Dropout) (None, 100) 0 \n", " \n", " dense_8 (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2174101 (8.29 MB)\n", "Trainable params: 2174101 (8.29 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model7.summary()" ] }, { "cell_type": "code", "execution_count": 95, "id": "a0a474cc-4b69-4df7-836f-8231259a266d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"Epoch 1/50\n", "57/57 [==============================] - 12s 173ms/step - loss: 0.6927 - accuracy: 0.5058 - val_loss: 0.6913 - val_accuracy: 0.5692\n", "Epoch 2/50\n", "57/57 [==============================] - 10s 171ms/step - loss: 0.6547 - accuracy: 0.6979 - val_loss: 0.6336 - val_accuracy: 0.6615\n", "Epoch 3/50\n", "57/57 [==============================] - 10s 178ms/step - loss: 0.5636 - accuracy: 0.7160 - val_loss: 0.6363 - val_accuracy: 0.6703\n", "Epoch 4/50\n", "57/57 [==============================] - 10s 183ms/step - loss: 0.4867 - accuracy: 0.7666 - val_loss: 0.6562 - val_accuracy: 0.6637\n", "Epoch 5/50\n", "57/57 [==============================] - 10s 181ms/step - loss: 0.4207 - accuracy: 0.8162 - val_loss: 0.7302 - val_accuracy: 0.6747\n" ] } ], "source": [ "hist7 = model7.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 96, "id": "b92acd72-bd15-4ab2-9b45-fd26d1e69f58", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist7.history['loss'])\n", "plt.plot(hist7.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 121, "id": "4b9325fa-0d81-438e-beb8-6cf2df3bfbf7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 64ms/step - loss: 0.6336 - accuracy: 0.6615\n" ] } ], "source": [ "loss, accuracy = model7.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "2b4d31a1-eff3-4f03-a64b-34abc1cf06e1", "metadata": {}, "source": [ "This model performs slightly better than Adam for this exact configuration." ] }, { "cell_type": "markdown", "id": "da307f59-522c-48a0-9c07-cfbcf532a312", "metadata": {}, "source": [ "### MODEL 8" ] }, { "cell_type": "markdown", "id": "b073333d-9cce-43b6-8857-1c2337c6dce1", "metadata": {}, "source": [ "* In this model we test yet another optimizer, this time Nadam.\n", "* Nadam combines two optimizers: Nesterov Accelerated Gradient (NAG), which incorporates momentum, and Adam. 
NAG updates the parameters using a combination of the current gradient and a fraction of the previous update.\n", "* Accuracy: 65.71%" ] }, { "cell_type": "code", "execution_count": 98, "id": "19b03983-4797-42a3-9ace-dd1351b0768f", "metadata": {}, "outputs": [], "source": [ "nadam_opt = Nadam(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 99, "id": "ed2d7187-12c4-42b5-b2a6-b5c18a96a650", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model8 = Sequential()\n", "\n", "model8.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model8.add(LSTM(100, return_sequences=False))\n", "\n", "model8.add(Dropout(0.2))\n", "\n", "model8.add(Dense(1, activation='sigmoid')) \n", "\n", "model8.compile(optimizer= nadam_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 100, "id": "1530cfa2-dedd-4983-b068-54be6e7e04f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_9\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_9 (Embedding) (None, 404, 100) 2093600 \n", " \n", " lstm_10 (LSTM) (None, 100) 80400 \n", " \n", " dropout_5 (Dropout) (None, 100) 0 \n", " \n", " dense_9 (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2174101 (8.29 MB)\n", "Trainable params: 2174101 (8.29 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model8.summary()" ] }, { "cell_type": "code", "execution_count": 101, "id": "0ac3e281-fadd-4907-b52d-54c0c1788f01", "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/20\n", "57/57 [==============================] - 13s 194ms/step - loss: 0.6985 - accuracy: 0.5515 - val_loss: 0.6736 - val_accuracy: 0.6571\n", "Epoch 2/20\n", "57/57 [==============================] - 10s 176ms/step - loss: 0.5126 - accuracy: 0.7760 - val_loss: 0.6824 - val_accuracy: 0.6352\n", "Epoch 3/20\n", "57/57 [==============================] - 10s 177ms/step - loss: 0.2061 - accuracy: 0.9241 - val_loss: 1.0434 - val_accuracy: 0.6242\n", "Epoch 4/20\n", "57/57 [==============================] - 10s 177ms/step - loss: 0.0726 - accuracy: 0.9774 - val_loss: 1.2087 - val_accuracy: 0.5868\n" ] } ], "source": [ "hist8 = model8.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 102, "id": "74c0b3dd-f79f-4978-96a9-b0538c53a717", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist8.history['loss'])\n", "plt.plot(hist8.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 122, "id": "ca96c236-0eb9-4d8b-b6be-acd41710172e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 65ms/step - loss: 0.6736 - accuracy: 0.6571\n" ] } ], "source": [ "loss, accuracy = model8.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "17441e14-be4e-4ae8-b74e-986a738b662b", "metadata": {}, "source": [ "Its performance does not stand out from that of the standard Adam optimizer." ] }, { "cell_type": "markdown", "id": "bbd38bd3-1816-4bfd-9971-c2084acbfb78", "metadata": {}, "source": [ "### MODEL 9" ] }, { "cell_type": "markdown", "id": "f8fbeb9a-5d6f-4348-816c-f5b2aab82d7e", "metadata": {}, "source": [ "* In model 9 we keep the simplest best-performing model so far (Adamax, learning rate 0.001) and introduce a Bidirectional layer.\n", "* Accuracy: 68.57%" ] }, { "cell_type": "code", "execution_count": 104, "id": "77c4ad03-e817-4836-abd3-4c8e99982d0d", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 105, "id": "b2fffec0-09d2-455c-8e65-2252a39bf7de", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model9 = Sequential()\n", "\n", "model9.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "#model9.add(LSTM(100, return_sequences=False))\n", "model9.add(Bidirectional(LSTM(100, return_sequences=False)))\n", "\n", "model9.add(Dropout(0.2))\n", "\n", "model9.add(Dense(1, activation='sigmoid')) \n", "\n", "model9.compile(optimizer= adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { 
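"cell_type": "markdown", "id": "lstm-param-count-note", "metadata": {}, "source": [ "As a rough sanity check on the layer sizes, the trainable parameter count of an LSTM layer can be worked out by hand. This is only a sketch: the symbols $h$ (LSTM units), $d$ (input dimension) and $P$ (parameter count) are notation introduced here, and it assumes the Keras default of one bias vector per gate, with $d = 100$ coming from the embedding output.\n", "\n", "$$P_{\\text{LSTM}} = 4\\,\\bigl(h\\,(h + d) + h\\bigr)$$\n", "\n", "For $h = 100$ and $d = 100$ this gives $4 \\times (100 \\times 200 + 100) = 80400$. Wrapping the layer in `Bidirectional` trains one forward and one backward copy, so the count should double to $2 \\times 80400 = 160800$ and the output width should become $2h = 200$. Under these assumptions, a summary row reporting 80400 parameters with output shape `(None, 100)` would correspond to a plain `LSTM(100)` rather than the wrapped layer." ] }, { 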
"cell_type": "code", "execution_count": 106, "id": "51b3556b-c2ac-46e9-ab8d-4076fa560ca2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_9\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_9 (Embedding) (None, 404, 100) 2093600 \n", " \n", " lstm_10 (LSTM) (None, 100) 80400 \n", " \n", " dropout_5 (Dropout) (None, 100) 0 \n", " \n", " dense_9 (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2174101 (8.29 MB)\n", "Trainable params: 2174101 (8.29 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model9.summary()" ] }, { "cell_type": "code", "execution_count": 107, "id": "5a46365d-29e9-4e02-b0de-b9d190c82c67", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/50\n", "57/57 [==============================] - 16s 223ms/step - loss: 0.6931 - accuracy: 0.5146 - val_loss: 0.6911 - val_accuracy: 0.5780\n", "Epoch 2/50\n", "57/57 [==============================] - 12s 208ms/step - loss: 0.6441 - accuracy: 0.6725 - val_loss: 0.6356 - val_accuracy: 0.6835\n", "Epoch 3/50\n", "57/57 [==============================] - 12s 209ms/step - loss: 0.5578 - accuracy: 0.7397 - val_loss: 0.6029 - val_accuracy: 0.6857\n", "Epoch 4/50\n", "57/57 [==============================] - 13s 222ms/step - loss: 0.4707 - accuracy: 0.7942 - val_loss: 0.6156 - val_accuracy: 0.6835\n", "Epoch 5/50\n", "57/57 [==============================] - 12s 220ms/step - loss: 0.3854 - accuracy: 0.8514 - val_loss: 0.7752 - val_accuracy: 0.6659\n", "Epoch 6/50\n", "57/57 [==============================] - 12s 218ms/step - loss: 0.2752 - accuracy: 0.9125 - val_loss: 0.7378 - val_accuracy: 0.6725\n" ] } ], 
"source": [ "hist9 = model9.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 108, "id": "d3162d4b-d971-4f23-beb9-ebcb1ba2b607", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist9.history['loss'])\n", "plt.plot(hist9.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 123, "id": "860bbb10-745f-45b6-862e-a6bbe850443c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 87ms/step - loss: 0.6029 - accuracy: 0.6857\n" ] } ], "source": [ "loss, accuracy = model9.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "6105aeed-b853-47a3-9857-99142c9bc72b", "metadata": {}, "source": [ "We can see an improvement in model performance after introducing the Bidirectional layer." ] }, { "cell_type": "markdown", "id": "f330eb61-22ce-4ac4-b3ba-962a59b53351", "metadata": {}, "source": [ "### MODEL 10 " ] }, { "cell_type": "markdown", "id": "05e4a711-13bd-43ad-9de1-c7ee40f56bda", "metadata": {}, "source": [ "* Here we try a variation of the best model so far, using two stacked Bidirectional LSTM layers with 50 units each.\n", "* The performance drops slightly, so we will remain with the previous version.\n", "* Accuracy: 67.91%" ] }, { "cell_type": "code", "execution_count": 134, "id": "a6f401c3-f308-4219-bbe9-36d2cdab426e", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 135, "id": "ca037ebf-af78-4dc5-9497-227daf0c9c00", "metadata": {}, "outputs": [], "source": [ "model10 = Sequential()\n", "\n", "model10.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " mask_zero = True,\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model10.add(Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True)))\n", "\n", "model10.add(Dropout(0.2)) \n", "\n", "model10.add(Bidirectional(tf.keras.layers.LSTM(50, return_sequences=False)))\n", "#model5.add(LSTM(100, return_sequences=False))\n", "\n", 
"model10.add(Dropout(0.2)) \n", "\n", "model10.add(Dense(1, activation='sigmoid')) \n", "\n", "model10.compile(optimizer=adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 136, "id": "222857a2-9cad-4e8f-8c54-5aae530cc241", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_15\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_15 (Embedding) (None, 404, 100) 2093600 \n", " \n", " bidirectional_7 (Bidirecti (None, 404, 100) 60400 \n", " onal) \n", " \n", " dropout_13 (Dropout) (None, 404, 100) 0 \n", " \n", " bidirectional_8 (Bidirecti (None, 100) 60400 \n", " onal) \n", " \n", " dropout_14 (Dropout) (None, 100) 0 \n", " \n", " dense_15 (Dense) (None, 1) 101 \n", " \n", "=================================================================\n", "Total params: 2214501 (8.45 MB)\n", "Trainable params: 2214501 (8.45 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model10.summary()" ] }, { "cell_type": "code", "execution_count": 137, "id": "1b9ab75b-1b2f-4307-a480-5914056fd864", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/20\n", "57/57 [==============================] - 41s 490ms/step - loss: 0.6856 - accuracy: 0.6098 - val_loss: 0.6623 - val_accuracy: 0.6659\n", "Epoch 2/20\n", "57/57 [==============================] - 23s 403ms/step - loss: 0.5616 - accuracy: 0.7226 - val_loss: 0.5920 - val_accuracy: 0.6791\n", "Epoch 3/20\n", "57/57 [==============================] - 23s 403ms/step - loss: 0.4203 - accuracy: 0.8123 - val_loss: 0.6416 - val_accuracy: 0.6813\n", "Epoch 4/20\n", "57/57 [==============================] - 23s 402ms/step - loss: 0.2511 - accuracy: 0.9114 - val_loss: 
0.7972 - val_accuracy: 0.6440\n", "Epoch 5/20\n", "57/57 [==============================] - 23s 411ms/step - loss: 0.1336 - accuracy: 0.9582 - val_loss: 0.9629 - val_accuracy: 0.6220\n" ] } ], "source": [ "hist10 = model10.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 138, "id": "3b2cc649-d0e0-408d-b594-c030fac2af59", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist10.history['loss'])\n", "plt.plot(hist10.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 139, "id": "ab2fd31f-b014-4c2a-9243-d45b1b5f8f02", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 2s 101ms/step - loss: 0.5920 - accuracy: 0.6791\n" ] } ], "source": [ "loss, accuracy = model10.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "9114b6f5-b663-4f07-be6c-c211c7ded06b", "metadata": {}, "source": [ "### MODEL 11" ] }, { "cell_type": "markdown", "id": "e0cdc305-cd75-4df6-8c82-99187fc24210", "metadata": {}, "source": [ "* Another variation of model 5; here we try to address the overfitting problem by reducing the learning rate.\n", "* Accuracy: 71%" ] }, { "cell_type": "code", "execution_count": 149, "id": "5c09ba68-550c-4828-b24a-ca98bb36bf74", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.0001)" ] }, { "cell_type": "code", "execution_count": 150, "id": "1ba03801-df7a-4133-a95e-0b31108a5784", "metadata": {}, "outputs": [], "source": [ "model11 = Sequential()\n", "\n", "model11.add(Embedding(total_words, # number of words to process as input\n", " 100, # output representation\n", " mask_zero = True,\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "model11.add(Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)))\n", "\n", "model11.add(Dropout(0.2)) \n", "\n", "model11.add(Bidirectional(tf.keras.layers.LSTM(100, return_sequences=False)))\n", "#model11.add(LSTM(100, return_sequences=False))\n", "\n", "model11.add(Dropout(0.2)) \n", "\n", "model11.add(Dense(1, activation='sigmoid')) \n", "\n", "model11.compile(optimizer=adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 151, "id": "df1c18bc-d115-49ab-9ebb-d168bc7916ca", 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_19\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_19 (Embedding) (None, 404, 100) 2093600 \n", " \n", " bidirectional_15 (Bidirect (None, 404, 200) 160800 \n", " ional) \n", " \n", " dropout_21 (Dropout) (None, 404, 200) 0 \n", " \n", " bidirectional_16 (Bidirect (None, 200) 240800 \n", " ional) \n", " \n", " dropout_22 (Dropout) (None, 200) 0 \n", " \n", " dense_19 (Dense) (None, 1) 201 \n", " \n", "=================================================================\n", "Total params: 2495401 (9.52 MB)\n", "Trainable params: 2495401 (9.52 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model11.summary()" ] }, { "cell_type": "code", "execution_count": 152, "id": "e874b88e-0c38-459b-8c42-2f10c1717828", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/20\n", "57/57 [==============================] - 50s 659ms/step - loss: 0.6930 - accuracy: 0.5283 - val_loss: 0.6927 - val_accuracy: 0.5319\n", "Epoch 2/20\n", "57/57 [==============================] - 31s 549ms/step - loss: 0.6920 - accuracy: 0.5795 - val_loss: 0.6923 - val_accuracy: 0.5253\n", "Epoch 3/20\n", "57/57 [==============================] - 31s 550ms/step - loss: 0.6907 - accuracy: 0.5894 - val_loss: 0.6914 - val_accuracy: 0.5341\n", "Epoch 4/20\n", "57/57 [==============================] - 31s 547ms/step - loss: 0.6882 - accuracy: 0.5867 - val_loss: 0.6890 - val_accuracy: 0.5714\n", "Epoch 5/20\n", "57/57 [==============================] - 31s 539ms/step - loss: 0.6813 - accuracy: 0.6170 - val_loss: 0.6808 - val_accuracy: 0.5802\n", "Epoch 6/20\n", "57/57 [==============================] - 31s 553ms/step - loss: 0.6519 - 
accuracy: 0.6714 - val_loss: 0.6339 - val_accuracy: 0.7033\n", "Epoch 7/20\n", "57/57 [==============================] - 31s 553ms/step - loss: 0.5884 - accuracy: 0.7083 - val_loss: 0.6056 - val_accuracy: 0.6791\n", "Epoch 8/20\n", "57/57 [==============================] - 32s 558ms/step - loss: 0.5603 - accuracy: 0.7138 - val_loss: 0.6003 - val_accuracy: 0.7121\n", "Epoch 9/20\n", "57/57 [==============================] - 32s 560ms/step - loss: 0.5395 - accuracy: 0.7331 - val_loss: 0.5960 - val_accuracy: 0.7011\n", "Epoch 10/20\n", "57/57 [==============================] - 33s 572ms/step - loss: 0.5202 - accuracy: 0.7463 - val_loss: 0.5958 - val_accuracy: 0.7121\n", "Epoch 11/20\n", "57/57 [==============================] - 32s 557ms/step - loss: 0.4953 - accuracy: 0.7639 - val_loss: 0.6112 - val_accuracy: 0.7055\n", "Epoch 12/20\n", "57/57 [==============================] - 32s 556ms/step - loss: 0.4686 - accuracy: 0.7931 - val_loss: 0.6228 - val_accuracy: 0.7077\n", "Epoch 13/20\n", "57/57 [==============================] - 33s 574ms/step - loss: 0.4359 - accuracy: 0.8079 - val_loss: 0.6151 - val_accuracy: 0.6857\n" ] } ], "source": [ "hist11 = model11.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 153, "id": "01f02f4c-5383-4e1b-a87a-748ba3c78ab0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAiwAAAGdCAYAAAAxCSikAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABVqklEQVR4nO3dd3wUdf7H8dfuphKSUEIahN6kY4BIV4kEVBRRBE5EUFARFYwN9AArqPcTORVBOBA8FVFEReBQzEk7ShCk9xpaAgHSSdvd3x8LwUgoG5LMJnk/H495ZGd2ZvazK2bf+c73+x2T3W63IyIiIuLCzEYXICIiInItCiwiIiLi8hRYRERExOUpsIiIiIjLU2ARERERl6fAIiIiIi5PgUVERERcngKLiIiIuDw3owsoCjabjRMnTuDr64vJZDK6HBEREbkOdrud1NRUQkNDMZuv3oZSJgLLiRMnCAsLM7oMERERKYSjR49So0aNq+5TJgKLr68v4HjDfn5+BlcjIiIi1yMlJYWwsLC87/GrKROB5eJlID8/PwUWERGRUuZ6unOo062IiIi4PAUWERERcXkKLCIiIuLyFFhERETE5SmwiIiIiMtTYBERERGXp8AiIiIiLk+BRURERFyeAouIiIi4PAUWERERcXkKLCIiIuLyFFhERETE5ZWJmx+KiIhIMTl3GLZ9CxnnoMcEw8pQYBEREZH8Ms7CjgWw9Vs4us6xzeIBXV6AClUMKUmBRURERCDnPOz5D2z9BvYvA1vuhSdMUKcLtHgQ3LwMK0+BRUREpLyyWeHQSkdI2fUTZKdeei64ObToB83uB79Q42q8oFCdbqdMmULt2rXx8vIiIiKC2NjYq+4/efJkGjVqhLe3N2FhYTz33HNkZmbe0DlFRESkEOx2OLEZfn4VJjWBf/eGLV85wop/Tej8PDy1Hp5cDR2ecYmwAoVoYZk3bx7R0dFMmzaNiIgIJk+eTFRUFHv27CEwMPCy/b/66itGjx7NrFmz6NChA3v37mXw4MGYTCYmTZpUqHOKiIiIky52nt36LSTuubTdqxI0vc/RmhIWAWbXHEBsstvtdmcOiIiIoG3btnz88ccA2Gw2wsLCeOaZZxg9evRl+z/99NPs2rWLmJiYvG3PP/8869evZ/Xq1YU651+lpKTg7+9PcnIyfn5+zrwdERGRsqugzrMAFk9o1NPRL6X+HeDmYUh5znx/O9XCkp2dzcaNGxkzZkzeNrPZTGRkJGvXri3wmA4dOvDFF18QGxtLu3btOHjwIEuWLOHhhx8u9DlFRETkCq6n8+xNvcDL39AyneVUYElMTMRqtRIUFJRve1BQELt37y7wmL/97W8kJibSqVMn7HY7ubm5PPnkk7zyyiuFPmdWVhZZWVl56ykpKc68DRERkbLlYufZbd/CzoUu3Xm2sIp9lNDy5cuZMGECn3zyCREREezfv5+RI0fy5ptvMnbs2EKdc+LEibz++utFXKmIiEgpYrfDyS2OkLJtPqTFX3rOvya06AvNH4TAxsbVWIScCiwBAQFYLBYSEhLybU9ISCA4OLjAY8aOHcvDDz/M0KFDAWjevDnp6ek8/vjjvPrqq4U655gxY4iOjs5bT0lJISwszJm3IiIiUjqV8s6zheVUYPHw8CA8PJyYmBh69+4NODrIxsTE8PTTTxd4TEZGBua/fGgWiwUAu91eqHN6enri6enpTOkiIiKlV8ZZ2PG9o1/KnzvPunlBwx6OkFI/0rDOsyXB6UtC0dHRPPLII7Rp04Z27doxefJk0tPTGTJkCACDBg2ievXqTJw4EYBevXoxadIkWrdunXdJaOzYsfTq1SsvuFzrnCIiIuVOvs6zv4It58ITpbvzbGE5HVj69evH6dOnGTduHPHx8bRq1YqlS5fmdZqNi4vL16Ly97//HZPJxN///neOHz9OtWrV6NWrF2+//fZ1n1NERKRcsNng8IWZZ8to59nCcnoeFlekeVhERKTUO3sIvn8y/yW
fMth59s+KbR4WERERKWJ2O2z+Ev7zMmSngUdFaN63zHaeLSwFFhEREaOkn4FFIx03HgSo1RHumwaVahpblwtSYBERETHC/l/hh6cgLQHM7nD7q9DhWTBbjK7MJSmwiIiIlKSc87BsHMROd6wHNIL7Z0BIS2PrcnEKLCIiIiXlxGZY8PilCd/aPQF3vA7u3oaWVRoosIiIiBQ3mxX+Nxl+m+iYT6ViMPSe4pjsTa6LAouIiEhxOncEvn8C4tY61m+6B3r9EypUMbauUkaBRUREpDjY7bDla1jyomMCOA9fuPM9aDkATCajqyt1FFhERESKWsZZWPQc7PzBsR52C/T5FCrXNrKqUk2BRUREpCgd+K9juHLqSTC7wa1joNNzGq58gxRYREREikLOefj1dVg/1bFetYFjuHJoa2PrKiMUWERERG7Uya2wYBic3u1YbzsM7ngDPCoYW1cZosAiIiJSWDYrrPkI/vuWY7iyTyDcOwUadje6sjJHgUVERKQwkuLg++FwZLVjvfHdjuHKPgHG1lVGKbCIiIg4w26Hbd/C4uchK8Vxd+Ue70DrgRquXIwUWERERK7X+XOwKBp2LHCs12jnGK5cpa6xdZUDCiwiIiLX4+Byx3DllONgssCto6FTNFj0VVoS9CmLiIhcTU4mxLwB66Y41qvUgz4zoEa4sXWVMwosIiIiVxK/3TFc+dROx3qbR6H7W+DhY2xd5ZACi4iIyF/ZbI4WlZg3wJoNPtXgno+hUQ+jKyu3FFhERET+LPkYfP8kHF7lWG/YE+75CCpWM7auck6BRURE5KJt82FxNGQmg3sF6DERbn5Ew5VdgAKLiIjI+STHvCrb5zvWq7eBPtOhaj1Dy5JLFFhERKR8O7TSMWNtyjHHcOWuL0HnFzRc2cXov4aIiJRPuVnw3zdhzceA3TH5W58ZUKON0ZVJARRYRESkfMnNhp0/wupJl4Yr3/wIRE0Az4rG1iZXpMAiIiLlQ8pJ+H0WbJwN6acc2ypUdQxXbnynoaXJtSmwiIhI2WW3Q9xaiJ0Ou34CW65ju28ItHnMMRGcT1Vja5TrosAiIiJlT3aG447KsTMgYdul7bU6Qrth0PhusLgbV584TYFFRETKjrOH4PeZsOnfkJnk2ObmDS0edASV4OaGlieFp8AiIiKlm80GB39ztKbsXQrYHdsr1XKElNYDwbuyoSXKjVNgERGR0ikzBbbMdfRPObP/0vZ63SDiCagfCWaLcfVJkVJgERGR0uX0HkdI2fI1ZKc5tnn4QuuHoO0wCKhvbH1SLBRYRETE9dmssOc/jqByaMWl7dUaOy77tOgHnr7G1SfFToFFRERcV8ZZ2DQHNsyC5DjHNpMZGt0J7R6HOl10Y8JywlyYg6ZMmULt2rXx8vIiIiKC2NjYK+576623YjKZLlvuuuuuvH0GDx582fM9evQoTGkiIlIWnNgMP4yASTfBr685wop3Fej0HIzcAv2/hLpdFVbKEadbWObNm0d0dDTTpk0jIiKCyZMnExUVxZ49ewgMDLxs/wULFpCdnZ23fubMGVq2bEnfvn3z7dejRw8+++yzvHVPT09nSxMRkdIsNxt2LXRc9jm6/tL2kJbQ7glodj+4exlXnxjK6cAyadIkhg0bxpAhQwCYNm0aixcvZtasWYwePfqy/atUqZJv/euvv6ZChQqXBRZPT0+Cg4OdLUdEREq7lJOO6fI3fgZpCY5tZndo2ttx2adGW7WkiHOBJTs7m40bNzJmzJi8bWazmcjISNauXXtd55g5cyb9+/fHx8cn3/bly5cTGBhI5cqVuf3223nrrbeoWrXg6ZKzsrLIysrKW09JSXHmbYiIiNHsdkcryvpPHa0qF6fMrxjsmC4/fDD4BhlaorgWpwJLYmIiVquVoKD8/4iCgoLYvXv3NY+PjY1l+/btzJw5M9/2Hj160KdPH+rUqcOBAwd45ZVX6NmzJ2vXrsViuXwM/cSJE3n99dedKV1ERFxBznnYNh9
iP4X4P02ZX7O9ozXlpl6aMl8KVKKjhGbOnEnz5s1p165dvu39+/fPe9y8eXNatGhBvXr1WL58Od26dbvsPGPGjCE6OjpvPSUlhbCwsOIrXEREbsy5I7DhX/DHv+H8Occ2Ny9o3tcRVEJaGFufuDynAktAQAAWi4WEhIR82xMSEq7Z/yQ9PZ2vv/6aN95445qvU7duXQICAti/f3+BgcXT01OdckVEXF1qPOxeBLsWwcHlXJoyv6ZjgrfWA6FClaudQSSPU4HFw8OD8PBwYmJi6N27NwA2m42YmBiefvrpqx777bffkpWVxcCBA6/5OseOHePMmTOEhIQ4U56IiBjtzIELIeUnOLYh/3P1bne0pjTorinzxWlOXxKKjo7mkUceoU2bNrRr147JkyeTnp6eN2po0KBBVK9enYkTJ+Y7bubMmfTu3fuyjrRpaWm8/vrr3H///QQHB3PgwAFeeukl6tevT1RU1A28NRERKXZ2u6MvysWQcmpn/udrtIXGdzv6plStZ0yNUiY4HVj69evH6dOnGTduHPHx8bRq1YqlS5fmdcSNi4vDbM4/H92ePXtYvXo1v/zyy2Xns1gsbN26lTlz5pCUlERoaCjdu3fnzTff1GUfERFXZLM6RvjsWgS7f4KkuEvPmSxQp7MjpDS+C/xCjatTyhST3W63G13EjUpJScHf35/k5GT8/PyMLkdEpOzJzYJDKx2tKHuWQPrpS8+5eUP9bo6Q0jBK/VLkujnz/a17CYmISMGy0mD/MkdLyr5fIOtPc155+UPDHo5LPfVuBw+fK59HpAgosIiIyCXpZ2Dvfxwh5cB/wXppkk4qBl3oj3I31O6s+VKkRCmwiIiUd8nHYPdix+WeI/8Du+3Sc1XqXuo0W70NmAt1z1yRG6bAIiJSHp3e65gSf/ciOPFH/ueCm0PjXo6WlMAmuo+PuAQFFhGR8sBudwSTXT85Qkri3j89aYKat1y63FO5tlFVilyRAouISFllzYW4NReGHy+GlGOXnjO7Q92ujks9je6EioHG1SlyHRRYRETKkpxMOPibI6TsWQLnz156zt0HGtzhCCkN7nCM9BEpJRRYRETKguRjsPYT2PQ5ZKde2u5dxdGCctPdUPdWcPc2rESRG6HAIiJSmiXshDUfwrZvwZbr2OZX/VJ/lJodwKJf9VL66V+xiEhpY7fDkTXwv3/Cvp8vba/dGTqNgnrdNLJHyhwFFhGR0sJmgz2LHUEl707IJmhyD3QcCdXDDS1PpDgpsIiIuLrcLNjytePSz5n9jm0WT2j1N+jwjO6CLOWCAouIiKvKTIbfZ8G6qZCW4Njm5Q9th0LEkxqKLOWKAouIiKtJOQnrPoHfP7s04sevOtzyFIQ/Ap6+xtYnYgAFFhERV3F6L6z5J2yZB7Ycx7ZqjR39U5o9AG4extYnYiAFFhERox2NdXSk3b0YsDu21ezgCCoNuuuGgyIosIiIGMNmg32/OIJK3JpL2xvf7QgqYe2Mq03EBSmwiIiUpNxs2D4f/vchnN7l2GZ2h5b9oMNIqNbQ2PpEXJQCi4hISchKhY1zHJ1pU447tnn4Qpshjs60fiHG1ifi4hRYRESKU9opWD8NNvzLMUwZoGIQ3DIc2jyqGxCKXCcFFhGR4nDmAKz5CDZ/BdYsx7aq9aHDs9CyP7h5GlufSCmjwCIiUpSOb4L/TYadC8kb8VO9jeMeP43u0ogfkUJSYBERuVF2OxyIgdWT4fCqS9sbRDlG/NTqoJsRitwgBRYRkcKy5sKO7x1DkxO2ObaZ3aB5X8c9foKaGlufSBmiwCIicj1sNkg9CecOO5azB2Drt5Ac53je3QfCBzs601YKM7BQkbJJgUVE5KLMFEg6cimU/HlJigNr9uXHVAhw3Iiw7WNQoUqJlitSniiwiEj5Yc2FlGNw7gqh5PzZqx9vdoNKNaFSLahcG6rf7Lj84+5d3JWLlHsKLCJSdtjtcP7chRaRAkJJ0lGwW69+jgoBUPlCIPn
r4hsKFv3aFDGC/s8TkdIlNxuSj8K5QwW0lByBrOSrH2/xdASSSgWFklrg6VvMb0BECkOBRURcW+I+WPuxYyK2c4cd09rbbVc/pmLwlVtJKgZrLhSRUkiBRURc15E1MLf/pSntL3KvcIUWktqOPiYeFUq8VBEpXgosIuKadv4I3w1zTGtfoy20e/xSKPGpponYRMoZBRYRcT3rpsLSMYAdGt8NfWao1USknFNgERHXYbPBsrGOPisAbYdCz/fAbDG2LhExXKF6nk2ZMoXatWvj5eVFREQEsbGxV9z31ltvxWQyXbbcddddefvY7XbGjRtHSEgI3t7eREZGsm/fvsKUJiKlVW4WLBh6KaxEvgZ3/p/CiogAhQgs8+bNIzo6mvHjx7Np0yZatmxJVFQUp06dKnD/BQsWcPLkybxl+/btWCwW+vbtm7fPe++9x4cffsi0adNYv349Pj4+REVFkZmZWfh3JiKlx/kk+Hcf2P4dmN3hvunQ6Tn1UxGRPCa73W535oCIiAjatm3Lxx87/gqy2WyEhYXxzDPPMHr06GseP3nyZMaNG8fJkyfx8fHBbrcTGhrK888/zwsvvABAcnIyQUFBzJ49m/79+1/znCkpKfj7+5OcnIyfn58zb0dEjJZ8DL54AE7vAg9f6PdvqHeb0VWJSAlw5vvbqRaW7OxsNm7cSGRk5KUTmM1ERkaydu3a6zrHzJkz6d+/Pz4+PgAcOnSI+Pj4fOf09/cnIiLiiufMysoiJSUl3yIipVDCDvjXHY6wUjEYhixRWBGRAjkVWBITE7FarQQFBeXbHhQURHx8/DWPj42NZfv27QwdOjRv28XjnDnnxIkT8ff3z1vCwnRnVJFS59BKmNUDUk9AQCMY+iuEtDC6KhFxUSU63ePMmTNp3rw57dq1u6HzjBkzhuTk5Lzl6NGjRVShiJSIbfMdfVayUqBmB3jsZ6ikPzxE5MqcCiwBAQFYLBYSEhLybU9ISCA4OPiqx6anp/P111/z2GOP5dt+8Thnzunp6Ymfn1++RURKAbsd/vdP+O4xsOVAk97w8PfgXdnoykTExTkVWDw8PAgPDycmJiZvm81mIyYmhvbt21/12G+//ZasrCwGDhyYb3udOnUIDg7Od86UlBTWr19/zXOKSClis8J/XoZl4xzrtzwFD3wG7l7G1iUipYLTE8dFR0fzyCOP0KZNG9q1a8fkyZNJT09nyJAhAAwaNIjq1aszceLEfMfNnDmT3r17U7Vq1XzbTSYTo0aN4q233qJBgwbUqVOHsWPHEhoaSu/evQv/zkTEdeSchwXDYNdPjvWoCdB+hLE1iUip4nRg6devH6dPn2bcuHHEx8fTqlUrli5dmtdpNi4uDvNf7oS6Z88eVq9ezS+//FLgOV966SXS09N5/PHHSUpKolOnTixduhQvL2P/8rLZ7Nw/bQ3uFjMeFjPuFhPuFjPubn9Zt5jxcPvL+sXn3f68fmlbvvW84x3rHvnWHdtMmo9CSquMszB3ABxdBxYPuG8aNLvf6KpEpJRxeh4WV1Rc87Bk5lhpPHZpkZ3vRvw5DDkCkAlPdwuebuYLiwVP9z89djNfWL/4+E/7XuM4L/c/b7+0r0KTOO3cEfjyAUjcC57+0P9LqNPZ6KpExEU48/2tewldhZvZxIxBbcix2six2sjOtZFjtV9at9rIyb20npVry3ucY7VfeP4v6xeXC8dl/2n/nFwbWRfW/xojHa9rBayGfBYAHm5XDkQXQ463uwVfLzf8vN0dP73cL1v383LHz9uNip5uuFlKdKCalKSTW+DLvpCWAH7VYeB3EHiT0VWJSCmlwHIVbhYzdzQJuvaORcxut2O12QsMOdl/Ck/ZVhtZOTaycq1k5V74meMITpk5l2/L2y/fMVc5Ljd/cMrOdbxuKrlF9l59PCz4Xggwvl7u+Hm5/WW9oMBzad3b3aKWH1e0Pwa+GQTZaRDYFAbOB79Qo6sSkVJMgcUFmUwm3Cwm3CzgjXE3frP
bHaEpX7jJucLjXCuZOTbOZ+eSkplLSmYOqZm5pJy/8PMv6+dzHC1F6dlW0rOtxBdysmI3s6ngFh0v97zg4+flTo3K3jQI8qVmlQpYzAo4xWrzV7DwGbDlQp0u0O8L8PI3uioRKeUUWOSKTCYTHm4mPNzM+BbxuXOstgICTQ4p5x2PUzJz861ffJyadeFnZg42O+Ta7JzLyOFcRs51va6Hm5l61SrSIPDCElSRBkG+1KpSQZenbpTdDqv+D/77lmO9eV+49xNw8zC2LhEpExRYxBDuFjNVfDyo4lO4LzO73U56tvVSkMnMydeK8+dWnuSMHA6fSWf/qTSycm3sOpnCrpP5m3TcLSbqBlSkflBFGgb6OoJMYEVqVfXBw01B5pqsubDkBdj4mWO94yjoNh7M+uxEpGgosEipZDKZqOjp6Lgbcp1XG6w2O8fOZbAvIY19p9LYdyqVfQlp7D+VxvkcK3sSUtmTkMpiTuYd42Y2UTvAh4ZBFakf6JvXKlMnwAdPN+Mu17mU7AyY/yjs/Q9ggp7vQcTjRlclImWMhjVLuWez2TmedJ79F0LM3guBZn9CKunZBY/KsphN1Kpa4cKlJUeLTP3AitSrVhEv93IUZNIT4at+cPx3cPOCPjOgyT1GVyUipYQz398KLCJXYLfbOZmc6WiNSUi90DKTyr5TaaRmFjxSymyCmlUqOFpjgirSMMgRaOpVq4i3RxkLMmcPwhf3O356V4YBX0PNW4yuSkRKEQUWkWJkt9s5lZrFvoQ09iY4Asz+Cy0zyecL7vxrMuEYqZTXP8aXZtX9aBxcSv+9Ht8IXz4IGYlQqSY89B1Ua2h0VSJSyiiwiBjAbreTmJbtaI35Sx+ZM+nZBR5z/801eO2eJvh6uZdwtTdg78/w7WDIyYDgFvDQfPAt+fmKRKT0U2ARcTFn0rIuhBhH35i9CWmsP3QGmx3CqngzuV8rwmtVMbrMa9s4BxY9B3Yr1OsGD84Bz6Ie9C4i5YUCi0gpEHvoLM/N28zxpPOYTfD07Q149vb6rjkfjN0OyyfCincd6y3/Bvd8CJZS1DIkIi7Hme9vF/zNKFI+tKtThf+M6sx9ratjs8OHMft4YNpaDiemG11aftYc+PHpS2Gly4vQ+xOFFREpUQosIgby83Lng36t+HBAa3y93Nh8NIk7P1zFNxuO4hKNn1lpMLc/bP4CTGa4+wO4/e+OXsQiIiVIgUXEBdzTMpSlo7oQUacKGdlWXvpuK8O/2MS5K3TWLRFpp2D2XbD/V3Dzhv5fQZtHjatHRMo1BRYRF1G9kjdfDbuFl3s0xs1sYumOeHr8cyWr9yWWfDGJ++BfkXByM1SoCoMXQ6OeJV+HiMgFCiwiLsRiNjH81np8/1RH6lbzISEli4Ez1/Pmop1k5hQ8626ROxoLM7tD0hGoXAceWwY1wkvmtUVErkCBRcQFNa/hz+JnOjPwlpoAzFx9iN5T/see+NTifeFdi2BOLzh/FkJvdoSVqvWK9zVFRK6DAouIi/L2sPBW7+bMfKQNVX082B2fSq+PVzNr9SFstmLokHt4NXwzCHIzoUEUDF4EFasV/euIiBSCAouIi+t2UxBLR3XhtkbVyM618cainTzyWSynUjKL7kVSExx3XLZboWkfRwdbD5+iO7+IyA1SYBEpBar5ejJrcFvevLcpnm5mVu1LJGrySn7eEX/jJ7fmwnePQVoCVLsJ7v0YLG43fl4RkSKkwCJSSphMJh5uX5vFz3aiaagf5zJyeOLfGxn93VbSswq+e/R1WT4BDq8Cj4rQ799qWRERl6TAIlLK1A/05funOvJE17qYTPD1hqPc9eEqNh9Ncv5ke3+GVe87Ht/zIQQ0KNJaRUSKigKLSCnk4WZmTM+b+HJoBCH+Xhw+k8H9U9fwUcw+rNfbITcpDhY87njc7nFodn/xFSwicoMUWERKsQ71Alg6sgt3twjBarPz/rK99Pt0LUfPZlz9wNws+HYwZCY5hi93f6skyhU
RKTQFFpFSzr+COx8NaM2kB1tS0dON34+co+c/V7Fg07Er34/ol7/D8Y3gVQkenANuniVas4iIsxRYRMoAk8lEn5tr8J+RnWlTqzJpWblEf7OFZ+b+QXJGTv6dt38HsdMdj/tMh0o1S75gEREnKbCIlCFhVSrw9eO38PwdDbGYTSzaepKe/1zJ2gNnHDsk7oOFzzoed34eGkYZV6yIiBNMdpe4h/2NSUlJwd/fn+TkZPz8/IwuR8QlbD6axKiv/+DwmQxMJhjRMYToI8Mxn94NtTvDwz9ovhURMZQz399qYREpo1qFVWLxs53p3zYMu91OnfXjMJ/eTW6FQLh/psKKiJQqCiwiZZiPpxvv3N+CRR0Pcb9lFVa7iSGpT/Lv7eev3CFXRMQFKbCIlHUnt9Bs85sAzK80hFU5jRn74w4em/M7p1OzDC5OROT6KLCIlGXnk+CbR8CaBQ170PfZ/2Pc3U3wcDPz392n6DF5JTG7EoyuUkTkmhRYRMoqux1+HAHnDoF/Teg9FbPFwqOd6rDw6Y40DvblTHo2j835nb//sI3z2VajKxYRuaJCBZYpU6ZQu3ZtvLy8iIiIIDY29qr7JyUlMWLECEJCQvD09KRhw4YsWbIk7/nXXnsNk8mUb2ncuHFhShORi9ZOgd2LwOLhmByuQpW8pxoH+/HDiI481qkOAF+si+Puj1ax/XiyUdWKiFyV04Fl3rx5REdHM378eDZt2kTLli2Jiori1KlTBe6fnZ3NHXfcweHDh5k/fz579uxhxowZVK9ePd9+TZs25eTJk3nL6tWrC/eORATi1sGycY7HUROg+s2X7eLlbmHs3U3492PtCPLz5MDpdO775H98vvZwydYqInIdnB7XOGnSJIYNG8aQIUMAmDZtGosXL2bWrFmMHj36sv1nzZrF2bNnWbNmDe7u7gDUrl378kLc3AgODna2HBH5q7TT8O0QsFuh2QPQduhVd+/coBpLR3ZhzIJtLN0Rz/iFO2ga6kd4rSpXPU5EpCQ51cKSnZ3Nxo0biYyMvHQCs5nIyEjWrl1b4DELFy6kffv2jBgxgqCgIJo1a8aECROwWvNfL9+3bx+hoaHUrVuXhx56iLi4uCvWkZWVRUpKSr5FRACbFRYMhdQTENAQev0TTKZrHlbZx4OpA2/m/ptrYLfDi/O3kpmjPi0i4jqcCiyJiYlYrVaCgoLybQ8KCiI+Pr7AYw4ePMj8+fOxWq0sWbKEsWPH8v777/PWW5fuDhsREcHs2bNZunQpU6dO5dChQ3Tu3JnU1NQCzzlx4kT8/f3zlrCwMGfehkjZteI9OLgc3CvAg5+DZ8XrPtRkMjHu7iZU8/Xk4Ol0PozZV3x1iog4qdhHCdlsNgIDA5k+fTrh4eH069ePV199lWnTpuXt07NnT/r27UuLFi2IiopiyZIlJCUl8c033xR4zjFjxpCcnJy3HD16tLjfhojr2x8DK951PL57MgTe5PQp/Cu481bvZgB8uvIg246pE66IuAanAktAQAAWi4WEhPzzNiQkJFyx/0lISAgNGzbEYrHkbbvpppuIj48nOzu7wGMqVapEw4YN2b9/f4HPe3p64ufnl28RKdeSj8F3QwE7hA+Blv0KfaqopsHc3SIEq83Oi/O3kJ1rK7o6RUQKyanA4uHhQXh4ODExMXnbbDYbMTExtG/fvsBjOnbsyP79+7HZLv3S27t3LyEhIXh4eBR4TFpaGgcOHCAkJMSZ8kTKJ2uOo5Pt+bMQ0hJ6vHPDp3z9nqZUruDO7vhUpq04UARFiojcGKcvCUVHRzNjxgzmzJnDrl27GD58OOnp6XmjhgYNGsSYMWPy9h8+fDhnz55l5MiR7N27l8WLFzNhwgRGjBiRt88LL7zAihUrOHz4MGvWrOG+++7DYrEwYMCAIniLImXcsvFwLBY8/aHvHHD3uuFTVq3oyWv3NAXgo//uY098wf3JRERKitPDmvv168fp06cZN24c8fHxtGrViqVLl+Z1xI2
Li8NsvpSDwsLC+Pnnn3nuuedo0aIF1atXZ+TIkbz88st5+xw7dowBAwZw5swZqlWrRqdOnVi3bh3VqlUrgrcoUobt/BHWTXE8vm8qVKlTZKe+p2UoP205ya+7Enhp/ha+G94BN4smxxYRY5jsZeCWrSkpKfj7+5OcnKz+LFJ+nDkA02+FrBTo8Cx0f7PIXyIhJZPISStIzczllTsb83iXekX+GiJSfjnz/a0/l0RKo5zz8M0gR1ip2R66jSuWlwny82LsXU0AeP+XvRw8nVYsryMici0KLCKl0ZIXIWE7+FSDBz4Di3uxvVTfNjXo3CCArFwbo7/bhs1W6htlRaQUUmARKW3++BL++Ddggvv/BX7FO5rOZDIx4b7mVPCwEHv4LF+sP1KsryciUhAFFpHSJH47LH7e8fi2V6HurSXysmFVKjC6p+MO6u/8ZzdHz2aUyOuKiFykwCJSWmSmOPqt5J6H+pHQ+fkSffmBEbVoV7sKGdlWXvl+G2Wgv76IlCIKLCKlgd0OC5+BswfArwbcNx3MJfu/r9ls4p37m+PpZmbVvkS+/f1Yib6+iJRvCiwipcH6T2HnD2B2h76zwaeqIWXUrVaR6DsaAvDm4p0kpGQaUoeIlD8KLCKu7ugG+OXvjsfd34KwtoaW81inOrSs4U9qZi6vfr9dl4ZEpEQosIi4soyz8O1gsOVAk94Q8YTRFeFmMfPeAy1xt5j4dVcCP209aXRJIlIOKLCIuCqbDRYMg5RjUKUe3PMRmExGVwVAo2Bfnr6tAQCvLdzBmbQsgysSkbJOgUXEVa16H/b/Cm5e8ODn4OVat50Yfms9Ggf7cjY9m/ELdxhdjoiUcQosIq7o4HJYPsHx+K5JENzM0HIK4uFm5h8PtMRiNrFo60l+3hFvdEkiUoYpsIi4mpST8N1QsNug9UBo/ZDRFV1R8xr+PN6lLgB//2E7yRk5BlckImWVAouIK7HmwPwhkH4agprBnf9ndEXXNLJbA+pW8+F0ahZvLd5pdDkiUkYpsIi4kpg3IG4tePg6+q24extd0TV5uVv4xwMtMJng243HWLH3tNEliUgZpMAi4ip2L4Y1Hzoe954CVesZW48TwmtVYXCH2gC8smAbaVm5xhYkImWOAouIKzh7CL4f7nh8y1PQ5F5j6ymEF6MaEVbFm+NJ53n3P7uNLkdEyhgFFhGj5WTCt49AVjLUaAeRrxtdUaFU8HDj3T4tAPj3uiOsO3jG4IpEpCxRYBEx2tLRcHILeFeBvp+Bm4fRFRVah/oBDGhXE4DR323lfLbV4IpEpKxQYBEx0pZ5sPEzwAT3zwD/GkZXdMPG3NmYYD8vDp/JYNKyPUaXIyJlhAKLiFFO7YJFoxyPu74E9SMNLaeo+Hm5M6GPY6K7masP8UfcOYMrEpGyQIFFxAhZafDNIMjJgLq3QteXja6oSN3eOIg+ratjs8NL87eSlatLQyJyYxRYREpaajzM7Q+Je8E3BPr8C8wWo6sqcmPvbkJARQ/2nUrj4//uN7ocESnlFFhEStLeX2BqBzi8CtwrQN/ZULGa0VUVi8o+Hrx5r+PS0NTlB9hxItngikSkNFNgESkJuVmwdAx81RcyzkBwc3hiJdS8xejKilXP5iH0bBZMrs3OS/O3kmO1GV2SiJRSCiwixS1xH/yrG6z7xLEeMRyGxkBAA2PrKiGv39uUShXc2XEihekrDxpdjoiUUgosIsXFboc/voBPu0D8NqhQFQbMg57vgJun0dWVmEBfL8b3agLAP3/dx/5TqQZXJCKlkQKLSHHITIbvhsKPIxwjgep0gSf/B416GF2ZIXq3qs5tjaqRbbXx4vytWG12o0sSkVJGgUWkqB37HaZ1hu3zwWSBbuPg4R/AL8ToygxjMpmY0Kc5vp5u/BGXxOw1h40uSURKGQUWkaJis8HqD2BWFCQdgUo14dGfofPzZXLYsrNC/L155a6bAPjHz7s5cibd4IpEpDRRYBEpCqnx8MV98OtrYMu
Fpn3gydUQ1tboylxK/7ZhdKhXlcwcGy9/txWbLg2JyHVSYBG5URfnVjm43DG3yj0fwwOzwMvf6Mpcjslk4p0+LfB2t7Du4FnmbogzuiQRKSUUWEQKKzcLlr5yaW6VoObw+Aq4+WEwmYyuzmXVrFqBF6MaATBxyW5OJJ03uCIRKQ0UWEQKI3E//CsS1k1xrEcMh6G/QrWGxtZVSjzSoTbhtSqTlpXLK99vw27XpSERubpCBZYpU6ZQu3ZtvLy8iIiIIDY29qr7JyUlMWLECEJCQvD09KRhw4YsWbLkhs4pYgi7Hf748sLcKlvBu8qluVXcvYyurtSwmE28e38LPNzMLN9zmgWbjhtdkoi4OKcDy7x584iOjmb8+PFs2rSJli1bEhUVxalTpwrcPzs7mzvuuIPDhw8zf/589uzZw4wZM6hevXqhzyliiMyUC3OrPAU56VC7Mwwvv3Or3Kj6gRUZFemY7feNRTs5lZppcEUi4spMdifbYiMiImjbti0ff/wxADabjbCwMJ555hlGjx592f7Tpk3jH//4B7t378bd3b1IzvlXKSkp+Pv7k5ycjJ+fnzNvR+T6HNsI84c4hiubLHDbK9DpOQ1XvkG5Vhv3fbKGbceTiWoaxLSB4ZjU/0ek3HDm+9upFpbs7Gw2btxIZGTkpROYzURGRrJ27doCj1m4cCHt27dnxIgRBAUF0axZMyZMmIDVai30ObOyskhJScm3iBQLmw1WT4ZZ3R1hxb8mPLoUurygsFIE3Cxm3r2/BW5mEz/vSGDJtnijSxIRF+VUYElMTMRqtRIUFJRve1BQEPHxBf+iOXjwIPPnz8dqtbJkyRLGjh3L+++/z1tvvVXoc06cOBF/f/+8JSwszJm3IXJ98uZWGX9hbpX74MlVENbO6MrKlCahfjx1W30Axi/cztn0bIMrEhFXVOyjhGw2G4GBgUyfPp3w8HD69evHq6++yrRp0wp9zjFjxpCcnJy3HD16tAgrFgH2LYOpHf80t8pH8MBn4F3J6MrKpKdvq0+jIF8S07J546cdRpcjIi7IqcASEBCAxWIhISEh3/aEhASCg4MLPCYkJISGDRtisVxqPr/pppuIj48nOzu7UOf09PTEz88v3yJSJC7OrfLlA5CRCEHN4PHlcPMgza1SjDzczLz3QAvMJvhh8wlidiVc+yARKVecCiweHh6Eh4cTExOTt81msxETE0P79u0LPKZjx47s378fm82Wt23v3r2EhITg4eFRqHOKFIvE/TDzjktzq7R7AobGQLVGxtZVTrQMq8SwznUBeOX7bSSfzzG4IhFxJU5fEoqOjmbGjBnMmTOHXbt2MXz4cNLT0xkyZAgAgwYNYsyYMXn7Dx8+nLNnzzJy5Ej27t3L4sWLmTBhAiNGjLjuc4oUK7sdNn/lmFvl5JYLc6t8DXe+p7lVSthzdzSkToAPCSlZTFyyy+hyRMSFuDl7QL9+/Th9+jTjxo0jPj6eVq1asXTp0rxOs3FxcZjNl3JQWFgYP//8M8899xwtWrSgevXqjBw5kpdffvm6zylSbDJTYHE0bPvWsV67M/SZDn6hxtZVTnm5W3j3/hY8+Olavt5wlLtbhNKpQYDRZYmIC3B6HhZXpHlYpFCObYTvHoVzhy/MrTIGOkVruLILGP/jduasPUKNyt78PKoLPp5O/20lIqVAsc3DIlIm/HlulXOHHXOrDPkPdHlRYcVFvNSjMdUreXPs3Hn+8fMeo8sRERegwCJFJ+UEHPsdTu+FtFOQ64LzaaQmwBd9Ls2t0qS3Y26VmhFGVyZ/4uPpxjv3NwdgztrDbDh81uCKRMRoameVwrHbIXEfxK2BI2sdP5PiLt/PvQJ4VXLMX3I9P738Lz0u6g6v+5bB9086hiu7eUPPdzVc2YV1blCNfm3CmPf7UV6ev5UlIzvj5a4WMJHySoFFro8113F34ri1cGQNxK1zfPH/mckMvqGQlQpZyY5tORmOJfWE86/p5nUDYcf7UhDJzYK
YN2Ct415VBDWDB2ZpuHIp8MpdN7F87ykOJqYzYMY6/q9vS+pVq2h0WSJiAHW6lYJlZ8Dx3y+0nqyFYxsgOy3/Pm5eUL0N1GoPNds7pqz39HU8Z7NCZjJkJsH5pD/9LGhbAftwg/8sLR6XwkxOJiRfaP1p9zjc8aaGK5cia/Yn8sS/N5KalYunm5kXoxrxaMc6mM1qGRMp7Zz5/lZgEYeMs3B0/YXWk7VwYjPY/jJxl5c/hN1yIaB0gNBW4OZZ9LXYbJCV4ggwmcnXCDd//ZkMduvl5/SuDPd+Ao3vLPp6pdidSDrPy99tZdU+R6teu9pV+EffFtSq6mNwZSJyIxRY5NqSj//p8s5aOLXz8n18QxwtJ7U6OH4GNgGzi/fTttsdLUF/DjHZaVCjHfhUNbg4uRF2u525sUd5e/FO0rOteLtbGHNnYwZG1FJri0gppcAi+V1vB9mqDS61ntS8BSrXVodUcTlHz2bw0vytrD14BoD2davy3gMtCKtSweDKRMRZCizlnTUX4rdc6n8StxYyzuTfx2SG4BaXWk9qtoeK1YypV8RJNpudL9YfYeKS3ZzPseLjYeHVu5owoF0YJoVskVJDgaW8yddBdg0c3QA56fn3uVoHWZFS6siZdF78diuxF+Zp6dwggHfvb0FoJW+DKxOR66HAUtbZ7XAgBg6uML6DrIjBbDY7n605zHtLd5OVa8PX042xvZrQN7yGWltEXJwCS1n3y99hzUf5t5XGDrIiRejA6TRe+HYLf8QlAXBbo2q8c38Lgvw0hF3EVSmwlGVbvobvn3A8bvk3qNPZEVDUQVYEq83Ov1Yd5P1le8nOteHn5cbr9zald6vqam0RcUEKLGXV8Y0wqydYs6DzC9BtrNEVibikfQmpvPDtFrYcc8y4fEeTICbc15xqvrosKuJKdLfmsig1Hr5+yBFWGvaE2141uiIRl9UgyJfvhnfgxahGuFtMLNuZQPcPVvDTlhOUgb/RRMolBZbSIDcL5j0MqSchoBH0ma7+KSLX4GYxM+K2+ix8uhNNQvw4l5HDM3P/YMRXmziTlmV0eSLiJH3ruTq7HRZHw7FYx8ifAXPBqwxf9hIpYjeF+PHj0x0Z2a0BbmYTS7bF0/2DlSzdftLo0kTECQosri52OvzxhWOitwdmQdV6RlckUuq4W8w8d0dDfhjRkcbBvpxJz+bJLzbx7Nw/OJeebXR5InIdFFhc2cEVsHSM4/Edb0D9SGPrESnlmlX358enO/L0bfWxmE0s3HKC7pNX8uvOBKNLE5FrUGBxVecOw7ePOO483KIftH/a6IpEygRPNwsvRDViwfAO1A+syOnULIZ+/jvPf7OF5PM51z6BiBhCgcUVZaXB3L/B+XMQ2hp6/VNzrIgUsZZhlVj0TCee6FIXkwm+23SM7h+s4Lc9p4wuTUQKoMDiamw2+GE4nNoBPoHQ70tw131RRIqDl7uFMXfexPwn21MnwIeElCyGfLaBl+dvJTVTrS0irkSBxdWs+j/YtRDM7tDvC/CvbnRFImVeeK0qLHm2M492rIPJBPN+P0rUBytZvS/R6NJE5AIFFleyezH89rbj8d2ToGaEsfWIlCPeHhbG9WrC18NuoWaVCpxIzmTgzPX8/YdtpGflGl2eSLmnwOIqTu2CBY87Hrd7Am4eZGw9IuVURN2qLB3VmUHtawHwxbo4oiavZO2BMwZXJlK+KbC4goyzMHcAZKdB7c4Q9bbRFYmUaxU83Hjj3mZ8NTSC6pW8OXbuPANmrOO1hTvIyFZri4gRFFiMZs2F+UPg3CGoVBP6zgGLu9FViQjQoX4AS0d1ZkC7mgDMXnOYO/+5it8PnzW4MpHyR4HFaMvGwcHl4F4B+s8Fn6pGVyQif+Lr5c7EPs2Z82g7gv28OHwmg76fruXtxTvJzLEaXZ5IuaHAYqTNX8G6KY7H902D4GbG1iMiV9S1YTV+fq4LfcNrYLfDjFWHiJq8kgWbjpFrtRldnkiZZ7KXgXutp6S
k4O/vT3JyMn5+peTGgMd+h8/uBGsWdHkJbn/V6IpE5Dr9d3cCo7/bxqlUx12f6wb48Ey3+vRqEYqbRX8HilwvZ76/FViMkHISpt8KafHQ6E7H5HBm/ZITKU3Ss3L5fO0Rpq88wLkMxyRzCi4izlFgcWU5mTD7Ljj+O1RrDI8tAy8Xr1lErigtK5fP1x5mxsqDCi4iTlJgcVV2O/w4AjZ/CV6V4PHfoEpdo6sSkSJwteByT8vqWMy6H5jIXymwuKp102Dpy2Ayw8DvoN7tRlckIkXsSsHl2W4N6NUyVMFF5E+c+f4uVFvllClTqF27Nl5eXkRERBAbG3vFfWfPno3JZMq3eHl55dtn8ODBl+3To0ePwpTmug4uh59fcTzu/pbCikgZVdHTjadurc+ql2/npR6NqFTBnYOJ6Yyat5k7Jq3ghz+OY7WV+r8TRUqc04Fl3rx5REdHM378eDZt2kTLli2Jiori1Kkr35Ldz8+PkydP5i1Hjhy5bJ8ePXrk22fu3LnOlua6zh6CbweD3QotB8AtTxldkYgUs4vBZfXLt/Ni1F+CywcKLiLOcjqwTJo0iWHDhjFkyBCaNGnCtGnTqFChArNmzbriMSaTieDg4LwlKCjosn08PT3z7VO5cmVnS3NNWanw9d/g/DkIvRnungwmNQmLlBcVPd0YcdtfgstpBRcRZzkVWLKzs9m4cSORkZGXTmA2ExkZydq1a694XFpaGrVq1SIsLIx7772XHTt2XLbP8uXLCQwMpFGjRgwfPpwzZ658o7GsrCxSUlLyLS7JZoPvn4RTO6FiEPT/Ety9rn2ciJQ51wouP25WcBG5GqcCS2JiIlar9bIWkqCgIOLj4ws8plGjRsyaNYsff/yRL774ApvNRocOHTh27FjePj169ODzzz8nJiaGd999lxUrVtCzZ0+s1oKnvZ44cSL+/v55S1hYmDNvo+SsfA92LwKLB/T7AvxCja5IRAx2peAy8msFF5GrcWqU0IkTJ6hevTpr1qyhffv2edtfeuklVqxYwfr16695jpycHG666SYGDBjAm2++WeA+Bw8epF69evz6669069btsuezsrLIysrKW09JSSEsLMy1Rgnt+gnmDXQ8vncKtB5obD0i4pJSM3P4fO0RZqw6SNLFUUXVfBjZrQF3t9CoIinbim2UUEBAABaLhYSEhHzbExISCA4Ovq5zuLu707p1a/bv33/FferWrUtAQMAV9/H09MTPzy/f4lISdsCCJxyPI55UWBGRK/L1cmfEbfVZ9dJtanERuQqnAouHhwfh4eHExMTkbbPZbMTExORrcbkaq9XKtm3bCAkJueI+x44d48yZM1fdx2VlnIW5AyAnHep0ge5vG12RiJQCVwsu3RVcRJwfJRQdHc2MGTOYM2cOu3btYvjw4aSnpzNkyBAABg0axJgxY/L2f+ONN/jll184ePAgmzZtYuDAgRw5coShQ4cCjg65L774IuvWrePw4cPExMRw7733Ur9+faKioorobZYQa65j+HLSEahUC/rOAYub0VWJSCny1+Di7+3OAQUXEZz+Nu3Xrx+nT59m3LhxxMfH06pVK5YuXZrXETcuLg7zn27kd+7cOYYNG0Z8fDyVK1cmPDycNWvW0KRJEwAsFgtbt25lzpw5JCUlERoaSvfu3XnzzTfx9PQsordZQpaNhUMrwN0HBsyFClWMrkhESqmLwWVQ+1rMWXOYGasO5QWXD2P28az6uEg5o6n5i8ofX8KPFyaEe/Df0OQeY+oQkTIpNTMnL7gkn3d0zq1XzUfBRUo13UuopB37HT7rCdZs6Doabhtz7WNERAqhoOBSP7Aiz3ZrwF3NQxRcpFRRYCnRFz8J02+FtHhofLejdcWs28mLSPFScJGyQIGlpORkwuw74fhGqHYTDF0Gnr4l9/oiUu6lZOYw53+H+dfqS8GlToAPj3epy32tq+PlbjG4QpErU2ApCXY7/PAUbPkKvCrB479Blbol89oiIn9RUHC
p5uvJox3r8NAtNfHzcje4QpHLKbCUhLWfwM9jwGSBgd9BvdtK5nVFRK4iPSuXubFxzFx9iJPJmQD4errxt1tq8ljHOgT66X5m4joUWIrbgd/giz5gt0GPd+CW4cX/miIiTsjOtbFwywk+XXGAfafSAPCwmOlzc3Ue71KXutUqGlyhiAJL8b7Y2YMw/TbITIJWDznuE2RS5zYRcU02m53/7j7FtBUH+P3IOcDxK6tH02Ce7FqPlmGVjC1QyjUFluKSlQr/ugNO74LqbWDwYnBX86qIlA6/Hz7LtBUH+HXXqbxt7etW5clb69GlQQAm/fElJUyBpTjYbPDNw7B7EVQMhseXg18pvNeRiJR7exNSmbbiAAs3nyD3wjT/TUL8ePLWetzZLBg3i6ZmkJKhwFIcfpsIK94BiwcM+Q/UaFM8ryMiUkKOJ51n5qpDfL0hjoxsKwBhVbx5vHNd+rYJ05BoKXYKLEVt50JH6wpA76nQ6m9F/xoiIgZJysjm87VHmL3mMGfTswGo6uPB4A61GdS+Nv4VNCRaiocCS1FK2OHot5KTDrc8BT0mFu35RURcxPlsK99uPMr0lQc5du48ABU8LAxoV5PHOtUhtJK3wRVKWaPAUlTSz8CMWyEpDureCg99Bxanb3AtIlKq5FptLN52kqnLD7A7PhUAN7OJe1tV58mudWkQpBm9pWgosBSVswfhywfBlgPDfoMKVYru3CIiLs5ut7Ni72mmrTjAuoNn87ZH3hTE8FvrEl5LvxPlxiiwFKXMZEhPhKr1iva8IiKlyOajSUxbfoCfd8Zz8Vujbe3KPNm1Hrc1CsSsmy1KISiwiIhIsThwOo0ZKw+yYNNxsq02ABoGVeSJLvW4p1Uo7hoSLU5QYBERkWKVkJLJrNWH+HJ9HGlZuQBUr+TNY53q0L9dGBU81N9Prk2BRURESkTy+Ry+XH+EWasPk5iWBUClCu4Mal+bwR1qU8XHw+AKxZUpsIiISInKzLGyYNNxpq88wOEzGQB4uZvp1yaMoZ3rElalgsEViitSYBEREUNYbXaWbo9n2ooDbDueDIDFbOKelqG8GNVIc7lIPgosIiJiKLvdzpoDZ5i24gCr9iUC4O1u4Zlu9RnaqS4ebuqcKwosRpcjIiJ/su1YMm8s2sGGw+cAqBvgw2v3NKVLw2oGVyZGU2ARERGXYrfb+WHzcd5evDuvc26PpsGM7dWE6rpMVG458/2tNjkRESl2JpOJ+1rX4L8vdOXRjnWwmE0s3RFPt/eXM+W3/WTlWo0uUVycWlhERKTE7Y5PYdyPO4g95Jjyv06AD+N7NeHWRoEGVyYlSZeERETE5dntdhZuOcHbi3dxKtVxmah7kyDG3t1Ew6DLCV0SEhERl2cyOe4AHfN8V4Z2clwm+mVnAnd8sIKPYvaRmaPLRHKJWlhERMQl7E1IZdyP2/PuDF2ragVe69WU2xrrMlFZpRYWEREpdRoG+TJ32C18OKA1QX6eHDmTwZDZGxg653eOns0wujwxmAKLiIi4DJPJMStuzPO38niXuriZTfy6K4HISSuY/OteXSYqx3RJSEREXNa+hFTGL9zBmgNnAKhZpQLjezWh201BBlcmRUGjhEREpMyw2+0s3naStxbtIj4lE4BujQMZ36spNatqNFFppj4sIiJSZphMJu5uEUrM8115sms93C0mYnafIvKDFXywTJeJygu1sIiISKmy/1Qary3cwer9jpsqhlXxZtzdTYm8KRCTyWRwdeKMYm9hmTJlCrVr18bLy4uIiAhiY2OvuO/s2bMxmUz5Fi8vr3z72O12xo0bR0hICN7e3kRGRrJv377ClCYiImVc/cCK/Puxdnzy0M2E+Htx9Ox5hn3+O4/O3sDhxHSjy5Ni4nRgmTdvHtHR0YwfP55NmzbRsmVLoqKiOHXq1BWP8fPz4+TJk3nLkSNH8j3/3nvv8eGHHzJt2jTWr1+Pj48PUVFRZGZmOv+ORESkzDOZTNzZPISY57sy/Fb
HZaLf9pym+wcref+XPZzP1mWissbpS0IRERG0bduWjz/+GACbzUZYWBjPPPMMo0ePvmz/2bNnM2rUKJKSkgo8n91uJzQ0lOeff54XXngBgOTkZIKCgpg9ezb9+/e/Zk26JCQiUr4dOO24TLRqn+MyUfVK3ozr1YTuTYJ0mciFFdsloezsbDZu3EhkZOSlE5jNREZGsnbt2isel5aWRq1atQgLC+Pee+9lx44dec8dOnSI+Pj4fOf09/cnIiLiqucUERG5qF61inz+aDumDbyZ6pW8OZ50nif+vZHBn23gkC4TlQlOBZbExESsVitBQfnHvwcFBREfH1/gMY0aNWLWrFn8+OOPfPHFF9hsNjp06MCxY8cA8o5z5pxZWVmkpKTkW0REpHwzmUz0aBbCr9Fdefq2+nhYzKzYe5qoD1byj593k5Gda3SJcgOKfVhz+/btGTRoEK1ataJr164sWLCAatWq8emnnxb6nBMnTsTf3z9vCQsLK8KKRUSkNPP2sPBCVCN+fq4LXRtWI9tqY8pvB7hj0kqWbj9JGRgcWy45FVgCAgKwWCwkJCTk256QkEBwcPB1ncPd3Z3WrVuzf/9+gLzjnDnnmDFjSE5OzluOHj3qzNsQEZFyoE6AD7OHtOXTh8PzLhM9+cUmBs2K5eDpNKPLEyc5FVg8PDwIDw8nJiYmb5vNZiMmJob27dtf1zmsVivbtm0jJCQEgDp16hAcHJzvnCkpKaxfv/6K5/T09MTPzy/fIiIi8lcmk4mopsH8Gt2VZ2+vj4ebmVX7EomavJJ3l+oyUWni9CWh6OhoZsyYwZw5c9i1axfDhw8nPT2dIUOGADBo0CDGjBmTt/8bb7zBL7/8wsGDB9m0aRMDBw7kyJEjDB06FHD8Yxo1ahRvvfUWCxcuZNu2bQwaNIjQ0FB69+5dNO9SRETKNW8PC9HdG/HLqC7c1qgaOVY7U5c7LhMt33PlaTnEdbg5e0C/fv04ffo048aNIz4+nlatWrF06dK8TrNxcXGYzZdy0Llz5xg2bBjx8fFUrlyZ8PBw1qxZQ5MmTfL2eemll0hPT+fxxx8nKSmJTp06sXTp0ssmmBMREbkRtQN8mDW4Lb/uOsVrC3dwPOk8gz/bQO9WoYy9uwlVK3oaXaJcgabmFxGRcik9K5dJy/by2f8OYbND5QrujOvVhN6tqmvulhKimx+KiIhcg4+nG2PvbsL3T3WkcbAv5zJyeG7eFh75bANHz2YYXZ78hQKLiIiUay3DKvHTM514MaoRHm5mVu51TPH/r1UHsdpK/UWIMkOBRUREyj13i5kRt9Vn6cjORNSpwvkcK28t3kWfT/7HzhOanNQVKLCIiIhcULdaReYOu4V3+jTH18uNLceSuefj1by3dDeZObqhopEUWERERP7EbDbRv11NYqK70rNZMLk2O58sP0DPf65i3cEzRpdXbimwiIiIFCDQz4upA8P59OFwgvw8OZSYTv/p6xizYCvJ53OMLq/cUWARERG5iqimwSyL7spDETUBmBt7lMhJK1i6/aTBlZUvCiwiIiLX4Oflztv3NeebJ9pTt5oPp1OzePKLTTz++e/EJ2caXV65oMAiIiJyndrVqcKSZzvzzO31cTOb+GVnAndMWsGX649g0xDoYqXAIiIi4gQvdwvPd2/Eomc70SqsEqlZubz6/Xb6T1/HAd0FutgosIiIiBRC42A/vhvegfG9mlDBw0Ls4bP0nLyKj2L2kZ1rM7q8MkeBRUREpJAsZhNDOtbhl+e6cGujamRbbby/bC+9PlrNH3HnjC6vTFFgERERuUE1Klfgs8Ft+Wf/VlTx8WBPQip9pq7htYU7SM/KNbq8MkGBRUREpAiYTCbubVWdX6O70qd1dex2mL3mMN0/WMlve04ZXV6pp8AiIiJShKr4eDCpXys+f7QdNSp7czzpPEM+28DIr//gTFqW0eWVWgosIiIixaBLw2r88lwXhnaqg9kEP24+QeSkFSzYdAy7XUOgnaX
AIiIiUkwqeLjx97ub8MOIjtwU4se5jByiv9nCoFmxHD2bYXR5pYoCi4iISDFrUaMSC5/uyEs9GuHhZmbVvkS6f7CSf606SK5VQ6CvhwKLiIhICXC3mHnq1vr8PKoLt9StwvkcK28t3kWfqWvYeSLF6PJcngKLiIhICaoT4MPcYbfw7v3N8fVyY+uxZHp9vJp3l+4mM8dqdHkuS4FFRESkhJlMJvq1rUlMdFfubB6M1WZn6vID9Ji8kjUHEo0uzyUpsIiIiBgk0M+LTx4K59OHwwny8+TwmQz+NmM97y7djVU3U8xHgUVERMRgUU2DWRbdlb9F1ARg6vIDDP4slrPp2QZX5joUWERERFyAn5c7E+5rzocDWuPtbmHVvkR6fbSabceSjS7NJSiwiIiIuJB7Woby/YgO1K5ageNJ57l/2hq++f2o0WUZToFFRETExTQO9uPHpzvRrXEg2bk2Xpq/lVe/30ZWbvkdRaTAIiIi4oL8vd2ZMagN0Xc0xGSCL9fH0X/6OuKTM40uzRAKLCIiIi7KbDbxbLcGzBrcFj8vN/6IS+Luj1ax7uAZo0srcQosIiIiLu62RoH89EwnGgf7kpiWzUP/Ws+/Vh0sVzdRVGAREREpBWpV9eH7pzrSu1UoVpudtxbv4pm5f5CRnWt0aSVCgUVERKSU8Paw8EG/VrzWqwluZhOLtp7kvilrOJSYbnRpxU6BRUREpBQxmUwM7liHuY/fQjVfT/YkpHLPx6uJ2ZVgdGnFSoFFRESkFGpbuwqLn+lEm1qVSc3M5bE5vzNp2V5sZXRKfwUWERGRUirQz4uvht3CI+1rAfBhzD4em7OB5IwcgysregosIiIipZiHm5nX723GpAdb4ulm5rc9p+n18Wp2nkgxurQiVajAMmXKFGrXro2XlxcRERHExsZe13Fff/01JpOJ3r1759s+ePBgTCZTvqVHjx6FKU1ERKRc6nNzDb4b3oEalb2JO5tBn6n/4/s/jhldVpFxOrDMmzeP6Ohoxo8fz6ZNm2jZsiVRUVGcOnXqqscdPnyYF154gc6dOxf4fI8ePTh58mTeMnfuXGdLExERKdeaVfdn0TOd6NqwGpk5Np6bt4XXFu4gx2ozurQb5nRgmTRpEsOGDWPIkCE0adKEadOmUaFCBWbNmnXFY6xWKw899BCvv/46devWLXAfT09PgoOD85bKlSs7W5qIiEi5V6mCB7MGt+XZ2+sDMHvNYf42Yx2nUkr3lP5OBZbs7Gw2btxIZGTkpROYzURGRrJ27dorHvfGG28QGBjIY489dsV9li9fTmBgII0aNWL48OGcOXPlaYezsrJISUnJt4iIiIiDxWwiunsjZgxqg6+nGxsOn+Puj1az8chZo0srNKcCS2JiIlarlaCgoHzbg4KCiI+PL/CY1atXM3PmTGbMmHHF8/bo0YPPP/+cmJgY3n33XVasWEHPnj2xWgu+K+XEiRPx9/fPW8LCwpx5GyIiIuXCHU2C+PHpjjQMqsip1Cz6fbqOz9ceLpVT+hfrKKHU1FQefvhhZsyYQUBAwBX369+/P/fccw/Nmzend+/eLFq0iA0bNrB8+fIC9x8zZgzJycl5y9GjR4vpHYiIiJRudatV5PunOnJXixBybXbG/biD57/ZwvnsghsFXJWbMzsHBARgsVhISMg/m15CQgLBwcGX7X/gwAEOHz5Mr1698rbZbI6OP25ubuzZs4d69epddlzdunUJCAhg//79dOvW7bLnPT098fT0dKZ0ERGRcsvH042PB7SmVY1KvLN0Nwv+OM6u+FQ+HRhOzaoVjC7vujjVwuLh4UF4eDgxMTF522w2GzExMbRv3/6y/Rs3bsy2bdvYvHlz3nLPPfdw2223sXnz5iteyjl27BhnzpwhJCTEybcjIiIiBTGZTAzrUpcvHougqo8Hu06m0Ovj1Szfc/VRvq7C6UtC0dHRzJgxgzlz5rBr1y6GDx9Oeno6Q4YMAWDQoEGMGTMGAC8vL5o1a5Z
vqVSpEr6+vjRr1gwPDw/S0tJ48cUXWbduHYcPHyYmJoZ7772X+vXrExUVVbTvVkREpJxrX68qi57tRMuwSiSfz2HI7A18FLPP5af0d+qSEEC/fv04ffo048aNIz4+nlatWrF06dK8jrhxcXGYzdefgywWC1u3bmXOnDkkJSURGhpK9+7defPNN3XZR0REpBiE+HvzzRO38PpPO/lqfRzvL9vLlmPJTOrXEj8vd6PLK5DJXhq7Cv9FSkoK/v7+JCcn4+fnZ3Q5IiIipca8DXGM/XEH2bk26gT4MG1gOI2CfUvktZ35/ta9hERERMqxfm1rMv/J9oT6e3EoMZ37Pvkfi7aeMLqsyyiwiIiIlHMtalTip2c60bF+VTKyrTz91R+8vXgnuS40pb8Ci4iIiFC1oidzhrTjya6O6UZmrDrEwJnrSUzLMrgyBwUWERERAcDNYmZ0z8ZMfehmfDwsrDt4ll4freaPuHNGl6bAIiIiIvn1bB7Cj093pG41H04mZ9Lv03XMjY0ztCYFFhEREblM/UBffhzRkaimQWRbbfz9h+3sP5VmWD1Oz8MiIiIi5YOvlzvTBoYzdcUBPCxm6gdWNKwWBRYRERG5IpPJxFO31je6DF0SEhEREdenwCIiIiIuT4FFREREXJ4Ci4iIiLg8BRYRERFxeQosIiIi4vIUWERERMTlKbCIiIiIy1NgEREREZenwCIiIiIuT4FFREREXJ4Ci4iIiLg8BRYRERFxeWXibs12ux2AlJQUgysRERGR63Xxe/vi9/jVlInAkpqaCkBYWJjBlYiIiIizUlNT8ff3v+o+Jvv1xBoXZ7PZOHHiBL6+vphMpiI9d0pKCmFhYRw9ehQ/P78iPXdZo8/q+umzun76rJyjz+v66bO6fsX1WdntdlJTUwkNDcVsvnovlTLRwmI2m6lRo0axvoafn5/+QV8nfVbXT5/V9dNn5Rx9XtdPn9X1K47P6lotKxep062IiIi4PAUWERERcXkKLNfg6enJ+PHj8fT0NLoUl6fP6vrps7p++qyco8/r+umzun6u8FmViU63IiIiUraphUVERERcngKLiIiIuDwFFhEREXF5CiwiIiLi8hRYrmHKlCnUrl0bLy8vIiIiiI2NNboklzNx4kTatm2Lr68vgYGB9O7dmz179hhdVqnwzjvvYDKZGDVqlNGluKTjx48zcOBAqlatire3N82bN+f33383uiyXY7VaGTt2LHXq1MHb25t69erx5ptvXtf9WcqDlStX0qtXL0JDQzGZTPzwww/5nrfb7YwbN46QkBC8vb2JjIxk3759xhRrsKt9Vjk5Obz88ss0b94cHx8fQkNDGTRoECdOnCiR2hRYrmLevHlER0czfvx4Nm3aRMuWLYmKiuLUqVNGl+ZSVqxYwYgRI1i3bh3Lli0jJyeH7t27k56ebnRpLm3Dhg18+umntGjRwuhSXNK5c+fo2LEj7u7u/Oc//2Hnzp28//77VK5c2ejSXM67777L1KlT+fjjj9m1axfvvvsu7733Hh999JHRpbmE9PR0WrZsyZQpUwp8/r333uPDDz9k2rRprF+/Hh8fH6KiosjMzCzhSo13tc8qIyODTZs2MXbsWDZt2sSCBQvYs2cP99xzT8kUZ5crateunX3EiBF561ar1R4aGmqfOHGigVW5vlOnTtkB+4oVK4wuxWWlpqbaGzRoYF+2bJm9a9eu9pEjRxpdkst5+eWX7Z06dTK6jFLhrrvusj/66KP5tvXp08f+0EMPGVSR6wLs33//fd66zWazBwcH2//xj3/kbUtKSrJ7enra586da0CFruOvn1VBYmNj7YD9yJEjxV6PWliuIDs7m40bNxIZGZm3zWw2ExkZydq1aw2szPUlJycDUKVKFYMrcV0jRozgrrvuyvfvS/JbuHAhbdq0oW/fvgQGBtK6dWtmzJhhdFkuqUOHDsTExLB3714AtmzZwurVq+nZs6fBlbm+Q4cOER8fn+//RX9/fyIiIvS
7/jokJydjMpmoVKlSsb9Wmbj5YXFITEzEarUSFBSUb3tQUBC7d+82qCrXZ7PZGDVqFB07dqRZs2ZGl+OSvv76azZt2sSGDRuMLsWlHTx4kKlTpxIdHc0rr7zChg0bePbZZ/Hw8OCRRx4xujyXMnr0aFJSUmjcuDEWiwWr1crbb7/NQw89ZHRpLi8+Ph6gwN/1F5+TgmVmZvLyyy8zYMCAErl5pAKLFKkRI0awfft2Vq9ebXQpLuno0aOMHDmSZcuW4eXlZXQ5Ls1ms9GmTRsmTJgAQOvWrdm+fTvTpk1TYPmLb775hi+//JKvvvqKpk2bsnnzZkaNGkVoaKg+KykWOTk5PPjgg9jtdqZOnVoir6lLQlcQEBCAxWIhISEh3/aEhASCg4MNqsq1Pf300yxatIjffvuNGjVqGF2OS9q4cSOnTp3i5ptvxs3NDTc3N1asWMGHH36Im5sbVqvV6BJdRkhICE2aNMm37aabbiIuLs6gilzXiy++yOjRo+nfvz/Nmzfn4Ycf5rnnnmPixIlGl+byLv4+1+/663cxrBw5coRly5aVSOsKKLBckYeHB+Hh4cTExORts9lsxMTE0L59ewMrcz12u52nn36a77//nv/+97/UqVPH6JJcVrdu3di2bRubN2/OW9q0acNDDz3E5s2bsVgsRpfoMjp27HjZ8Pi9e/dSq1YtgypyXRkZGZjN+X+dWywWbDabQRWVHnXq1CE4ODjf7/qUlBTWr1+v3/UFuBhW9u3bx6+//krVqlVL7LV1SegqoqOjeeSRR2jTpg3t2rVj8uTJpKenM2TIEKNLcykjRozgq6++4scff8TX1zfvuq+/vz/e3t4GV+dafH19L+vb4+PjQ9WqVdXn5y+ee+45OnTowIQJE3jwwQeJjY1l+vTpTJ8+3ejSXE6vXr14++23qVmzJk2bNuWPP/5g0qRJPProo0aX5hLS0tLYv39/3vqhQ4fYvHkzVapUoWbNmowaNYq33nqLBg0aUKdOHcaOHUtoaCi9e/c2rmiDXO2zCgkJ4YEHHmDTpk0sWrQIq9Wa9/u+SpUqeHh4FG9xxT4OqZT76KOP7DVr1rR7eHjY27VrZ1+3bp3RJbkcoMDls88+M7q0UkHDmq/sp59+sjdr1szu6elpb9y4sX369OlGl+SSUlJS7CNHjrTXrFnT7uXlZa9bt6791VdftWdlZRldmkv47bffCvwd9cgjj9jtdsfQ5rFjx9qDgoLsnp6e9m7dutn37NljbNEGudpndejQoSv+vv/tt9+KvTaT3a6pEEVERMS1qQ+LiIiIuDwFFhEREXF5CiwiIiLi8hRYRERExOUpsIiIiIjLU2ARERERl6fAIiIiIi5PgUVERERcngKLiIiIuDwFFhEREXF5CiwiIiLi8hRYRERExOX9P5nLieM5ShBxAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist11.history['loss'])\n", "plt.plot(hist11.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 154, "id": "47febf06-95ac-4b06-ba56-2ad47ff1a3ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 2s 156ms/step - loss: 0.5958 - accuracy: 0.7121\n" ] } ], "source": [ "loss, accuracy = model11.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "c2a5d3c4-3c49-4be9-8ded-23687e564b2b", "metadata": {}, "source": [ "As we can see with this learning rate we do not achieve better performance then with its default value, also there are many hills sygnifying problems with using this learning rate. The loss function at first sees not much improvement to then dramatically fast decrease. That is not an appriciated effect." ] }, { "cell_type": "markdown", "id": "7dfec583-2f2d-405a-8d7f-c3b79316af76", "metadata": {}, "source": [ "### MODEL 12" ] }, { "cell_type": "markdown", "id": "4e05a9cb-1e97-4acb-91b1-abd3199581f5", "metadata": {}, "source": [ "* For the last model we try different output representation values of 200 and 300, only to see worse performace.\n", "* Accuracy for 300: 67%" ] }, { "cell_type": "code", "execution_count": 161, "id": "ebbb2631-96b3-4dae-a224-98eb30dd7b51", "metadata": {}, "outputs": [], "source": [ "adamax_opt = Adamax(learning_rate = 0.001)" ] }, { "cell_type": "code", "execution_count": 162, "id": "0bd3e79e-d0db-4cea-b558-79d086e6059f", "metadata": {}, "outputs": [], "source": [ "# We are going to build our model with the Sequential API\n", "model12 = Sequential()\n", "\n", "model12.add(Embedding(total_words, # number of words to process as input\n", " 300, # output representation\n", " input_length=len(padded_sequences[0]))) # total length of each observation\n", "\n", "#model12.add(LSTM(100, return_sequences=False))\n", "model12.add(Bidirectional(LSTM(100, 
return_sequences=False)))\n", "\n", "model12.add(Dropout(0.2))\n", "\n", "model12.add(Dense(1, activation='sigmoid')) \n", "\n", "model12.compile(optimizer= adamax_opt, loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 163, "id": "eb13fc7a-bfaf-4c54-871c-6f2a09a3417b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_21\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_21 (Embedding) (None, 404, 300) 6280800 \n", " \n", " bidirectional_18 (Bidirect (None, 200) 320800 \n", " ional) \n", " \n", " dropout_24 (Dropout) (None, 200) 0 \n", " \n", " dense_21 (Dense) (None, 1) 201 \n", " \n", "=================================================================\n", "Total params: 6601801 (25.18 MB)\n", "Trainable params: 6601801 (25.18 MB)\n", "Non-trainable params: 0 (0.00 Byte)\n", "_________________________________________________________________\n" ] } ], "source": [ "model12.summary()" ] }, { "cell_type": "code", "execution_count": 164, "id": "86555b73-9adc-4e74-a704-9313baa5b438", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/20\n", "57/57 [==============================] - 21s 318ms/step - loss: 0.6926 - accuracy: 0.5388 - val_loss: 0.6801 - val_accuracy: 0.6813\n", "Epoch 2/20\n", "57/57 [==============================] - 17s 295ms/step - loss: 0.6032 - accuracy: 0.7160 - val_loss: 0.5951 - val_accuracy: 0.6791\n", "Epoch 3/20\n", "57/57 [==============================] - 17s 303ms/step - loss: 0.4948 - accuracy: 0.7887 - val_loss: 0.6057 - val_accuracy: 0.6901\n", "Epoch 4/20\n", "57/57 [==============================] - 17s 299ms/step - loss: 0.4176 - accuracy: 0.8459 - val_loss: 0.6326 - val_accuracy: 0.6681\n", "Epoch 5/20\n", "57/57 [==============================] - 
17s 303ms/step - loss: 0.2975 - accuracy: 0.8943 - val_loss: 0.6705 - val_accuracy: 0.6879\n" ] } ], "source": [ "hist12 = model12.fit(X_train, y_train, epochs=20, validation_data = (X_val, y_val), callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": 165, "id": "f20409c0-dd4e-4f79-96bb-8378eb3f0424", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(hist12.history['loss'])\n", "plt.plot(hist12.history['accuracy'])" ] }, { "cell_type": "code", "execution_count": 166, "id": "5322ca98-8d8c-421e-ab81-20888f93c2c3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15/15 [==============================] - 1s 79ms/step - loss: 0.5951 - accuracy: 0.6791\n" ] } ], "source": [ "loss, accuracy = model12.evaluate(X_val, y_val)" ] }, { "cell_type": "markdown", "id": "955a0345-dc3d-4bde-b50b-bacf387a36e0", "metadata": {}, "source": [ "Link to an excel SpreadSheet containing model performance, also with No 50-50 split and title only. Word2vec proved to work similar, depending on the model, 2-3% better or arpund the same worse, therefore we decided to leave it out and use lemmatization.\n", "\n", "https://docs.google.com/spreadsheets/d/1Vcnnh5MvkoVpfSyF4jWzxz93QvzuRA8cQcuvnS_GXrk/edit#gid=1601168295" ] }, { "cell_type": "markdown", "id": "39fba4b6-2dd2-42b2-b8fe-3766114476d0", "metadata": {}, "source": [ "# Testing Our Best Performing Model" ] }, { "cell_type": "markdown", "id": "7df44021-f863-4676-b120-bf7a4e1f1f68", "metadata": {}, "source": [ "As shown above the 5th model proves to be the best performer, we will now proceed to the testing process, where we first check its performance of the test set to then predict the test results, and create a confusion matrix resembling its predictive performance. The threshold is yet again set to 0.5." 
] }, { "cell_type": "code", "execution_count": 167, "id": "aa397c66-19fb-41e7-a5c5-c6f85e5abc57", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/18 [==============================] - 3s 153ms/step - loss: 0.5891 - accuracy: 0.6937\n" ] } ], "source": [ "loss, accuracy = model5.evaluate(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 169, "id": "20890c30-3f0c-473f-b70f-cb3387579fb0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/18 [==============================] - 7s 148ms/step\n" ] } ], "source": [ "# Prediction and confusion matrix\n", "y_pred = model5.predict(X_test)\n", "bin_y_pred = (y_pred > 0.5).astype(int)" ] }, { "cell_type": "code", "execution_count": 170, "id": "39b983f0-3458-41ba-8562-24e78383142d", "metadata": {}, "outputs": [], "source": [ "bin_y_pred = np.squeeze(bin_y_pred)" ] }, { "cell_type": "code", "execution_count": 171, "id": "e4c5d136-3d81-4122-ad79-a2463e629f42", "metadata": {}, "outputs": [], "source": [ "y_true = y_test\n", "y_pred = bin_y_pred" ] }, { "cell_type": "code", "execution_count": 189, "id": "f085f05e-2191-4c79-9453-63d8716a7ca2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Set Results:\n", "\n", " Predicted No Predicted Yes \n", "Actual No 194 88 \n", "Actual Yes 86 200 \n", "\n", "Precision: 0.6944\n", "Recall: 0.6993\n", "Accuracy: 0.6937\n" ] } ], "source": [ "cm = confusion_matrix(y_true, y_pred)\n", "\n", "# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order\n", "TN, FP, FN, TP = cm.ravel()\n", "\n", "print('Test Set Results:\\n')\n", "\n", "print(f\"{'':<20}{'Predicted No':<20}{'Predicted Yes':<20}\")\n", "print(f\"{'Actual No':<20}{TN:<20}{FP:<20}\")\n", "print(f\"{'Actual Yes':<20}{FN:<20}{TP:<20}\")\n", "\n", "print(\"\\nPrecision:\", round(TP/(TP + FP), 4))\n", "print(\"Recall:\", round(TP/(TP + FN), 4))\n", "print(\"Accuracy:\", round((TP+TN)/(TP + TN + FP + FN), 4))" ] }, { "cell_type": "markdown", "id": 
"f23234ee-6f8c-41c1-82ac-20b0918838b0", "metadata": {}, "source": [ "In the results above we can see a much more imporved model, with better metrics when compared to the base model. Nearly 70% accuracy, recall and precision. Makes this a well performing model. Especially when compared to others we checked. The test accuracy does not fall out of bounds with the validation result, signaling the validity of the performance of our model outside of the train set, and a good balance of each cases in each of the splits made. " ] }, { "cell_type": "markdown", "id": "6cf17064-b36e-4981-bd7e-18ea19aca27b", "metadata": {}, "source": [ "# Custom Text Import And Prediction" ] }, { "cell_type": "markdown", "id": "413a2be1-d873-4572-855c-c36f030a777c", "metadata": {}, "source": [ "To prove the performance of the model and simply to check its real life purpose, we have checked real recent articles from Bloomberg and CNBC, in both classes of relevancy. First, we created a function in which 3 arguments are provided - the headline string, the text string and the length of padded sequences to set a max article length of 404 words our model can intake. \n", "* The strings are combined into a single string, then the cleaning process is done, along with removing the stopwords, and lemmatization.\n", "* An error will be brought out if the length of the given article is > 404 words or the padded sequence length.\n", "* If however this is not the case we proceed into turning the text into sequences and padding them given the max length given as an argument.\n", "* As the last step a prediction is made and later classified as Economic or Non Economic depending if the value is smaller or bigger then the threshold of 0.5.\n", "* The demonstration of the function in work with real articles is given below. 
" ] }, { "cell_type": "code", "execution_count": 250, "id": "95f9a369-decb-4663-85b0-f9e266d83bff", "metadata": {}, "outputs": [], "source": [ "def is_txt_econ(headline, text, max_l):\n", " whole_txt = headline + ' ' + text\n", " \n", " # Taking out '
' in the 'whole_text' column\n", " whole_txt = re.sub(r'', ' ', whole_txt)\n", " # Deletion of non-latin alfabet signs, also numbers\n", " whole_txt = re.sub(r'[^a-zA-Z]', ' ', whole_txt)\n", " # Removing single letter works like 'a'.\n", " whole_txt = re.sub(r\"\\s+[a-zA-Z]\\s+\", ' ', whole_txt)\n", " # Removing double spaces\n", " whole_txt = re.sub(r'\\s+', ' ', whole_txt)\n", " # Lower case\n", " whole_txt = whole_txt.lower()\n", " whole_txt = word_tokenize(whole_txt)\n", " whole_txt = [word for word in whole_txt if word not in stop_words]\n", " whole_txt = [lemmatizer.lemmatize(word) for word in whole_txt]\n", "\n", " if len(whole_txt) > max_l:\n", " print('ERROR, Article lenght must be < 404')\n", " else:\n", " sequences = tokenizer.texts_to_sequences([whole_txt])\n", " padded_sequences = pad_sequences(sequences, maxlen=max_l)\n", " \n", " predictions = model5.predict(padded_sequences)\n", " if predictions < 0.5:\n", " return(predictions, 'Non Economic')\n", " else:\n", " return(predictions, 'Economic')" ] }, { "cell_type": "code", "execution_count": 261, "id": "b57e9bed-7e5b-48f2-b1af-b6e4f04f6da3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 118ms/step\n" ] }, { "data": { "text/plain": [ "(array([[0.22238688]], dtype=float32), 'Non Economic')" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Bloomberg Article - Economic Relevance, debatable, about changing structures inside the Swiss bank UBS.\n", "# https://www.bloomberg.com/news/articles/2023-12-03/ubs-s-ermotti-to-find-potential-successor-within-three-years?srnd=premium-europe\n", "is_txt_econ('UBS’s Ermotti to Find Potential Successor Within Three Years', 'Sergio Ermotti, chief executive officer of UBS Group AG, says part of his mandate for the next three years is to identify potential successors, he told the Swiss media outlet Bilanz in a television interview. 
We need to have candidates that we can assess in the next few years; and it is part of my job to present an array of candidates to the board, Ermotti told the broadcaster. UBS Chairman Colm Kelleher said last week that he, along with the bank’s board, was looking to develop a shortlist of potential successors for 63-year-old Ermotti. Read more: Kelleher Says UBS Could Use Morgan Stanley’s CEO Race Playbook An internal successor would be ideal, Ermotti said in Sunday’s interview, while not ruling out an external candidate. When the bank is successful, it is better to have someone who knows the internal mechanisms, he added. UBS brought back Ermotti as CEO in April to oversee the government-brokered rescue of its smaller competitor Credit Suisse. Ermotti previously ran UBS from 2011 to 2020.', len(padded_sequences[0]))" ] }, { "cell_type": "code", "execution_count": 262, "id": "d74b55cb-9b24-4f31-91b9-84c270299266", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 119ms/step\n" ] }, { "data": { "text/plain": [ "(array([[0.14373618]], dtype=float32), 'Non Economic')" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CNBC - Surely Non Economic, regarding a terrorist attack in Paris\n", "# https://www.cnbc.com/2023/12/03/one-dead-two-injured-after-tourists-attacked-near-paris-eiffel-tower.html\n", "is_txt_econ('One dead, two injured after man attacks tourists near Paris Eiffel Tower', 'One person died and two others were injured after a man attacked tourists in central Paris near the Eiffel Tower, Interior Minister Gerald Darmanin said on Saturday. Police quickly arrested the 26-year-old man, a French national, using a Taser stun gun, Darmanin told reporters. 
The suspect had been sentenced to four years in prison in 2016 for planning another attack and was on the French security services watch list, and was also known for having psychiatric disorders, the interior minister added.The attack took place around 1900 GMT when the man attacked a tourist couple with a knife on the Quai de Grenelle, a few feet away from the Eiffel Tower, mortally wounding a German national. He was then chased by police and attacked two other people with a hammer before being arrested. The suspect had shouted out Allahu Akbar (God is greatest) and told police he was upset because so many Muslims are dying in Afghanistan and in Palestine and was also upset about the Gaza situation, Darmanin said. The anti-terrorism prosecutors office said it was in charge of the investigation. Saturday nights incident in central Paris occurred less than eight months before the French capital hosts the Olympic Games and could raise questions about security at the global sporting event. Paris plans an unprecedented opening ceremony on the Seine river that may draw as many as 600,000 spectators.', len(padded_sequences[0]))" ] }, { "cell_type": "code", "execution_count": 263, "id": "a212274f-e72a-41aa-8aa4-7629e17f5627", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 100ms/step\n" ] }, { "data": { "text/plain": [ "(array([[0.5134268]], dtype=float32), 'Economic')" ] }, "execution_count": 263, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CNBC, article about rising gold prices due to FED policies.\n", "# https://www.cnbc.com/2023/12/01/gold-set-for-3rd-weekly-gain-as-cooler-data-cements-fed-cut-bets.html\n", "is_txt_econ('Gold hits record high on bets for March start to Fed rate cuts', 'Gold prices rallied to an all-time high on Friday after remarks from Federal Reserve Chair Jerome Powell increased traders confidence the U.S. 
central bank had completed its monetary policy tightening and could cut rates starting March. Spot gold climbed 1.6% to $2,069.10 per ounce. Prices were 3.4% higher on the week, and earlier rose to $2,075.09 per ounce to beat the previous all-time high of $2,072.49 reached in 2020. U.S. gold futures also settled 1.6% higher at a record peak of $2,089.7. Those records, however, are in nominal terms only. On an inflation-adjusted basis, accounting for the depreciation of the dollar and the effect of higher prices, gold’s all-time was reached in early 1980 at what today would equal $3,452.40 an ounce. Speaking at Spelman College in Atlanta, Powell said the risks of under- and over-tightening are becoming more balanced, but the Fed is not thinking about lowering rates right now. Gold bulls are focusing on Powells comment that [the current] rate is well into restrictive territory, which plays into the narrative that cuts will come sooner, pointedly ignoring his warning that it was premature to speculate on easing rates, said Tai Wong, a New York-based independent metals trader.', len(padded_sequences[0]))" ] }, { "cell_type": "code", "execution_count": 264, "id": "ee1b3eee-f6c6-4d9f-8bd4-169a7a608aa6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 103ms/step\n" ] }, { "data": { "text/plain": [ "(array([[0.7161325]], dtype=float32), 'Economic')" ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CNBC Oil Prices Article\n", "# https://www.cnbc.com/2023/12/01/oil-prices-set-to-rise-in-2024-after-opec-voluntary-cuts.html\n", "is_txt_econ('Oil prices could reach $100 a barrel in 2024 if OPEC+ members fulfil pledges for voluntary cuts', 'Oil prices are expected to rise in the new year after some OPEC+ oil producers voluntarily pledged to cut output. 
The oil cartel on Thursday released a statement that did not formally endorse production cuts, but individual countries announced voluntary reductions totaling 2.2 million barrels per day for the first quarter of 2024. Leading the cuts is OPEC kingpin and largest member Saudi Arabia. Riyadh agreed to extend its voluntary production cut of 1 million barrels per day — which has been in place since July — until the end of the first quarter of 2024. Russia said it will cut supply by 300,000 barrels per day of crude and 200,000 barrels per day of petroleum products over the same period. Iraq is cutting by 223,000 bpd, the United Arab Emirates by 163,000 bpd, Kuwait by 135,000 bpd, Kazakhstan by 82,000 bpd, Algeria by 51,000 bpd and Oman by 42,000 bpd. Compliance is key. It cant just be Saudi Arabia. We have to have compliance from the other OPEC nations, Bill Perkins, CEO and head trader of Skylar Capital Management, told CNBC. When these other nations say theyre going to cut, the market doesnt trust it as much, he added.', len(padded_sequences[0]))" ] }, { "cell_type": "markdown", "id": "c2bac6aa-8c7d-4727-bb5e-a99d499878a5", "metadata": {}, "source": [ "As the examples above show, the model classifies most of the test articles correctly. However, they also expose some limitations of this approach.\n", "* Even human readers can disagree on whether a given article counts as economic news, so borderline content is inherently hard for a model to classify.\n", "* Words that were not in the dictionary built before training are simply ignored, so their context is not taken into account.\n", "\n", "Recommendations for future work:\n", "* Collect more data, especially more economic news: we only had 1420 cases of economic news out of 8000.\n", "* A larger dataset would also allow for a better glossary, and therefore likely better predictions.\n", "* Take the news brand into account: our data came only from the Wall Street Journal and the Washington Post. Some brands are strictly financial, while others are not. 
" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" } }, "nbformat": 4, "nbformat_minor": 5 }