|
--- |
|
language: en |
|
tags: |
|
- Recommendation |
|
license: apache-2.0 |
|
datasets: |
|
- surprise |
|
- numpy |
|
- keras |
|
- pandas |
|
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png |
|
--- |
|
|
|
![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png) |
|
|
|
|
|
# MCTI Text Classification Task (uncased) DRAFT |
|
|
|
Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project. |
|
|
|
The model [NLP MCTI Recommendation Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/nlp-mcti-lda-recommender) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) and focuses on the task of recommendation, exploring different machine learning strategies that suggest items likely to be useful to a particular individual. Several methods were compared against each other in terms of their error estimates. A simulated dataset was created using an LDA model.
|
|
|
## According to the abstract, |
|
|
|
XXXXX |
|
["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318). |
|
|
|
## Model description |
|
|
|
The Surprise library provides 11 prediction algorithms that estimate ratings based on several different collaborative-filtering techniques.
The models are listed below with a brief explanation in English; for more information, please refer to the package [documentation](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).
|
|
|
- `random_pred.NormalPredictor`: Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
- `baseline_only.BaselineOnly`: Algorithm predicting the baseline estimate for given user and item.
- `knns.KNNBasic`: A basic collaborative filtering algorithm.
- `knns.KNNWithMeans`: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
- `knns.KNNWithZScore`: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
- `knns.KNNBaseline`: A basic collaborative filtering algorithm taking into account a baseline rating.
- `matrix_factorization.SVD`: The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
- `matrix_factorization.SVDpp`: The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
- `matrix_factorization.NMF`: A collaborative filtering algorithm based on Non-negative Matrix Factorization.
- `slope_one.SlopeOne`: A simple yet accurate collaborative filtering algorithm.
- `co_clustering.CoClustering`: A collaborative filtering algorithm based on co-clustering.
|
|
|
Every model was trained and evaluated; when compared against each other, the different methods produced different error estimates (see the Benchmarks section below).
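For illustration, the sketch below shows how such a comparison can be run with Surprise's `cross_validate` helper. The toy ratings dataframe is made up for the example and is not the project data, and the short algorithm list stands in for the full set of eleven models.

```python
import pandas as pd
from surprise import Dataset, Reader, NormalPredictor, KNNBasic, SVD
from surprise.model_selection import cross_validate

# Toy ratings in the (userID, itemID, rating) layout described below.
ratings = pd.DataFrame({
    'userID': [1, 1, 2, 2, 3, 3, 4, 4],
    'itemID': [10, 11, 10, 12, 11, 12, 10, 11],
    'rating': [4, 3, 5, 2, 4, 1, 3, 5],
})
data = Dataset.load_from_df(ratings[['userID', 'itemID', 'rating']], Reader(rating_scale=(1, 5)))

# Cross-validate a few of the algorithms and compare their error estimates.
for algo in (NormalPredictor(), KNNBasic(), SVD()):
    scores = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False)
    print(f"{type(algo).__name__}: RMSE={scores['test_rmse'].mean():.3f}  MAE={scores['test_mae'].mean():.3f}")
```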
|
|
|
## Intended uses |
|
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at models like XXX.
|
### How to use |
|
The dataset for collaborative filtering must be a dataframe containing the ratings, with three columns corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
|
```python
import pandas as pd
import numpy as np


class Data:
```
|
The databases ml_100k, ml_1m and jester are built into the Surprise package for collaborative filtering:
|
```python
    def __init__(self):
        self.available_databases = ['ml_100k', 'ml_1m', 'jester', 'lda_topics', 'lda_rankings', 'uniform']

    def show_available_databases(self):
        print('The available databases are:')
        for i, database in enumerate(self.available_databases):
            print(str(i) + ': ' + database)

    def read_data(self, database_name):
        self.database_name = database_name
        self.the_data_reader = getattr(self, 'read_' + database_name.lower())
        self.the_data_reader()

    def read_ml_100k(self):
        from surprise import Dataset
        data = Dataset.load_builtin('ml-100k')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id', 'item_id', 'rating', 'timestamp'])
        self.df.drop(columns=['timestamp'], inplace=True)
        self.df.rename({'user_id': 'userID', 'item_id': 'itemID'}, axis=1, inplace=True)

    def read_ml_1m(self):
        from surprise import Dataset
        data = Dataset.load_builtin('ml-1m')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id', 'item_id', 'rating', 'timestamp'])
        self.df.drop(columns=['timestamp'], inplace=True)
        self.df.rename({'user_id': 'userID', 'item_id': 'itemID'}, axis=1, inplace=True)

    def read_jester(self):
        from surprise import Dataset
        data = Dataset.load_builtin('jester')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id', 'item_id', 'rating', 'timestamp'])
        self.df.drop(columns=['timestamp'], inplace=True)
        self.df.rename({'user_id': 'userID', 'item_id': 'itemID'}, axis=1, inplace=True)
```
|
|
|
Hyperparameters:

- `n_users`: number of simulated users in the database;
- `n_ratings`: number of simulated rating events in the database.
|
|
|
This is a fictional dataset based on a uniformly distributed random rating (from 1 to 5) assigned by one of the simulated users of the recommender system being designed in this research project.
|
```python
    def read_uniform(self):
        n_users = 20
        n_ratings = 10000

        import random

        opo = pd.read_csv('../oportunidades.csv')
        # ratings are drawn uniformly from 1 to 5, as described above
        df = [(random.randrange(n_users), random.randrange(len(opo)), random.randint(1, 5)) for i in range(n_ratings)]
        self.df = pd.DataFrame(df, columns=['userID', 'itemID', 'rating'])
```
|
|
|
Hyperparameters:

- `n_users`: number of simulated users in the database;
- `n_ratings`: number of simulated rating events in the database.
|
|
|
This first LDA-based dataset builds a model with K = `n_users` topics. LDA topics are used as proxies for simulated users with different clusters of interest. First, a random opportunity is chosen; then the proportion of a randomly chosen topic within its description is multiplied by five, and the ceiling of this result is the rating that the fictional user gives to that opportunity.
Because the topic proportions predicted by the model are spread across many topics, it is rare for an opportunity to have a high value for any single topic. As a consequence, this dataset has very low volatility and most of the ratings are equal to 1.
|
```python
    def read_lda_topics(self):
        n_users = 20
        n_ratings = 10000

        import gensim
        import random
        import math

        opo = pd.read_csv('../oportunidades_results.csv')
        # opo = opo.iloc[np.where(opo['opo_brazil']=='Y')]

        try:
            lda_model = gensim.models.ldamodel.LdaModel.load(f'models/lda_model{n_users}.model')
        except FileNotFoundError:
            import generate_users
            generate_users.gen_model(n_users)
            lda_model = gensim.models.ldamodel.LdaModel.load(f'models/lda_model{n_users}.model')

        df = []
        for i in range(n_ratings):
            opo_n = random.randrange(len(opo))
            txt = opo.loc[opo_n, 'opo_texto']
            opo_bow = lda_model.id2word.doc2bow(txt.split())
            topics = lda_model.get_document_topics(opo_bow)
            topics = {topic[0]: topic[1] for topic in topics}
            # one of the document's topics plays the role of the simulated user
            user = random.choice(list(topics.keys()))
            rating = math.ceil(topics[user] * 5)
            df.append((user, opo_n, rating))

        self.df = pd.DataFrame(df, columns=['userID', 'itemID', 'rating'])

    def read_lda_rankings(self):
        n_users = 9
        n_ratings = 1000

        import gensim
        import random
        import math
        import tqdm

        opo = pd.read_csv('../oportunidades.csv')
        opo = opo.iloc[np.where(opo['opo_brazil'] == 'Y')]
        opo.index = range(len(opo))

        path = f'models/output_linkedin_cle_lda_model_{n_users}_topics_symmetric_alpha_auto_beta'
        lda_model = gensim.models.ldamodel.LdaModel.load(path)

        df = []

        pbar = tqdm.tqdm(total=n_ratings)
        for i in range(n_ratings):
            opo_n = random.randrange(len(opo))
            txt = opo.loc[opo_n, 'opo_texto']
            opo_bow = lda_model.id2word.doc2bow(txt.split())
            topics = lda_model.get_document_topics(opo_bow)
            topics = {topic[0]: topic[1] for topic in topics}

            # rank the document's topics and rescale the rank to a 1-5 rating
            prop = pd.DataFrame([topics], index=['prop']).T.sort_values('prop', ascending=True)
            prop['rating'] = range(1, len(prop) + 1)
            prop['rating'] = prop['rating'] / len(prop)
            prop['rating'] = prop['rating'].apply(lambda x: math.ceil(x * 5))
            prop.reset_index(inplace=True)

            prop = prop.sample(1)

            df.append((prop['index'].values[0], opo_n, prop['rating'].values[0]))
            pbar.update(1)

        pbar.close()
        self.df = pd.DataFrame(df, columns=['userID', 'itemID', 'rating'])
```
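A minimal usage sketch of the `Data` helper assembled from the snippets above (assuming the methods are placed inside the `class Data:` body as shown):

```python
loader = Data()
loader.show_available_databases()   # lists ml_100k, ml_1m, jester, lda_topics, lda_rankings, uniform
loader.read_data('ml_100k')         # downloads the built-in dataset on first use
print(loader.df.head())             # dataframe with columns userID, itemID, rating
```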
|
|
|
### Limitations and bias |
|
In this model we faced some obstacles that we were able to overcome, but some of them, by the nature of the project, could not be fully solved.
Because the dataset was built by ourselves, there had not yet been any interaction between users and items, so we did not have realistic ratings and had to generate them through simulation, which makes the results less reliable.
Also in this part of the project, we used a database of scraped LinkedIn profiles.
The problem is that the profiles LinkedIn shows are biased: the profiles that appear tend to be geographically close to the user or related to the user's organization and e-mail.
|
|
|
## Training data |
|
To train the LDA model, we used a database of scraped LinkedIn profiles.
|
## Training procedure |
|
### Preprocessing |
|
Pre-processing was used to standardize the texts to English, reduce the number of insignificant tokens, and optimize the training of the models.

The following assumptions were considered:

- The data entry base is obtained from the result of Goal 4;
- The labeling (Goal 4) is considered true for accuracy measurement purposes;
- Pre-processing experiments compare accuracy in a shallow neural network (SNN);
- Pre-processing was investigated for the classification goal.
|
From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com) to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
|
Several Python packages were used to develop the preprocessing code: |
|
#### Table 3: Python packages used |
|
| Objective | Package | |
|
|--------------------------------------------------------|--------------| |
|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) | |
|
| Natural Language Processing | [nltk](https://pypi.org/project/nltk) | |
|
| Other data manipulations and calculations (together with the Python 3.10 built-ins io, json, math, re, shutil, time, unicodedata) | [numpy](https://pypi.org/project/numpy) |
|
| Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) | |
|
| HTTP library | [requests](https://pypi.org/project/requests) |
|
| Training model | [scikit-learn](https://pypi.org/project/scikit-learn) | |
|
| Machine learning | [tensorflow](https://pypi.org/project/tensorflow) | |
|
| Machine learning | [keras](https://keras.io/) | |
|
| Translation from multiple languages to English | [translators](https://pypi.org/project/translators) | |
|
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing step, code was created to build and evaluate 8 (eight) different bases, derived from the base of Goal 4, by applying the methods shown in Table 4.
|
#### Table 4: Preprocessing methods evaluated |
|
| id | Experiments | |
|
|--------|------------------------------------------------------------------------| |
|
| Base | Original Texts | |
|
| xp1 | Expand Contractions | |
|
| xp2 | Expand Contractions + Convert text to lowercase | |
|
| xp3 | Expand Contractions + Remove Punctuation | |
|
| xp4 | Expand Contractions + Remove Punctuation + Convert text to lowercase | |
|
| xp5 | xp4 + Stemming | |
|
| xp6 | xp4 + Lemmatization | |
|
| xp7 | xp4 + Stemming + Stopwords Removal | |
|
| xp8    | xp4 + Lemmatization + Stopwords Removal                                 |
|
First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and |
|
evaluation of the first four bases (xp1, xp2, xp3, xp4). |
|
Then, content simplification was evaluated, starting from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8); a sketch of such a pipeline is shown below.
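As an illustration only, the sketch below shows what an xp8-style pipeline (expand contractions, remove punctuation, lowercase, lemmatize, remove stopwords) can look like with the packages from Table 3; the project's actual implementation is in the preprocessing notebook linked above, and the helper name here is hypothetical.

```python
import re
import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

def preprocess_xp8(text):
    text = contractions.fix(text)                  # xp1: expand contractions
    text = re.sub(r'[^\w\s]', ' ', text).lower()   # xp3/xp4: remove punctuation, lowercase
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    return [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop_words]

print(preprocess_xp8("Grants can't fund overseas labs; apply before the deadline!"))
```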
|
All eight bases were evaluated on classifying the eligibility of the opportunity, by training a shallow neural network (SNN). The metrics for the eight bases were computed; the results are shown in Table 5.
|
#### Table 5: Results obtained in Preprocessing |
|
| id     | Experiment                                                              | accuracy | f1-score | recall | precision | Mean time (s) | N_tokens | max_length |
|--------|------------------------------------------------------------------------|----------|----------|--------|-----------|---------------|----------|------------|
| Base   | Original Texts                                                          | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772       | 23788    | 5636       |
| xp1    | Expand Contractions                                                     | 88.71%   | 81.59%   | 71.54% | 97.33%    | 414.715       | 23768    | 5636       |
| xp2    | Expand Contractions + Convert text to lowercase                         | 90.32%   | 85.64%   | 77.19% | 97.44%    | 368.375       | 20322    | 5629       |
| xp3    | Expand Contractions + Remove Punctuation                                | 91.94%   | 87.73%   | 79.66% | 98.72%    | 386.650       | 22121    | 4950       |
| xp4    | Expand Contractions + Remove Punctuation + Convert text to lowercase    | 90.86%   | 86.61%   | 80.85% | 94.25%    | 326.830       | 18616    | 4950       |
| xp5    | xp4 + Stemming                                                          | 91.94%   | 87.68%   | 78.47% | 100.00%   | 257.960       | 14319    | 4950       |
| xp6    | xp4 + Lemmatization                                                     | 89.78%   | 85.06%   | 79.66% | 91.87%    | 282.645       | 16194    | 4950       |
| xp7    | xp4 + Stemming + Stopwords Removal                                      | 92.47%   | 88.46%   | 79.66% | 100.00%   | 210.320       | 14212    | 2817       |
| xp8    | xp4 + Lemmatization + Stopwords Removal                                 | 92.47%   | 88.46%   | 79.66% | 100.00%   | 225.580       | 16081    | 2726       |
|
Even so, between these two excellent options, a choice still has to be made: xp7 has a shorter training time and fewer unique tokens, while xp8 has a smaller maximum sequence length. In this case, the criterion used for the choice was the computational cost required to train the vector representation models (word embeddings, sentence embeddings, document embeddings); since the training times are so close, they carried little weight in the analysis.
|
As a last step, a spreadsheet was generated for the chosen base (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx), including the columns opo_pre (text) and opo_pre_tkn (tokenized), was made available on the project's GitHub.
|
### Pretraining |
|
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size |
|
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer |
|
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01, |
|
learning rate warmup for 10,000 steps and linear decay of the learning rate after. |
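For reference, a minimal sketch of that optimization schedule (Adam with weight decay, learning rate 1e-4, 10,000 warmup steps, linear decay over the one million training steps), written with PyTorch and the `transformers` scheduler helper as an assumed stand-in for the original training code:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in parameter; in the real setup this would be the model's parameters.
params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,        # linear warmup for the first 10k steps
    num_training_steps=1_000_000,   # then linear decay until step 1M
)
```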
|
## Evaluation results |
|
### Model training with Word2Vec embeddings |
|
Now we have a pre-trained word2vec embedding model that has already learned meanings relevant to our classification problem.
We can couple it to our classification models (Fig. 4), performing transfer learning and then training the model with the labeled data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 6 shows the results obtained with the related metrics. With this implementation, we achieved new levels of accuracy: 86% for the CNN architecture and 88% for the LSTM architecture.
|
#### Table 6: Results from Pre-trained WE + ML models |
|
| ML Model | Accuracy | F1 Score | Precision | Recall | |
|
|:--------:|:---------:|:---------:|:---------:|:---------:| |
|
| NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 | |
|
| DNN | 0.7115 | 0.7794 | 0.7255 | 0.8485 | |
|
| CNN | 0.8654 | 0.9083 | 0.8486 | 0.9773 | |
|
| LSTM | 0.8846 | 0.9139 | 0.9056 | 0.9318 | |
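The sketch below shows one way such a coupling can be set up: a Keras classifier whose embedding layer is initialized from a pre-trained gensim word2vec model and kept frozen (transfer learning). The file path and layer sizes are illustrative assumptions, not the project's exact architecture.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

w2v = Word2Vec.load('word2vec_mcti.model')        # hypothetical path to the pre-trained embeddings
vocab_size, dim = len(w2v.wv), w2v.vector_size

# Copy the learned vectors into an embedding matrix indexed like the word2vec vocabulary.
embedding_matrix = np.zeros((vocab_size, dim))
for i, word in enumerate(w2v.wv.index_to_key):
    embedding_matrix[i] = w2v.wv[word]

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, dim,
                           embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                           trainable=False),       # frozen pre-trained embeddings
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation='sigmoid'),   # binary eligibility output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```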
|
### Transformer-based implementation |
|
Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because of a limitation of the first generation of transformers and BERT-based architectures regarding sentence size: a maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allows processing sequences thousands of tokens long without facing the memory bottleneck of BERT-like architectures, and achieved SOTA on several benchmarks.

For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences would have to be truncated and would probably lose some critical information. By comparison, with the Longformer and its maximum length of 4096, only eight sentences would have their information shortened.

To apply the Longformer, we used the pre-trained base (available at the link) that was previously trained on a combination of vast datasets, as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since more computational power would be needed to fine-tune the weights. The results with the related metrics can be viewed in Table 7. This approach achieved adequate accuracy scores, above 82% in all implemented architectures.
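As a rough sketch of this setup, the snippet below extracts frozen Longformer features for long texts with the `transformers` library; the public `allenai/longformer-base-4096` checkpoint is assumed here as the pre-trained base, and the downstream classifiers from Table 7 would be trained on the resulting vectors.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
encoder = LongformerModel.from_pretrained('allenai/longformer-base-4096')
encoder.eval()

texts = ['A very long funding-opportunity description ...']
batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors='pt')

with torch.no_grad():
    outputs = encoder(**batch)

# One vector per document (first token), to be fed to the NN/DNN/CNN/LSTM classifiers.
features = outputs.last_hidden_state[:, 0, :]
print(features.shape)   # (batch_size, 768)
```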
|
#### Table 7: Results from Pre-trained Longformer + ML models |
|
| ML Model | Accuracy | F1 Score | Precision | Recall | |
|
|:--------:|:---------:|:---------:|:---------:|:---------:| |
|
| NN | 0.8269 | 0.8754 |0.7950 | 0.9773 | |
|
| DNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | |
|
| CNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | |
|
| LSTM | 0.8269 | 0.8801 |0.8571 | 0.9091 | |
|
## Checkpoints |
|
- Examples |
|
- Implementation Notes |
|
- Usage Example |
|
- >>> |
|
- >>> ... |
|
## Config |
|
## Tokenizer |
|
|
|
## Benchmarks |
|
|
|
| Model           | RMSE      | MSE       | MAE       | FCP       |
|
|-----------------|-----------|-----------|-----------|-----------| |
|
| NormalPredictor | 1.820737 | 3.315084 | 1.475522 | 0.514134 | |
|
| BaselineOnly | 1.072843 | 1.150992 | 0.890233 | 0.556560 | |
|
| KNNBasic | 1.232248 | 1.518436 | 0.936799 | 0.648604 | |
|
| KNNWithMeans | 1.124166 | 1.263750 | 0.808329 | 0.597148 | |
|
| KNNWithZScore | 1.056550 | 1.116299 | 0.750004 | 0.669651 | |
|
| KNNBaseline | 1.134660 | 1.287454 | 0.825161 | 0.614270 | |
|
| SVD | 0.977468 | 0.955444 | 0.757485 | 0.723829 | |
|
| SVDpp | 0.843065 | 0.710758 | 0.670516 | 0.671737 | |
|
| NMF | 1.122684 | 1.260420 | 0.722101 | 0.688728 | |
|
| SlopeOne | 1.073552 | 1.152514 | 0.747142 | 0.651937 | |
|
| CoClustering | 1.293383 | 1.672838 | 1.007951 | 0.494174 | |
|
|
|
|
|
### BibTeX entry and citation info |
|
```bibtex |
|
@article{recommend22, |
|
author ={Jo\~{a}o Gabriel de Moraes Souza and Daniel Oliveira Cajueiro and Johnathan de O. Milagres and Vin\'{i}cius de Oliveira Watanabe and V\'{i}tor Bandeira Borges and Victor Rafael Celestino},
|
title ={A comprehensive review of recommendation systems: method, data, evaluation and coding}, |
|
booktitle ={xxxx}, |
|
year ={xxxx}, |
|
pages ={xxxx}, |
|
publisher ={xxxx}, |
|
organization ={xxxx}, |
|
doi ={xxxx}, |
|
isbn ={xxxx}, |
|
issn ={xxxx}, |
|
} |
|
``` |
|
<a href="https://huggingface.co/exbert/?model=bert-base-uncased"> |
|
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png"> |
|
</a> |