metadata

language: en
tags:
  - Recommendation
license: apache-2.0
datasets:
  - surprise
  - numpy
  - keras
  - pandas
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png

MCTI Recommendation Task (uncased) DRAFT

Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.

The model NLP MCTI Recommendation Multi is part of the Research Financing Product Portfolio (FPP) project. It focuses on the task of recommendation and explores different machine learning strategies that suggest items likely to be useful to a particular individual. Several methods were compared against each other on their error estimates. A simulated dataset was created using an LDA model.

According to the abstract,

XXXXX "Using transfer learning to classify long unstructured texts with small amounts of labeled data".

Model description

The surprise library provides 11 rating-prediction models, each based on a different collaborative-filtering technique. The models are listed below with a brief explanation; for more information, please refer to the package documentation.

random_pred.NormalPredictor: Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

baseline_only.BaselineOnly: Algorithm predicting the baseline estimate for given user and item.

knns.KNNBasic: A basic collaborative filtering algorithm.

knns.KNNWithMeans: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

knns.KNNWithZScore: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

knns.KNNBaseline: A basic collaborative filtering algorithm taking into account a baseline rating.

matrix_factorization.SVD: The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

matrix_factorization.SVDpp: The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

matrix_factorization.NMF: A collaborative filtering algorithm based on Non-negative Matrix Factorization.

slope_one.SlopeOne: A simple yet accurate collaborative filtering algorithm.

co_clustering.CoClustering: A collaborative filtering algorithm based on co-clustering.

Every model was trained and evaluated. When compared against each other, the methods produced different error estimates.
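To make one of the simpler entries concrete, the baseline estimate used by BaselineOnly has the form mu + b_u + b_i: the global mean rating plus a user bias and an item bias. The sketch below is a simplification under stated assumptions. It computes both biases as plain, unregularized averages over a few hypothetical ratings, whereas surprise fits regularized biases with ALS or SGD. The function names and the toy data are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def fit_baselines(ratings):
    """Return (mu, user_bias, item_bias) from (user, item, rating) triples.

    Simplified and unregularized: item biases are average deviations from
    the global mean, user biases are the average remaining deviations.
    """
    mu = mean(r for _, _, r in ratings)

    by_item = defaultdict(list)
    for _, item, r in ratings:
        by_item[item].append(r - mu)
    item_bias = {item: mean(devs) for item, devs in by_item.items()}

    by_user = defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append(r - mu - item_bias[item])
    user_bias = {user: mean(devs) for user, devs in by_user.items()}

    return mu, user_bias, item_bias


def predict(mu, user_bias, item_bias, user, item):
    # Unknown users or items fall back to a bias of zero.
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


# Hypothetical toy ratings: (userID, itemID, rating)
toy_ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u2", "c", 2)]
mu, user_bias, item_bias = fit_baselines(toy_ratings)
print(predict(mu, user_bias, item_bias, "u1", "c"))
```

Here user u1 rates slightly above average and item c sits well below it, so the estimate for the unseen pair (u1, c) lands below the global mean of 3.5.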

Intended uses

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions of a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like XXX.

How to use

The dataset for collaborative filtering must be a dataframe containing the ratings. It must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
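As a small illustration of that layout, the sketch below builds a ratings dataframe with the three columns in the required order and checks it before use. The values, the helper name `check_ratings_frame` and the default 1 to 5 scale are assumptions made for the example; a frame like this is the kind of input that surprise's `Dataset.load_from_df` takes together with a `Reader`.

```python
import pandas as pd

# Hypothetical ratings; any source works as long as the columns end up in this order.
ratings = pd.DataFrame({
    "userID": [0, 0, 1, 2],
    "itemID": [10, 11, 10, 12],
    "rating": [4, 2, 5, 3],
})


def check_ratings_frame(df, rating_scale=(1, 5)):
    """Raise ValueError if df does not follow the three-column layout."""
    if df.shape[1] != 3:
        raise ValueError("expected exactly three columns: user, item, rating")
    low, high = rating_scale
    # The third column must hold ratings inside the declared scale.
    if not df.iloc[:, 2].between(low, high).all():
        raise ValueError(f"ratings must lie in [{low}, {high}]")
    return df


check_ratings_frame(ratings)
```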

import pandas as pd
import numpy as np

class Data:

    # The databases ml_100k, ml_1m and jester are built into the surprise package for collaborative filtering.

    def __init__(self):
        self.available_databases = ['ml_100k', 'ml_1m', 'jester', 'lda_topics', 'lda_rankings', 'uniform']

    def show_available_databases(self):
        print('The avaliable database are:')
        for i, database in enumerate(self.available_databases):
            print(str(i) + ': ' + database)
        
    def read_data(self,database_name):
        self.database_name=database_name
        self.the_data_reader= getattr(self, 'read_'+database_name.lower())
        self.the_data_reader()   

    def read_ml_100k(self):

        from surprise import Dataset
        data = Dataset.load_builtin('ml-100k')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id','item_id','rating','timestamp'])
        self.df.drop(columns=['timestamp'],inplace=True)
        self.df.rename({'user_id':'userID','item_id':'itemID'},axis=1,inplace=True)

    def read_ml_1m(self):

        from surprise import Dataset
        data = Dataset.load_builtin('ml-1m')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id','item_id','rating','timestamp'])
        self.df.drop(columns=['timestamp'],inplace=True)
        self.df.rename({'user_id':'userID','item_id':'itemID'},axis=1,inplace=True)

    def read_jester(self):

        from surprise import Dataset
        data = Dataset.load_builtin('jester')
        self.df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id','item_id','rating','timestamp'])
        self.df.drop(columns=['timestamp'],inplace=True)
        self.df.rename({'user_id':'userID','item_id':'itemID'},axis=1,inplace=True)

Hyperparameters -

n_users : number of simulated users in the database;

n_ratings : number of simulated rating events in the database.

This is a fictional dataset in which each rating event assigns a uniformly distributed random rating (from 1 to 5) to a random opportunity on behalf of one of the simulated users of the recommender system being designed in this research project.


        
    def read_uniform(self):

        n_users = 20
        n_ratings = 10000

        import random

        opo = pd.read_csv('../oportunidades.csv')
        # randint includes both endpoints, so ratings range from 1 to 5
        df = [(random.randrange(n_users), random.randrange(len(opo)), random.randint(1, 5)) for i in range(n_ratings)]
        self.df = pd.DataFrame(df, columns = ['userID', 'itemID', 'rating'])

Hyperparameters -

n_users : number of simulated users in the database;

n_ratings : number of simulated rating events in the database.

This first LDA-based dataset builds a model with K = n_users topics. LDA topics are used as proxies for simulated users with different clusters of interest. First, a random opportunity is chosen; then the proportion of a randomly chosen topic in its description is multiplied by five. The ceiling of this result is the rating that the fictional user gives to that opportunity. Because the proportions predicted by the model are spread across many topics, an opportunity rarely has a high LDA value for any single topic. As a consequence, this dataset has very little variation, and most ratings are equal to 1.
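The rating rule can be checked with a few numbers. The sketch below applies the ceiling of the topic proportion times five to some hypothetical proportions; the function name and the sample values are assumptions for illustration.

```python
import math


def topic_rating(topic_share):
    """Map a topic proportion in (0, 1] to a 1-5 rating via ceil(share * 5)."""
    return math.ceil(topic_share * 5)


# Hypothetical topic proportions for one opportunity description.
for share in (0.05, 0.20, 0.21, 0.55, 0.95):
    print(share, topic_rating(share))
```

Any proportion up to 0.2 yields a rating of 1, so a topic must take more than a fifth of a description to be rated 2 or higher. With the probability mass split across many topics this happens rarely, which is why ratings of 1 dominate.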


    def read_lda_topics(self):

        n_users = 20
        n_ratings = 10000
        
        import gensim
        import random
        import math
        
        opo = pd.read_csv('../oportunidades_results.csv')
        # opo = opo.iloc[np.where(opo['opo_brazil']=='Y')]
        
        try:
            lda_model = gensim.models.ldamodel.LdaModel.load(f'models/lda_model{n_users}.model')
        except Exception:  # the saved model could not be loaded: generate it, then load it again
            import generate_users
            generate_users.gen_model(n_users)
            lda_model = gensim.models.ldamodel.LdaModel.load(f'models/lda_model{n_users}.model')

        df = []
        for i in range(n_ratings):
            opo_n = random.randrange(len(opo))
            txt = opo.loc[opo_n,'opo_texto']
            opo_bow = lda_model.id2word.doc2bow(txt.split())
            topics = lda_model.get_document_topics(opo_bow)
            topics = {topic[0]:topic[1] for topic in topics}
            user = random.choice(list(topics))  # pick one of the document's topics as the simulated user
            rating = math.ceil(topics[user]*5)
            df.append((user, opo_n, rating))

        self.df = pd.DataFrame(df, columns = ['userID', 'itemID', 'rating'])
        
    def read_lda_rankings(self):

        n_users = 9
        n_ratings = 1000
        
        import gensim
        import random
        import math
        import tqdm
        
        opo = pd.read_csv('../oportunidades.csv')
        opo = opo[opo['opo_brazil'] == 'Y'].reset_index(drop=True)  # keep rows flagged 'Y' and reindex from 0
        
        path = f'models/output_linkedin_cle_lda_model_{n_users}_topics_symmetric_alpha_auto_beta'
        lda_model = gensim.models.ldamodel.LdaModel.load(path)
        
        df = []
        
        pbar = tqdm.tqdm(total= n_ratings)
        for i in range(n_ratings):
            opo_n = random.randrange(len(opo))
            txt = opo.loc[opo_n,'opo_texto']
            opo_bow = lda_model.id2word.doc2bow(txt.split())
            topics = lda_model.get_document_topics(opo_bow)
            topics = {topic[0]:topic[1] for topic in topics}

            prop = pd.DataFrame([topics], index=['prop']).T.sort_values('prop', ascending=True)
            prop['rating'] = range(1, len(prop)+1)
            prop['rating'] = prop['rating']/len(prop)
            prop['rating'] = prop['rating'].apply(lambda x: math.ceil(x*5))
            prop.reset_index(inplace=True)

            prop = prop.sample(1)

            df.append((prop['index'].values[0], opo_n, prop['rating'].values[0]))
            pbar.update(1)

        pbar.close() 
        self.df = pd.DataFrame(df, columns = ['userID', 'itemID', 'rating'])
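The rank-based rule in read_lda_rankings can be illustrated on its own. In the sketch below, the topics of one document are sorted by proportion in ascending order, each receives its rank, and the rank is rescaled to a 1 to 5 rating with a ceiling. The function name and the sample proportions are assumptions, and the pandas operations are replaced by plain Python for brevity.

```python
import math


def rank_ratings(topic_shares):
    """Rate topics by rank: sort by share ascending, then ceil(rank * 5 / n)."""
    ordered = sorted(topic_shares, key=topic_shares.get)
    n = len(ordered)
    return {topic: math.ceil(rank * 5 / n) for rank, topic in enumerate(ordered, start=1)}


# Hypothetical topic proportions for one opportunity description.
shares = {0: 0.05, 1: 0.10, 2: 0.50, 3: 0.02, 4: 0.33}
print(rank_ratings(shares))
```

Because only the order of the proportions matters, the ratings spread over the whole scale even when every proportion is small, unlike the rule based on the proportion itself.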

Limitations and bias

While building this model we faced some obstacles. Most were overcome, but some, by the nature of the project, could not be fully solved. Databases containing profiles of possible users of the planned prototype are not available. For this reason, it was necessary to carry out simulations to represent the interests of these users so that the recommendation system could be modeled. A simulation of clusters of latent interests was carried out, based on topics present in the texts describing financial products. Because we built the dataset ourselves, no real user has interacted with it yet; therefore we do not have realistic ratings, which makes the results less reliable.

Later on, we used a database of scraped LinkedIn profiles. The problem is that the profiles LinkedIn shows are biased: the profiles that appeared were geographically close or related to the user's organization and email.

Training data

To train the Latent Dirichlet Allocation (LDA) model, a database of scraped LinkedIn profiles of researchers was used.

Training procedure

Evaluation results

Checkpoints

Example


data=Data()
data.show_available_databases()
data.read_data('ml_100k')
method=Method(data.df)  
method.show_methods()
method.run('surprise.KNNWithMeans')
predictions_df=method.predictions_df
evaluator=Evaluator(predictions_df)
evaluator.show_evaluators()
evaluator.run('surprise.mse')

The avaliable database are:

0: ml_100k

1: ml_1m

2: jester

3: lda_topics

4: lda_rankings

5: uniform

The avaliable methods are:

0: surprise.NormalPredictor

1: surprise.BaselineOnly

2: surprise.KNNBasic

3: surprise.KNNWithMeans

4: surprise.KNNWithZScore

5: surprise.KNNBaseline

6: surprise.SVD

7: surprise.SVDpp

8: surprise.NMF

9: surprise.SlopeOne

10: surprise.CoClustering

Computing the msd similarity matrix...

Done computing similarity matrix.

The avaliable evaluators are:

0: surprise.rmse

1: surprise.mse

2: surprise.mae

3: surprise.fcp

MSE: 0.9146

Next, we have the code that builds the table with the accuracy metrics for all rating prediction models built into the surprise package. The function returns a pandas dataframe (11x4) with one row for each of the 11 models and one column for each of the 4 accuracy metrics.


def model_table(label):

    rows = []

    data=Data()
    data.read_data(label)

    method=Method(data.df)

    for m in method.available_methods:
        print(m)
        method.run(m)
        predictions_df=method.predictions_df
        evaluator=Evaluator(predictions_df)

        metrics = []

        for e in evaluator.available_evaluators:
            evaluator.run(e)
            metrics.append(evaluator.acc)

        # One row per model, keyed by evaluator name
        rows.append(dict(zip(evaluator.available_evaluators,metrics)))

    # Build the table in one step from the collected rows
    table = pd.DataFrame(rows)
    # Drop the 'surprise.' prefix (9 characters) from the row and column labels
    table.index = [x[9:] for x in method.available_methods]
    table.columns = [x[9:].upper() for x in evaluator.available_evaluators]

    return table


import contextlib, os

# Suppress the prints while the tables are built; stdout is restored when the block exits
with open(os.devnull, 'w') as devnull, contextlib.redirect_stdout(devnull):
    uniform = model_table('uniform')
    #topics = model_table('lda_topics')
    ranking = model_table('lda_rankings')
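Three of the four metric columns (MSE, RMSE and MAE) are simple functions of the prediction errors. The sketch below computes them in plain Python from hypothetical pairs of true and estimated ratings; it does not call surprise, and the function names are chosen for the example. FCP, the fraction of concordantly ordered pairs, is left out for brevity.

```python
import math


def mse(pairs):
    """Mean squared error over (true_rating, estimated_rating) pairs."""
    return sum((true - est) ** 2 for true, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pairs))


def mae(pairs):
    """Mean absolute error over the same pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical (true, estimated) ratings.
pairs = [(4, 3.5), (2, 3.0), (5, 4.0), (1, 1.5)]
print(mse(pairs), rmse(pairs), mae(pairs))
```

Since RMSE is the square root of MSE, the two columns always rank the models in the same order; MAE can differ because it penalizes large errors less heavily.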

Usage Example

This section explains how recommendations are made for the user.
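The core of the loop is a selection step: once a rating has been estimated for every opportunity, the four highest-rated ones that the user has not evaluated yet are shown next. A minimal sketch of that step, with hypothetical estimates and a helper name chosen for illustration, could look like this:

```python
import heapq


def top_n_unrated(estimates, evaluated, n=4):
    """Return the n item ids with the highest estimate, skipping evaluated ones."""
    candidates = {item: est for item, est in estimates.items() if item not in evaluated}
    # nlargest iterates over the candidate ids and orders them by their estimate.
    return heapq.nlargest(n, candidates, key=candidates.get)


# Hypothetical estimated ratings per item id, and items the user already rated.
estimates = {0: 3.1, 1: 4.7, 2: 4.7, 3: 2.0, 4: 3.9, 5: 4.2}
evaluated = {1}
print(top_n_unrated(estimates, evaluated))
```

Selecting by item id rather than by estimate value keeps items with tied estimates distinct, so the same opportunity is never shown twice.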


import gradio as gr
import random
import pandas as pd

opo = pd.read_csv('oportunidades_results.csv', lineterminator='\n')
# opo = opo.iloc[np.where(opo['opo_brazil']=='Y')]
simulation = pd.read_csv('simulation2.csv')
userID = max(simulation['userID']) + 1

def build_display_text(opo_n):
    
    title = opo.loc[opo_n]['opo_titulo']
    link = opo.loc[opo_n]['link']
    summary = opo.loc[opo_n]['facebook-bart-large-cnn_results']

    display_text = f"**{title}**\n\nURL:\n{link}\n\nSUMMARY:\n{summary}"

    return display_text

# Start from four distinct random opportunities
opo_n_one, opo_n_two, opo_n_three, opo_n_four = random.sample(range(len(opo)), 4)

evaluated = []

def predict_next(option, nota):
    global userID
    global opo_n_one
    global opo_n_two
    global opo_n_three
    global opo_n_four
    global evaluated
    global opo
    global simulation

    selected = [opo_n_one, opo_n_two, opo_n_three, opo_n_four][int(option)-1]

    # Record the new rating as an extra row of the simulation dataframe
    new_row = pd.DataFrame([{'userID': userID, 'itemID': selected, 'rating': nota}])
    simulation = pd.concat([simulation, new_row], ignore_index=True)
    evaluated.append(selected)

    from surprise import Reader
    reader = Reader(rating_scale=(1, 5))

    from surprise import Dataset
    data = Dataset.load_from_df(simulation[['userID', 'itemID', 'rating']], reader)
    trainset = data.build_full_trainset()

    # Refit SVD++ on all ratings collected so far
    from surprise import SVDpp
    svdpp = SVDpp()
    svdpp.fit(trainset)

    # Estimate a rating for every opportunity that has not been evaluated yet
    items = list()
    est = list()

    for i in range(len(opo)):
        if i not in evaluated:
            items.append(i)
            est.append(svdpp.predict(userID, i).est)

    # Rank candidates by estimate; sorting positions keeps tied items distinct
    ranked = sorted(range(len(est)), key=est.__getitem__, reverse=True)
    opo_n_one, opo_n_two, opo_n_three, opo_n_four = (items[j] for j in ranked[:4])

    return build_display_text(opo_n_one), build_display_text(opo_n_two), build_display_text(opo_n_three), build_display_text(opo_n_four)


with gr.Blocks() as demo:
    with gr.Row():
        one_opo = gr.Textbox(build_display_text(opo_n_one), label='Oportunidade 1')
        two_opo = gr.Textbox(build_display_text(opo_n_two), label='Oportunidade 2')

    with gr.Row():
        three_opo = gr.Textbox(build_display_text(opo_n_three), label='Oportunidade 3')
        four_opo = gr.Textbox(build_display_text(opo_n_four), label='Oportunidade 4')

    with gr.Row():
        option = gr.Radio(['1', '2', '3', '4'], label='Opção', value = '1')

    with gr.Row():
        nota = gr.Slider(1,5,step=1,label="Nota 1")

    with gr.Row():
        confirm = gr.Button("Confirmar")

        confirm.click(fn=predict_next,
               inputs=[option, nota],
               outputs=[one_opo, two_opo, three_opo, four_opo])

if __name__ == "__main__":
    demo.launch()

## Benchmarks

```python

# LDA-GENERATED DATASET
ranking

                     RMSE       MSE       MAE       FCP
NormalPredictor  1.820737  3.315084  1.475522  0.514134
BaselineOnly     1.072843  1.150992  0.890233  0.556560
KNNBasic         1.232248  1.518436  0.936799  0.648604
KNNWithMeans     1.124166  1.263750  0.808329  0.597148
KNNWithZScore    1.056550  1.116299  0.750004  0.669651
KNNBaseline      1.134660  1.287454  0.825161  0.614270
SVD              0.977468  0.955444  0.757485  0.723829
SVDpp            0.843065  0.710758  0.670516  0.671737
NMF              1.122684  1.260420  0.722101  0.688728
SlopeOne         1.073552  1.152514  0.747142  0.651937
CoClustering     1.293383  1.672838  1.007951  0.494174


# BENCHMARK DATASET
uniform

                     RMSE       MSE       MAE       FCP
NormalPredictor  1.508925  2.276854  1.226758  0.503723
BaselineOnly     1.153331  1.330172  1.022732  0.506818
KNNBasic         1.205058  1.452165  1.026591  0.501168
KNNWithMeans     1.202024  1.444862  1.028149  0.503527
KNNWithZScore    1.216041  1.478756  1.041070  0.501582
KNNBaseline      1.225609  1.502117  1.048107  0.498198
SVD              1.176273  1.383619  1.013285  0.502067
SVDpp            1.192619  1.422340  1.018717  0.500909
NMF              1.338216  1.790821  1.120604  0.492944
SlopeOne         1.224219  1.498713  1.047170  0.494298
CoClustering     1.223020  1.495778  1.033699  0.518509
```

BibTeX entry and citation info

@article{recommend22,
author       ={Jo\~{a}o Gabriel de Moraes Souza and Daniel Oliveira Cajueiro and Johnathan de O. Milagres and Vin\'{i}cius de Oliveira Watanabe and V\'{i}tor Bandeira Borges and Victor Rafael Celestino},
title        ={A comprehensive review of recommendation systems: method, data, evaluation and coding},
booktitle    ={xxxx},
year         ={xxxx},
pages        ={xxxx},
publisher    ={xxxx},
organization ={xxxx},
doi          ={xxxx},
isbn         ={xxxx},
issn         ={xxxx},
}