unb-lamfo-nlp-mcti/NLP-Recommendation-MCTI

English

Recommendation


Jmilagres committed on Dec 13, 2022

Commit 6139274 (1 parent: a8e69a1)

Update README.md



README.md +293 -0


@@ -1,3 +1,296 @@
 ---
 license: apache-2.0
 ---

 ---
+language: en
+tags:
+- Classification
 license: apache-2.0
+datasets:
+- tensorflow
+- numpy
+- keras
+- pandas
+- openpyxl
+- gensim
+- contractions
+- nltk
+- spacy
+thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png
 ---
+![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)
+# MCTI Text Classification Task (uncased) DRAFT
+Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
+The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti), which focuses
+on the task of text classification and explores different machine learning strategies to classify a small amount
+of long, unstructured, and uneven data in order to find a suitable method with good performance. Pre-training and word embedding
+solutions were used to learn word relationships from other datasets with considerable similarity and larger scale.
+Then, using the acquired resources and the dataset available at the MCTI, transfer learning and deep learning
+models were applied to improve the understanding of each sentence.
+## According to the abstract,
+Compared to the 81% baseline accuracy rate based on available datasets and the 85% accuracy rate achieved using a
+Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 93%, according to
+["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318).
+## Model description
+![Architecture](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
+## Model variations
+To improve on the accuracy obtained with the baseline implementation, we implemented a transfer learning
+strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
+In this context, we considered two approaches:
+   i) pre-training word embeddings using similar datasets for text classification;
+   ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
+XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models
+also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
+Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of
+two models.
+Another 24 smaller models were released afterward.
+The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).
+#### Table 1:
+| Model                        | #params | Language |
+|------------------------------|:-------:|:--------:|
+| [`mcti-base-uncased`]        | 110M    | English  |
+| [`mcti-large-uncased`]       | 340M    | English  |
+| [`mcti-base-cased`]          | 110M    | English  |
+| [`mcti-large-cased`]         | 110M    | Chinese  |
+| [`-base-multilingual-cased`] | 110M    | Multiple |
+#### Table 2:
+| Dataset                              | Compatibility to base* |
+|--------------------------------------|:----------------------:|
+| Labeled MCTI                         | 100%                   |
+| Full MCTI                            | 100%                   |
+| BBC News Articles                    | 56.77%                 |
+| New unlabeled MCTI                   | 75.26%                 |
+## Intended uses
+You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
+be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
+fine-tuned versions of a task that interests you.
+Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+generation you should look at a model like XXX.
+### How to use
+You can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
+>>> unmasker("Hello I'm a [MASK] model.")
+[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
+  'score': 0.1073106899857521,
+  'token': 4827,
+  'token_str': 'fashion'},
+ {'sequence': "[CLS] hello i'm a fine model. [SEP]",
+  'score': 0.027095865458250046,
+  'token': 2986,
+  'token_str': 'fine'}]
+```
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
+from transformers import BertTokenizer, BertModel
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+model = BertModel.from_pretrained("bert-base-uncased")
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+output = model(**encoded_input)
+```
+and in TensorFlow:
+```python
+from transformers import BertTokenizer, TFBertModel
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+model = TFBertModel.from_pretrained("bert-base-uncased")
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='tf')
+output = model(encoded_input)
+```
+### Limitations and bias
+This model is uncased: it does not make a difference between english
+and English.
+Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
+predictions:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
+>>> unmasker("The man worked as a [MASK].")
+[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
+  'score': 0.09747550636529922,
+  'token': 10533,
+  'token_str': 'carpenter'},
+ {'sequence': '[CLS] the man worked as a salesman. [SEP]',
+  'score': 0.037680890411138535,
+  'token': 18968,
+  'token_str': 'salesman'}]
+>>> unmasker("The woman worked as a [MASK].")
+[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
+  'score': 0.21981462836265564,
+  'token': 6821,
+  'token_str': 'nurse'},
+ {'sequence': '[CLS] the woman worked as a cook. [SEP]',
+  'score': 0.03042375110089779,
+  'token': 5660,
+  'token_str': 'cook'}]
+```
+This bias will also affect all fine-tuned versions of this model.
+## Training data
+The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
+unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
+headers).
+## Training procedure
+### Preprocessing
+Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens, and
+optimize the training of the models.
+The following assumptions were considered:
+- The input database is obtained from the result of Goal 4;
+- Labeling (Goal 4) is considered true for accuracy measurement purposes;
+- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
+- Pre-processing was investigated for the classification goal.
+From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
+to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
+Several Python packages were used to develop the preprocessing code:
+#### Table 3: Python packages used
+|                         Objective                      |   Package    |
+|--------------------------------------------------------|--------------|
+| Resolve contractions and slang usage in text           | [contractions](https://pypi.org/project/contractions) |
+| Natural Language Processing                            | [nltk](https://pypi.org/project/nltk)         |
+| Other data manipulation and calculations, alongside modules included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy)        |
+| Data manipulation and analysis                         | [pandas](https://pypi.org/project/pandas)       |
+| http library                                           | [requests](https://pypi.org/project/requests)     |
+| Training model                                         | [scikit-learn](https://pypi.org/project/scikit-learn) |
+| Machine learning                                       | [tensorflow](https://pypi.org/project/tensorflow)   |
+| Machine learning                                       | [keras](https://keras.io/)        |
+| Translation from multiple languages to English         | [translators](https://pypi.org/project/translators)  |
+As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), the pre-processing code builds and evaluates eight different
+bases, derived from the base of Goal 4, by applying the methods shown in Table 4.
+#### Table 4: Preprocessing methods evaluated
+|  id    |                   Experiments                                          |
+|--------|------------------------------------------------------------------------|
+| Base   | Original Texts                                                         |
+| xp1    | Expand Contractions                                                    |
+| xp2    | Expand Contractions + Convert text to lowercase                        |
+| xp3    | Expand Contractions + Remove Punctuation                               |
+| xp4    | Expand Contractions + Remove Punctuation + Convert text to lowercase   |
+| xp5    | xp4 + Stemming                                                         |
+| xp6    | xp4 + Lemmatization                                                    |
+| xp7    | xp4 + Stemming + Stopwords Removal                                     |
+| xp8    | xp4 + Lemmatization + Stopwords Removal                                |
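The steps listed in Table 4 can be illustrated with a minimal, standard-library-only sketch. It is not the project's notebook code: the `CONTRACTIONS` and `STOPWORDS` tables and all function names are hypothetical stand-ins for what the `contractions` and `nltk` packages provide, and stemming and lemmatization are left out because they need those packages.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only;
# the notebook relies on the `contractions` and `nltk` packages instead.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}
STOPWORDS = {"the", "is", "a", "an", "of", "to", "do", "it", "we"}


def expand_contractions(text: str) -> str:
    """xp1: replace each known contraction with its expanded form."""
    pattern = re.compile(
        "|".join(re.escape(c) for c in CONTRACTIONS), flags=re.IGNORECASE
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_punctuation(text: str) -> str:
    """Replace ASCII punctuation with spaces, then collapse whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.translate(table).split())


def preprocess_xp4(text: str) -> str:
    """xp4: expand contractions + remove punctuation + lowercase."""
    return remove_punctuation(expand_contractions(text)).lower()


def remove_stopwords(text: str) -> list[str]:
    """Stopword removal, the last step of xp7 and xp8 (stemming omitted)."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


sample = "We're funding R&D; it's open to all. Don't miss the deadline!"
xp4 = preprocess_xp4(sample)
tokens = remove_stopwords(xp4)
print(xp4)
print(tokens)
```

Contractions are expanded before punctuation is stripped because removing the apostrophe first would leave fragments such as `don t` that no longer match the lookup table.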
+First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and
+evaluation of the first four bases (xp1, xp2, xp3, xp4).
+Then, content simplification was evaluated starting from the xp4 base, considering stemming (xp5), lemmatization (xp6),
+stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8).
+All eight bases were evaluated on classifying the eligibility of the opportunity by training a shallow
+neural network (SNN). The resulting metrics are
+shown in Table 5.
+#### Table 5: Results obtained in Preprocessing
+|  id    |                   Experiment                                           | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
+|--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
+| Base   | Original Texts                                                         |  89.78%  |  84.20%  | 79.09% |   90.95%  |  417.772 |   23788  |   5636     |
+| xp1    | Expand Contractions                                                    |  88.71%  |  81.59%  | 71.54% |   97.33%  |  414.715 |   23768  |   5636     |
+| xp2    | Expand Contractions + Convert text to lowercase                        |  90.32%  |  85.64%  | 77.19% |   97.44%  |  368.375 |   20322  |   5629     |
+| xp3    | Expand Contractions + Remove Punctuation                               |  91.94%  |  87.73%  | 79.66% |   98.72%  |  386.650 |   22121  |   4950     |
+| xp4    | Expand Contractions + Remove Punctuation + Convert text to lowercase   |  90.86%  |  86.61%  | 80.85% |   94.25%  |  326.830 |   18616  |   4950     |
+| xp5    | xp4 + Stemming                                                         |  91.94%  |  87.68%  | 78.47% |  100.00%  |  257.960 |   14319  |   4950     |
+| xp6    | xp4 + Lemmatization                                                    |  89.78%  |  85.06%  | 79.66% |   91.87%  |  282.645 |   16194  |   4950     |
+| xp7    | xp4 + Stemming + Stopwords Removal                                     |  92.47%  |  88.46%  | 79.66% |  100.00%  |  210.320 |   14212  |   2817     |
+| xp8    | xp4 + Lemmatization + Stopwords Removal                                |  92.47%  |  88.46%  | 79.66% |  100.00%  |  225.580 |   16081  |   2726     |
+The xp7 and xp8 bases obtained the best results, with identical accuracy, f1-score, recall, and precision. Even so, between these two excellent options, one can judge which one to choose. xp7 has a shorter training time
+and fewer unique tokens, while xp8 has a smaller maximum length. In this case, the criterion used for the choice
+was the computational cost required to train the vector representation models (word embeddings, sentence embeddings,
+document embeddings). The training times are so close that they did not weigh heavily in the analysis.
+As a last step, a spreadsheet was generated for the chosen model (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
+available on the project's GitHub.
+### Pretraining
+The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
+of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
+used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
+learning rate warmup for 10,000 steps and linear decay of the learning rate after.
+## Evaluation results
+### Model training with Word2Vec embeddings
+Now we have a pre-trained model of word2vec embeddings that has already learned relevant meanings for our classification problem.
+We can couple it to our classification models (Fig. 4), realizing transfer learning and then training the model with the labeled
+data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 6 shows the
+obtained results with related metrics. With this implementation, we achieved new levels of accuracy, with 86% for the CNN
+architecture and 88% for the LSTM architecture.
+#### Table 6: Results from Pre-trained WE + ML models
+| ML Model |  Accuracy | F1 Score  | Precision |   Recall  |
+|:--------:|:---------:|:---------:|:---------:|:---------:|
+| NN       |  0.8269   |  0.8545   |  0.8392   |  0.8712   |
+| DNN      |  0.7115   |  0.7794   |  0.7255   |  0.8485   |
+| CNN      |  0.8654   |  0.9083   |  0.8486   |  0.9773   |
+| LSTM     |  0.8846   |  0.9139   |  0.9056   |  0.9318   |
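The coupling of pre-trained word vectors to a classifier, as described in this section, can be sketched in plain Python. This is an illustrative assumption, not the project's implementation: the `pretrained` dictionary stands in for a trained Word2Vec model, and `build_embedding_matrix` and `encode` are hypothetical helper names. In a deep learning framework, a matrix like this would typically initialize the embedding layer of the CNN or LSTM classifier.

```python
import random

EMBEDDING_DIM = 4  # illustrative only; real Word2Vec vectors are much larger

# Hypothetical pre-trained vectors standing in for a trained Word2Vec model.
pretrained = {
    "research": [0.1, 0.3, -0.2, 0.5],
    "funding": [0.4, -0.1, 0.2, 0.0],
    "grant": [0.3, 0.0, 0.1, 0.2],
}


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Return (matrix, word_to_index); row 0 is reserved for padding."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim]
    word_to_index = {}
    for index, word in enumerate(vocab, start=1):
        word_to_index[word] = index
        if word in vectors:
            # Transfer learning: reuse the vector learned on the larger corpus.
            matrix.append(list(vectors[word]))
        else:
            # Words unseen during pre-training get a small random vector.
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix, word_to_index


def encode(tokens, word_to_index, max_len):
    """Map tokens to row indices, then truncate or pad with 0 to max_len.

    Unknown tokens are mapped to the padding index 0 as a simplification.
    """
    ids = [word_to_index.get(tok, 0) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = ["research", "funding", "deadline"]
matrix, word_to_index = build_embedding_matrix(vocab, pretrained, EMBEDDING_DIM)
encoded = encode(["funding", "research", "unknown"], word_to_index, max_len=5)
print(encoded)
```

Keeping the matrix rows fixed during supervised training corresponds to pure transfer learning, while letting them update corresponds to fine-tuning.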
+### Transformer-based implementation
+Another way we used pre-trained vector representations was by use of a Longformer (Beltagy et al., 2020). We chose it because
+of the limitation of the first generation of transformers and BERT-based architectures involving the size of the sentences:
+the maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
+input sequence length, \\(O(n^2)\\) (Beltagy et al., 2020). The Longformer allows the processing of sequences of thousands of tokens
+without facing the memory bottleneck of BERT-like architectures and achieved SOTA in several benchmarks.
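To make the scaling argument concrete, the sketch below counts attention scores for full self-attention and for a sliding-window scheme of the kind the Longformer uses for local attention. The window size of 512 is an assumption chosen for illustration, the function names are hypothetical, and the Longformer's additional global attention tokens are ignored.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Number of query-key scores when every token attends to every token."""
    return n_tokens * n_tokens


def windowed_attention_scores(n_tokens: int, window: int) -> int:
    """Upper bound on scores when each token attends to at most `window` tokens."""
    return n_tokens * min(window, n_tokens)


WINDOW = 512  # assumed window size, for illustration only

for n in (512, 4096):
    full = full_attention_scores(n)
    local = windowed_attention_scores(n, WINDOW)
    print(f"n={n}: full={full:,} windowed<={local:,} ratio={full / local:.0f}x")
```

Going from 512 to 4096 tokens multiplies the full-attention count by 64, but the windowed bound only by 8, which is why longer inputs become affordable.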
+For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
+would have to be truncated and would probably lose some critical information. By comparison, with the Longformer, with a maximum
+length of 4096, only eight sentences would have their information shortened.
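A count like the one above can be reproduced from a list of per-document token lengths. The sketch below uses made-up lengths, not the MCTI data, and `count_truncated` is a hypothetical helper name.

```python
def count_truncated(lengths, max_len):
    """Count documents whose token length exceeds the model's maximum."""
    return sum(1 for n in lengths if n > max_len)


# Hypothetical token counts per document, for illustration only.
lengths = [120, 480, 513, 900, 4096, 5000]

truncated_bert = count_truncated(lengths, 512)
truncated_longformer = count_truncated(lengths, 4096)
print(truncated_bert, truncated_longformer)
```

A document of exactly `max_len` tokens still fits, so the comparison is strict.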
+To apply the Longformer, we used the pre-trained base (available on the link) that was previously trained with a combination
+of vast datasets as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification
+models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since more
+computational power was needed to fine-tune the weights. The results with related metrics can be viewed in Table 7.
+This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
+#### Table 7: Results from Pre-trained Longformer + ML models
+| ML Model |  Accuracy | F1 Score  | Precision |   Recall  |
+|:--------:|:---------:|:---------:|:---------:|:---------:|
+| NN       |  0.8269   |  0.8754   |  0.7950   |  0.9773   |
+| DNN      |  0.8462   |  0.8776   |  0.8474   |  0.9123   |
+| CNN      |  0.8462   |  0.8776   |  0.8474   |  0.9123   |
+| LSTM     |  0.8269   |  0.8801   |  0.8571   |  0.9091   |
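For reference, the four metrics reported in the result tables can be computed from binary labels as in the following sketch. The labels are invented for illustration, and the function is a generic textbook formulation rather than the project's evaluation code (which may, for example, average over classes or folds).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

High recall with lower precision, as in the NN row of Table 7, means the classifier rarely misses a positive example but labels some negatives as positive.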
+## Checkpoints
+- Examples
+- Implementation Notes
+- Usage Example
+## Config
+## Tokenizer
+## Benchmarks
+### BibTeX entry and citation info
+```bibtex
+@conference{webist22,
+author       ={Carlos Rocha and Marcos Dib and Li Weigang and Andrea Nunes and Allan Faria and Daniel Cajueiro
+               and Maísa {Kely de Melo} and Victor Celestino},
+title        ={Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data},
+booktitle    ={Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
+year         ={2022},
+pages        ={201-213},
+publisher    ={SciTePress},
+organization ={INSTICC},
+doi          ={10.5220/0011527700003318},
+isbn         ={978-989-758-613-2},
+issn         ={2184-3252},
+}
+```