---
datasets:
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
model-index:
- name: RoBERTaCrawlPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.8924
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8822
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8658
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.7988
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8280
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8483
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a general-domain Portuguese masked language model pretrained from scratch on the [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpus, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base).
This model is part of the [RoBERTaLexPT](https://huggingface.co/eduagarcia/RoBERTaLexPT-base) [work](https://aclanthology.org/2024.propor-1.38/).

- **Language(s) (NLP):** Portuguese (mainly pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** https://aclanthology.org/2024.propor-1.38/
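
As a minimal usage sketch, the model can be queried through the `fill-mask` pipeline of the `transformers` library. The Hub repository id `eduagarcia/RoBERTaCrawlPT-base` and the helper name `top_predictions` are assumptions made for illustration; adjust the id if the model is published under a different name.

```python
# Assumed Hub repository id; not confirmed by this card.
MODEL_ID = "eduagarcia/RoBERTaCrawlPT-base"


def top_predictions(template, k=5, model_id=MODEL_ID):
    """Return the k most likely fillers for the `{mask}` slot in `template`.

    `top_predictions` is a hypothetical helper, not part of any library.
    """
    # Imported lazily so the module loads without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding it.
    text = template.replace("{mask}", fill.tokenizer.mask_token)
    return [(p["token_str"].strip(), p["score"]) for p in fill(text, top_k=k)]
```

Calling `top_predictions("Brasília é a {mask} do Brasil.")` downloads the weights on first use and returns a list of `(token, score)` pairs.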

## Generic Evaluation

TO-DO...

## Legal Evaluation

The model was evaluated on the [PortuLex benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (\%) of multiple models on the PortuLex benchmark test splits:

| **Model**                                                                  | **LeNER** | **UlyNER-PL**   | **FGV-STF** |  **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
|                                                                            |           | Coarse/Fine     | Coarse      |           |                 |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)  | 88.34     | 86.39/83.83     | 79.34       |   82.34   | 83.78           |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64     | 87.77/84.74     | 79.71       | **83.79** | 84.60           |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based)                   | 89.26     | 86.35/84.63     | 79.30       |   81.16   | 83.80           |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr)                 | 90.09     | 88.36/**86.62** | 79.94       |   82.79   | 85.08           |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert)                          | 83.68     | 79.21/75.70     | 77.73       |   81.11   | 79.99           |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased)        | 81.74     | 81.67/77.97     | 76.04       |   80.85   | 79.61           |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased)     | 84.90     | 87.11/84.42     | 79.78       |   82.35   | 83.20           |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base)                       | 87.48     | 83.49/83.16     | 79.79       |   82.35   | 83.24           |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large)                      | 88.39     | 84.65/84.55     | 79.36       |   81.66   | 83.50           |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large)                 | 87.96     | 88.32/84.83     | 79.57       |   81.98   | 84.02           |
| **Ours**                                                                   |           |                 |             |           |                 |
| RoBERTaTimbau-base (Reproduction of BERTimbau)                             | 89.68     | 87.53/85.74     | 78.82       |   82.03   | 84.29           |
| RoBERTaLegalPT-base (Trained on LegalPT)    | 90.59     | 85.45/84.40     | 79.92       |   82.84   | 84.57           |
| **RoBERTaCrawlPT-base (this)**  (Trained on CrawlPT)    | 89.24     | 88.22/86.58     | 79.88       |   82.80   | 84.83           |
| [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLexPT-base) (Trained on CrawlPT + LegalPT)                       | **90.73** | **88.56**/86.03 | **80.40**   |   83.22   | **85.41**       |
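
The Average column is consistent with a mean over the four tasks in which the two UlyNER-PL scores are first averaged into a single task score. This reading is inferred from the reported numbers and is not stated explicitly above; the function name `portulex_average` is hypothetical. A short check against this model's row:

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Mean over the four PortuLex tasks (assumed aggregation)."""
    # UlyNER-PL counts as one task: average its coarse and fine scores first.
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores of RoBERTaCrawlPT-base taken from the table above.
score = portulex_average(89.24, 88.22, 86.58, 79.88, 82.80)
print(f"{score:.2f}")  # 84.83
```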


## Training Details

RoBERTaCrawlPT is pretrained on [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup), a composition of three general-domain Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac_dedup), the CC100 PT subset, and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.


This computational budget is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and exposes the model to approximately 65 billion tokens during training.
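
The 65 billion figure follows from the number of steps, the batch size, and the maximum sequence length listed under Training Hyperparameters. The sketch below assumes every sequence is filled to the maximum length, so it gives an upper bound:

```python
STEPS = 62_500        # training steps
BATCH_SIZE = 2_048    # sequences per step
MAX_SEQ_LEN = 512     # maximum tokens per sequence

# Upper bound on tokens seen: assumes every sequence is full length.
tokens_seen = STEPS * BATCH_SIZE * MAX_SEQ_LEN
print(f"{tokens_seen:,}")  # 65,536,000,000
```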

#### Preprocessing

We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and the Locality Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
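
To illustrate the idea, here is a minimal pure-Python sketch of MinHash with LSH banding followed by clustering. It is not the text-dedup implementation. The signature length, band count, shingle size, and similarity threshold are all assumed values chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64    # hash functions per signature (assumed)
BANDS = 16       # LSH bands of NUM_PERM // BANDS rows each (assumed)
NGRAM = 3        # word n-gram size used as shingles (assumed)


def shingles(text, n=NGRAM):
    """Set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash; each seed acts as one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash per seed over all shingles."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_PERM))


def candidate_pairs(docs):
    """Pairs of document ids whose signatures agree on at least one band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def duplicate_clusters(docs, threshold=0.7):
    """Group documents into clusters of near-duplicates using union-find."""
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in candidate_pairs(docs):
        # Verify each LSH candidate with the exact similarity before merging.
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for d in docs:
        clusters[find(d)].add(d)
    return [c for c in clusters.values() if len(c) > 1]
```

Keeping one document per cluster then yields the deduplicated corpus. Banding trades recall for speed: more bands with fewer rows each catch lower-similarity pairs at the cost of more candidates to verify.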

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a separate vocabulary for each pretraining corpus.

#### Training Hyperparameters

The model was pretrained for 62,500 steps with a batch size of 2048 and a peak learning rate of 4e-4; each sequence contains at most 512 tokens.  
Weights were initialized randomly.  
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.  
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.  

For other parameters we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):


| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       |               12 |
| Hidden size            |              768 |
| FFN inner hidden size  |             3072 |
| Attention heads        |               12 |
| Attention head size    |               64 |
| Dropout                |              0.1 |
| Attention dropout      |              0.1 |
| Warmup steps           |               6k |
| Peak learning rate     |             4e-4 |
| Batch size             |             2048 |
| Weight decay           |             0.01 |
| Maximum training steps |            62.5k |
| Learning rate decay    |           Linear |
| AdamW $$\epsilon$$     |             1e-6 |
| AdamW $$\beta_1$$      |              0.9 |
| AdamW $$\beta_2$$      |             0.98 |
| Gradient clipping      |              0.0 |
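
The learning rate schedule described above (linear warmup to the peak, then linear decay) can be sketched as follows. The decay endpoint of zero at the final step is an assumption, and `learning_rate` is a hypothetical helper, not a Fairseq function.

```python
PEAK_LR = 4e-4          # peak learning rate
WARMUP_STEPS = 6_000    # warmup steps
TOTAL_STEPS = 62_500    # maximum training steps


def learning_rate(step):
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to 0 at TOTAL_STEPS (assumed endpoint).
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)
```

For example, `learning_rate(3_000)` is half the peak value and `learning_rate(62_500)` is zero.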

## Citation

```bibtex
@inproceedings{garcia-etal-2024-robertalexpt,
    title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
    author = "Garcia, Eduardo A. S.  and
      Silva, Nadia F. F.  and
      Siqueira, Felipe  and
      Albuquerque, Hidelberg O.  and
      Gomes, Juliana R. S.  and
      Souza, Ellen  and
      Lima, Eliomar A.",
    editor = "Gamallo, Pablo  and
      Claro, Daniela  and
      Teixeira, Ant{\'o}nio  and
      Real, Livy  and
      Garcia, Marcos  and
      Oliveira, Hugo Gon{\c{c}}alo  and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.38",
    pages = "374--383",
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).