File size: 3,179 Bytes
1c39b49 46872c7 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 14ef681 1c39b49 fbe6807 1c39b49 fbe6807 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
---
datasets:
- unicamp-dl/mmarco
language:
- pt
pipeline_tag: text2text-generation
base_model: unicamp-dl/ptt5-v2-base
license: apache-2.0
---
## Introduction
MonoPTT5 models are T5 rerankers for the Portuguese language. Starting from [ptt5-v2 checkpoints](https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0), they were trained for 100k steps on a mixture of Portuguese and English data from the mMARCO dataset.
For further information on the training and evaluation of these models, please refer to our paper, [ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language](https://arxiv.org/abs/2008.09144).
## Usage
The easiest way to use our models is through the `rerankers` package. After installing the package using `pip install rerankers[transformers]`, the following code can be used as a minimal working example:
```python
from rerankers import Reranker
import torch
query = "O futebol é uma paixão nacional"
docs = [
"O futebol é superestimado e não deveria receber tanta atenção.",
"O futebol é uma parte essencial da cultura brasileira e une as pessoas.",
]
ranker = Reranker(
"unicamp-dl/monoptt5-base",
inputs_template="Pergunta: {query} Documento: {text} Relevante:",
dtype=torch.float32 # or bfloat16 if supported by your GPU
)
results = ranker.rank(query, docs)
print("Classification results:")
for result in results:
print(result)
# Loading T5Ranker model unicamp-dl/monoptt5-base
# No device set
# Using device cuda
# Using dtype torch.float32
# Loading model unicamp-dl/monoptt5-base, this might take a while...
# Using device cuda.
# Using dtype torch.float32.
# T5 true token set to ▁Sim
# T5 false token set to ▁Não
# Returning normalised scores...
# Inputs template set to Pergunta: {query} Documento: {text} Relevante:
# Classification results:
# document=Document(text='O futebol é uma parte essencial da cultura brasileira e une as pessoas.', doc_id=1, metadata={}) score=0.8186910152435303 rank=1
# document=Document(text='O futebol é superestimado e não deveria receber tanta atenção.', doc_id=0, metadata={}) score=0.008028557524085045 rank=2
```
For additional configurations and more advanced usage, consult the `rerankers` [GitHub repository](https://github.com/AnswerDotAI/rerankers).
## Citation
If you use our models, please cite:
```
@misc{piau2024ptt5v2,
title={ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language},
author={Marcos Piau and Roberto Lotufo and Rodrigo Nogueira},
year={2024},
eprint={2406.10806},
archivePrefix={arXiv},
primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'}
}
``` |