metadata
language:
  - pt
license: apache-2.0
library_name: transformers
datasets:
  - wikimedia/wikipedia
metrics:
  - accuracy
model-index:
  - name: periquito-3B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ENEM Challenge (No Images)
          type: eduagarcia/enem_challenge
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 17.98
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BLUEX (No Images)
          type: eduagarcia-temp/BLUEX_without_images
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 21.14
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: OAB Exams
          type: eduagarcia/oab_exams
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 22.69
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Assin2 RTE
          type: assin2
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 43.01
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Assin2 STS
          type: eduagarcia/portuguese_benchmark
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: pearson
            value: 8.92
            name: pearson
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: FaQuAD NLI
          type: ruanchaves/faquad-nli
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 43.97
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HateBR Binary
          type: ruanchaves/hatebr
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 50.46
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: PT Hate Speech Binary
          type: hate_speech_portuguese
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 41.19
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: tweetSentBR
          type: eduagarcia-temp/tweetsentbr
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 47.96
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B
          name: Open Portuguese LLM Leaderboard

Model Card for Periquito-3B

Model Details

Model Description

Periquito-3B is a large language model (LLM) trained by Wandemberg Gibaut (wandgibaut). It is built on the OpenLLaMA-3B architecture and fine-tuned on Portuguese Wikipedia (pt-br) data. This specialization targets understanding and generating text in Brazilian Portuguese.

  • Developed by: Wandemberg Gibaut
  • Model type: Llama
  • Language(s) (NLP): Portuguese
  • License: Apache License 2.0
  • Finetuned from model: openlm-research/open_llama_3b

Loading the Weights with Hugging Face Transformers

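import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'wandgibaut/periquito-3B'

# Load the tokenizer and the fp16 weights.
# device_map='auto' requires the accelerate package and places the model on the available device(s).
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

prompt = 'Q: Qual o maior animal terrestre?\nA:'
# Move the prompt tokens to the same device as the model before generating.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate up to 32 new tokens and print the decoded text (prompt included).
generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32
)
print(tokenizer.decode(generation_output[0]))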

For more advanced usage, please follow the transformers LLaMA documentation.

Evaluating with LM-Eval-Harness

The model can be evaluated with lm-eval-harness. However, we used a custom version that adds some translated tasks and the ENEM suite. It can be found in wandgibaut/lm-evaluation-harness-PTBR.
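As a rough sketch, one of the runs reported under Evaluation could be launched from a checkout of the fork as shown below. This assumes the fork keeps the `main.py` entry point and the command-line flags of the upstream lm-evaluation-harness releases that exposed the `hf-causal` model type; the helper name `build_eval_command` is hypothetical and only illustrates how the arguments fit together.

```python
import shlex
import sys


def build_eval_command(tasks, num_fewshot, model="wandgibaut/periquito-3B"):
    """Build the argument list for one harness run (hypothetical helper)."""
    return [
        sys.executable, "main.py",
        "--model", "hf-causal",
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]


# Task names taken from the result tables in this card.
cmd = build_eval_command(["enem", "enem_2022"], num_fewshot=0)
print(shlex.join(cmd))
# To actually run it, pass cmd to subprocess.run(..., check=True) with
# cwd set to the directory where the fork was cloned.
```

Building the command as a list keeps task names and model arguments free of shell-quoting problems.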

Dataset and Training

We fine-tuned the model on the Portuguese Wikipedia dataset with LoRA, on Google TPU v3 hardware provided through Google's TPU Research Cloud program.
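For readers unfamiliar with LoRA, the idea can be sketched in plain Python: the pretrained weight matrix stays frozen and only a low-rank update, the product of two small matrices, is trained. The layer size, rank, and scaling factor below are illustrative assumptions, not the settings used to train Periquito-3B.

```python
import random

random.seed(0)

# Illustrative sizes only (assumptions, not Periquito-3B's actual settings).
D_OUT, D_IN, RANK, ALPHA = 64, 64, 4, 8


def rand_matrix(rows, cols, scale):
    """Return a rows x cols matrix of Gaussian noise as nested lists."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


W = rand_matrix(D_OUT, D_IN, 1.0)           # frozen pretrained weight
A = rand_matrix(RANK, D_IN, 0.01)           # trainable, small random init
B = [[0.0] * RANK for _ in range(D_OUT)]    # trainable, zero init


def lora_forward(x, W, A, B):
    """Compute W x + (ALPHA / RANK) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (ALPHA / RANK) * u for b, u in zip(base, update)]


x = [random.gauss(0.0, 1.0) for _ in range(D_IN)]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
assert lora_forward(x, W, A, B) == matvec(W, x)

full_params = D_OUT * D_IN
lora_params = RANK * D_IN + D_OUT * RANK
print(f"trainable parameters: {lora_params} of {full_params}")
```

Because only A and B receive gradients, the number of trainable parameters grows with the rank rather than with the full size of the layer.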

Evaluation

We evaluated Periquito-3B with lm-evaluation-harness on a set of Portuguese tasks, in zero-shot and few-shot settings. Each block below starts with the harness configuration line for that run (model arguments and num_fewshot), followed by the per-task results.

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

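| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agnews_pt | 0 | acc | 0.6184 | ± 0.0056 |
| boolq_pt | 1 | acc | 0.6333 | ± 0.0084 |
| faquad | 1 | exact | 7.9365 | |
| | | f1 | 45.6971 | |
| | | HasAns_exact | 7.9365 | |
| | | HasAns_f1 | 45.6971 | |
| | | NoAns_exact | 0.0000 | |
| | | NoAns_f1 | 0.0000 | |
| | | best_exact | 7.9365 | |
| | | best_f1 | 45.6971 | |
| imdb_pt | 0 | acc | 0.6338 | ± 0.0068 |
| sst2_pt | 1 | acc | 0.6823 | ± 0.0158 |
| toldbr | 0 | acc | 0.4629 | ± 0.0109 |
| | | f1_macro | 0.3164 | |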

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 3, batch_size: None

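| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agnews_pt | 0 | acc | 0.6242 | ± 0.0056 |
| boolq_pt | 1 | acc | 0.6477 | ± 0.0084 |
| faquad | 1 | exact | 34.9206 | |
| | | f1 | 70.3968 | |
| | | HasAns_exact | 34.9206 | |
| | | HasAns_f1 | 70.3968 | |
| | | NoAns_exact | 0.0000 | |
| | | NoAns_f1 | 0.0000 | |
| | | best_exact | 34.9206 | |
| | | best_f1 | 70.3968 | |
| imdb_pt | 0 | acc | 0.8408 | ± 0.0052 |
| sst2_pt | 1 | acc | 0.7775 | ± 0.0141 |
| toldbr | 0 | acc | 0.5143 | ± 0.0109 |
| | | f1_macro | 0.5127 | |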

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

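| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| enem | 0 | acc | 0.1976 | ± 0.0132 |
| | | 2009 | 0.2022 | ± 0.0428 |
| | | 2016 | 0.1809 | ± 0.0399 |
| | | 2015 | 0.1348 | ± 0.0364 |
| | | 2016_2_ | 0.2366 | ± 0.0443 |
| | | 2017 | 0.2022 | ± 0.0428 |
| | | 2013 | 0.1647 | ± 0.0405 |
| | | 2012 | 0.2174 | ± 0.0432 |
| | | 2011 | 0.2292 | ± 0.0431 |
| | | 2010 | 0.2157 | ± 0.0409 |
| | | 2014 | 0.1839 | ± 0.0418 |
| enem_2022 | 0 | acc | 0.2373 | ± 0.0393 |
| | | 2022 | 0.2373 | ± 0.0393 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.1818 | ± 0.0842 |
| | | natural-sciences | 0.1538 | ± 0.0722 |
| | | languages | 0.3030 | ± 0.0812 |
| enem_CoT | 0 | acc | 0.1812 | ± 0.0127 |
| | | 2009 | 0.1348 | ± 0.0364 |
| | | 2016 | 0.1596 | ± 0.0380 |
| | | 2015 | 0.1124 | ± 0.0337 |
| | | 2016_2_ | 0.1290 | ± 0.0350 |
| | | 2017 | 0.2247 | ± 0.0445 |
| | | 2013 | 0.1765 | ± 0.0416 |
| | | 2012 | 0.2391 | ± 0.0447 |
| | | 2011 | 0.1979 | ± 0.0409 |
| | | 2010 | 0.2451 | ± 0.0428 |
| | | 2014 | 0.1839 | ± 0.0418 |
| enem_CoT_2022 | 0 | acc | 0.2119 | ± 0.0378 |
| | | 2022 | 0.2119 | ± 0.0378 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.1818 | ± 0.0842 |
| | | natural-sciences | 0.2308 | ± 0.0843 |
| | | languages | 0.1515 | ± 0.0634 |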

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 1, batch_size: None

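| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| enem | 0 | acc | 0.1790 | ± 0.0127 |
| | | 2009 | 0.1573 | ± 0.0388 |
| | | 2016 | 0.2021 | ± 0.0416 |
| | | 2015 | 0.1573 | ± 0.0388 |
| | | 2016_2_ | 0.1935 | ± 0.0412 |
| | | 2017 | 0.2247 | ± 0.0445 |
| | | 2013 | 0.1412 | ± 0.0380 |
| | | 2012 | 0.1739 | ± 0.0397 |
| | | 2011 | 0.1979 | ± 0.0409 |
| | | 2010 | 0.1961 | ± 0.0395 |
| | | 2014 | 0.1379 | ± 0.0372 |
| enem_2022 | 0 | acc | 0.1864 | ± 0.0360 |
| | | 2022 | 0.1864 | ± 0.0360 |
| | | human-sciences | 0.2432 | ± 0.0715 |
| | | mathematics | 0.1364 | ± 0.0749 |
| | | natural-sciences | 0.1154 | ± 0.0639 |
| | | languages | 0.2121 | ± 0.0723 |
| enem_CoT | 0 | acc | 0.2009 | ± 0.0132 |
| | | 2009 | 0.2135 | ± 0.0437 |
| | | 2016 | 0.2340 | ± 0.0439 |
| | | 2015 | 0.1348 | ± 0.0364 |
| | | 2016_2_ | 0.2258 | ± 0.0436 |
| | | 2017 | 0.2360 | ± 0.0453 |
| | | 2013 | 0.1529 | ± 0.0393 |
| | | 2012 | 0.1957 | ± 0.0416 |
| | | 2011 | 0.2500 | ± 0.0444 |
| | | 2010 | 0.1667 | ± 0.0371 |
| | | 2014 | 0.1954 | ± 0.0428 |
| enem_CoT_2022 | 0 | acc | 0.2542 | ± 0.0403 |
| | | 2022 | 0.2542 | ± 0.0403 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.2273 | ± 0.0914 |
| | | natural-sciences | 0.3846 | ± 0.0973 |
| | | languages | 0.1515 | ± 0.0634 |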
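When reading the Stderr column, a normal-approximation 95% interval (value ± 1.96 × stderr) gives a quick sense of whether a score can be told apart from random guessing. The sketch below applies this to three rows from the tables above. The chance levels are assumptions of this example, not part of the harness output: 0.20 assumes five answer options per ENEM question, and 0.50 assumes imdb_pt is a balanced binary task.

```python
def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval: value ± z * stderr."""
    return value - z * stderr, value + z * stderr


# (label, accuracy, stderr, assumed chance level), copied from the tables above.
rows = [
    ("enem, 0-shot", 0.1976, 0.0132, 0.20),
    ("enem_CoT_2022, 1-shot", 0.2542, 0.0403, 0.20),
    ("imdb_pt, 3-shot", 0.8408, 0.0052, 0.50),
]

for label, acc, stderr, chance in rows:
    low, high = confidence_interval(acc, stderr)
    verdict = "includes" if low <= chance <= high else "excludes"
    print(f"{label}: [{low:.3f}, {high:.3f}] {verdict} chance ({chance:.2f})")
```

The per-year and per-subject ENEM rows have much larger standard errors than the aggregate rows, so their intervals are correspondingly wider.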

Use Cases:

The model is suitable for text generation, language understanding, and various natural language processing tasks in Brazilian Portuguese.

Limitations:

Like many language models, Periquito-3B might exhibit biases present in its training data. Additionally, its performance is primarily optimized for Portuguese, potentially limiting its effectiveness with other languages.

Ethical Considerations:

Users are encouraged to use the model ethically, particularly by avoiding the generation of harmful or biased content.

Acknowledgment

We thank the Google TPU Research Cloud program for providing part of the computation resources.

Citation

If you found periquito-3B useful in your research or applications, please cite using the following BibTeX:

BibTeX:

@software{wandgibautperiquito3B,
  author = {Gibaut, Wandemberg},
  title = {Periquito-3B},
  month = Sep,
  year = 2023,
  url = {https://huggingface.co/wandgibaut/periquito-3B}
}

Open Portuguese LLM Leaderboard Evaluation Results

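Detailed results can be found on the Open Portuguese LLM Leaderboard: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=wandgibaut/periquito-3B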

| Metric | Value |
|---|---|
| Average | 33.04 |
| ENEM Challenge (No Images) | 17.98 |
| BLUEX (No Images) | 21.14 |
| OAB Exams | 22.69 |
| Assin2 RTE | 43.01 |
| Assin2 STS | 8.92 |
| FaQuAD NLI | 43.97 |
| HateBR Binary | 50.46 |
| PT Hate Speech Binary | 41.19 |
| tweetSentBR | 47.96 |
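The Average row is consistent with an unweighted mean of the nine task scores, which can be checked in a few lines. This is only a sanity check on the numbers above; it assumes the leaderboard gives every task equal weight.

```python
from statistics import mean

# Scores copied from the leaderboard table above.
scores = {
    "ENEM Challenge (No Images)": 17.98,
    "BLUEX (No Images)": 21.14,
    "OAB Exams": 22.69,
    "Assin2 RTE": 43.01,
    "Assin2 STS": 8.92,
    "FaQuAD NLI": 43.97,
    "HateBR Binary": 50.46,
    "PT Hate Speech Binary": 41.19,
    "tweetSentBR": 47.96,
}

average = mean(scores.values())
print(f"{average:.2f}")  # prints 33.04
```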