license: cc-by-4.0
base_model: camembert/camembert-large
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: NERmembert-large-3entities
    results: []
datasets:
  - CATIE-AQ/frenchNER_3entities
language:
  - fr
widget:
  - text: >-
      Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au
      Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus
      par le designer Sylvain Boyer avec les agences Royalties & Ecobranding.
      Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique
      et Marianne, symbolisée par un visage de femme mais privée de son bonnet
      phrygien caractéristique. La typographie dessinée fait référence à l'Art
      déco, mouvement artistique des années 1920, décennie pendant laquelle ont
      eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la
      première fois, ce logo sera unique pour les Jeux olympiques et les Jeux
      paralympiques.
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 90

NERmembert-large-3entities

Model Description

We present NERmembert-large-3entities, a CamemBERT large model fine-tuned for the Named Entity Recognition (NER) task in French on five French NER datasets for 3 entities (LOC, PER, ORG).
All these datasets were concatenated and cleaned into a single dataset that we called frenchNER_3entities.
This represents a total of 420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing.
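As a quick sanity check on the figures above, the following sketch sums the three splits and prints each split's share of the whole. It only uses the row counts quoted in this card; nothing is downloaded.

```python
# Split sizes as reported in the model card.
splits = {"train": 346_071, "validation": 32_951, "test": 41_242}

total = sum(splits.values())
print(f"total rows: {total:,}")

# Share of each split, as a percentage of the whole dataset.
shares = {name: 100 * n / total for name, n in splits.items()}
for name, pct in shares.items():
    print(f"{name:<10} {splits[name]:>8,} rows ({pct:.1f}%)")
```

The training split accounts for a little over four fifths of the rows, with the remainder divided between validation and test.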
Our methodology is described in a blog post available in English or French.

Dataset

The dataset used is frenchNER_3entities, which contains ~420k sentences labeled in 4 categories (O for tokens outside any entity, plus the three entity types below):

| Label | Examples |
|---|---|
| PER | "La Bruyère", "Gaspard de Coligny", "Wittgenstein" |
| ORG | "UTBM", "American Airlines", "id Software" |
| LOC | "République du Cap-Vert", "Créteil", "Bordeaux" |

The distribution of the entities is as follows:


| Splits | O | PER | LOC | ORG |
|---|---|---|---|---|
| train | 8,398,765 | 327,393 | 303,722 | 151,490 |
| validation | 592,815 | 34,127 | 30,279 | 18,743 |
| test | 773,871 | 43,634 | 39,195 | 21,391 |
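To make the class imbalance in this distribution easier to read, here is a small sketch that converts the counts into percentages per split. The numbers are copied from the table; treating them as word-level label counts is an assumption, since the card does not state the unit.

```python
# Label counts per split, copied from the distribution table.
counts = {
    "train":      {"O": 8_398_765, "PER": 327_393, "LOC": 303_722, "ORG": 151_490},
    "validation": {"O": 592_815,   "PER": 34_127,  "LOC": 30_279,  "ORG": 18_743},
    "test":       {"O": 773_871,   "PER": 43_634,  "LOC": 39_195,  "ORG": 21_391},
}

def label_shares(split_counts):
    """Return each label's share of the split's labels, in percent."""
    total = sum(split_counts.values())
    return {label: 100 * n / total for label, n in split_counts.items()}

for split, split_counts in counts.items():
    shares = label_shares(split_counts)
    summary = ", ".join(f"{label} {pct:.1f}%" for label, pct in shares.items())
    print(f"{split:<10} {summary}")
```

O accounts for more than 90% of the labels in the training split, and ORG is the rarest entity type in every split. ORG is also the entity type with the lowest F1 in the evaluation tables of this card.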

Evaluation results

The evaluation was carried out using the evaluate Python package.
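The card does not say which metric of the evaluate package was used. Because the tables below include an O column, the sketch that follows assumes scores computed per label at the token level. It is a plain-Python illustration of the precision, recall and F1 definitions, not the package's implementation, and the toy labels are hypothetical.

```python
from collections import Counter

def per_label_scores(references, predictions):
    """Token-level precision, recall and F1 for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # the predicted label was wrong
            fn[ref] += 1   # the reference label was missed
    scores = {}
    for label in sorted(set(references) | set(predictions)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy example with hypothetical labels (not taken from the dataset).
refs  = ["O", "PER", "PER", "O", "LOC", "ORG", "O", "O"]
preds = ["O", "PER", "O",   "O", "LOC", "LOC", "O", "O"]

scores = per_label_scores(refs, preds)
for label, s in scores.items():
    print(label, {k: round(v, 3) for k, v in s.items()})
```

Since F1 is the harmonic mean of precision and recall, it always lies between the two, which is a handy check when reading the tables.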

frenchNER_3entities

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG |
|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.941 | 0.883 | 0.658 |
| cmarkea/distilcamembert-base-ner | 0.942 | 0.882 | 0.647 |
| NERmembert-base-3entities | 0.966 | 0.940 | 0.876 |
| NERmembert-large-3entities (this model) | 0.969 | 0.947 | 0.890 |
| NERmembert-base-4entities | 0.951 | 0.894 | 0.671 |
| NERmembert-large-4entities | 0.958 | 0.901 | 0.685 |

Full results

| Model | Metrics | PER | LOC | ORG | O | Overall |
|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.918 | 0.860 | 0.831 | 0.992 | 0.974 |
| | Recall | 0.964 | 0.908 | 0.544 | 0.964 | 0.948 |
| | F1 | 0.941 | 0.883 | 0.658 | 0.978 | 0.961 |
| cmarkea/distilcamembert-base-ner | Precision | 0.929 | 0.861 | 0.813 | 0.991 | 0.974 |
| | Recall | 0.956 | 0.905 | 0.537 | 0.965 | 0.948 |
| | F1 | 0.942 | 0.882 | 0.647 | 0.978 | 0.961 |
| NERmembert-base-3entities | Precision | 0.961 | 0.935 | 0.877 | 0.995 | 0.986 |
| | Recall | 0.972 | 0.946 | 0.876 | 0.994 | 0.986 |
| | F1 | 0.966 | 0.940 | 0.876 | 0.994 | 0.986 |
| NERmembert-large-3entities (this model) | Precision | 0.966 | 0.944 | 0.884 | 0.996 | 0.987 |
| | Recall | 0.972 | 0.950 | 0.896 | 0.994 | 0.987 |
| | F1 | 0.969 | 0.947 | 0.890 | 0.995 | 0.987 |
| NERmembert-base-4entities | Precision | 0.946 | 0.884 | 0.859 | 0.993 | 0.971 |
| | Recall | 0.955 | 0.904 | 0.550 | 0.993 | 0.971 |
| | F1 | 0.951 | 0.894 | 0.671 | 0.988 | 0.971 |
| NERmembert-large-4entities | Precision | 0.955 | 0.896 | 0.866 | 0.983 | 0.974 |
| | Recall | 0.960 | 0.906 | 0.567 | 0.994 | 0.974 |
| | F1 | 0.958 | 0.901 | 0.685 | 0.988 | 0.974 |

In detail:

multiconer

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG |
|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.940 | 0.761 | 0.723 |
| cmarkea/distilcamembert-base-ner | 0.921 | 0.748 | 0.694 |
| NERmembert-base-3entities | 0.960 | 0.887 | 0.876 |
| NERmembert-large-3entities (this model) | 0.965 | 0.902 | 0.896 |
| NERmembert-base-4entities | 0.960 | 0.890 | 0.867 |
| NERmembert-large-4entities | 0.969 | 0.919 | 0.904 |

Full results

| Model | Metrics | PER | LOC | ORG | O | Overall |
|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.908 | 0.717 | 0.753 | 0.987 | 0.947 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.878 | 0.880 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.929 | 0.912 |
| cmarkea/distilcamembert-base-ner | Precision | 0.885 | 0.738 | 0.737 | 0.983 | 0.943 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.882 | 0.877 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.930 | 0.909 |
| NERmembert-base-3entities | Precision | 0.957 | 0.894 | 0.876 | 0.986 | 0.972 |
| | Recall | 0.962 | 0.880 | 0.878 | 0.985 | 0.972 |
| | F1 | 0.960 | 0.887 | 0.876 | 0.985 | 0.972 |
| NERmembert-large-3entities (this model) | Precision | 0.960 | 0.903 | 0.916 | 0.987 | 0.976 |
| | Recall | 0.969 | 0.900 | 0.877 | 0.987 | 0.976 |
| | F1 | 0.965 | 0.902 | 0.896 | 0.987 | 0.976 |
| NERmembert-base-4entities | Precision | 0.954 | 0.893 | 0.851 | 0.988 | 0.972 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.984 | 0.972 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.986 | 0.972 |
| NERmembert-large-4entities | Precision | 0.964 | 0.922 | 0.904 | 0.990 | 0.978 |
| | Recall | 0.975 | 0.917 | 0.904 | 0.988 | 0.978 |
| | F1 | 0.969 | 0.919 | 0.904 | 0.989 | 0.978 |

multinerd

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG |
|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.962 | 0.934 | 0.888 |
| cmarkea/distilcamembert-base-ner | 0.972 | 0.938 | 0.884 |
| NERmembert-base-3entities | 0.985 | 0.973 | 0.938 |
| NERmembert-large-3entities (this model) | 0.987 | 0.979 | 0.953 |
| NERmembert-base-4entities | 0.985 | 0.973 | 0.938 |
| NERmembert-large-4entities | 0.987 | 0.976 | 0.948 |

Full results

| Model | Metrics | PER | LOC | ORG | O | Overall |
|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.931 | 0.893 | 0.827 | 0.999 | 0.988 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.973 | 0.974 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.986 | 0.981 |
| cmarkea/distilcamembert-base-ner | Precision | 0.954 | 0.908 | 0.817 | 0.999 | 0.990 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.975 | 0.975 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.987 | 0.983 |
| NERmembert-base-3entities | Precision | 0.974 | 0.965 | 0.910 | 0.999 | 0.995 |
| | Recall | 0.995 | 0.981 | 0.968 | 0.996 | 0.995 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.998 | 0.995 |
| NERmembert-large-3entities (this model) | Precision | 0.979 | 0.970 | 0.927 | 0.999 | 0.996 |
| | Recall | 0.996 | 0.987 | 0.980 | 0.997 | 0.996 |
| | F1 | 0.987 | 0.979 | 0.953 | 0.998 | 0.996 |
| NERmembert-base-4entities | Precision | 0.976 | 0.961 | 0.910 | 0.999 | 0.995 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.996 | 0.995 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.998 | 0.995 |
| NERmembert-large-4entities | Precision | 0.979 | 0.967 | 0.922 | 0.999 | 0.996 |
| | Recall | 0.996 | 0.986 | 0.974 | 0.974 | 0.996 |
| | F1 | 0.987 | 0.976 | 0.948 | 0.998 | 0.996 |

wikiner

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG |
|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.986 | 0.966 | 0.938 |
| cmarkea/distilcamembert-base-ner | 0.983 | 0.964 | 0.925 |
| NERmembert-base-3entities | 0.969 | 0.945 | 0.878 |
| NERmembert-large-3entities (this model) | 0.972 | 0.950 | 0.893 |
| NERmembert-base-4entities | 0.970 | 0.945 | 0.876 |
| NERmembert-large-4entities | 0.975 | 0.953 | 0.896 |

Full results

| Model | Metrics | PER | LOC | ORG | O | Overall |
|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.986 | 0.962 | 0.925 | 0.999 | 0.994 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.965 | 0.967 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.982 | 0.980 |
| cmarkea/distilcamembert-base-ner | Precision | 0.982 | 0.951 | 0.910 | 0.998 | 0.994 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.966 | 0.967 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.982 | 0.980 |
| NERmembert-base-3entities | Precision | 0.971 | 0.947 | 0.866 | 0.994 | 0.989 |
| | Recall | 0.969 | 0.942 | 0.891 | 0.995 | 0.989 |
| | F1 | 0.969 | 0.945 | 0.878 | 0.995 | 0.989 |
| NERmembert-large-3entities (this model) | Precision | 0.973 | 0.953 | 0.873 | 0.996 | 0.990 |
| | Recall | 0.990 | 0.948 | 0.913 | 0.995 | 0.990 |
| | F1 | 0.972 | 0.950 | 0.893 | 0.996 | 0.990 |
| NERmembert-base-4entities | Precision | 0.970 | 0.944 | 0.872 | 0.955 | 0.988 |
| | Recall | 0.989 | 0.947 | 0.880 | 0.995 | 0.988 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.995 | 0.988 |
| NERmembert-large-4entities | Precision | 0.975 | 0.957 | 0.872 | 0.996 | 0.991 |
| | Recall | 0.975 | 0.949 | 0.922 | 0.996 | 0.991 |
| | F1 | 0.975 | 0.953 | 0.896 | 0.996 | 0.991 |

wikiann

For space reasons, we show only the F1 of the different models. You can see the full results below the table.


| Model | PER | LOC | ORG |
|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.867 | 0.722 | 0.451 |
| cmarkea/distilcamembert-base-ner | 0.862 | 0.722 | 0.451 |
| NERmembert-base-3entities | 0.947 | 0.906 | 0.886 |
| NERmembert-large-3entities (this model) | 0.949 | 0.912 | 0.899 |
| NERmembert-base-4entities | 0.888 | 0.733 | 0.496 |
| NERmembert-large-4entities | 0.905 | 0.741 | 0.511 |

Full results

| Model | Metrics | PER | LOC | ORG | O | Overall |
|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.862 | 0.700 | 0.864 | 0.867 | 0.832 |
| | Recall | 0.871 | 0.746 | 0.305 | 0.950 | 0.772 |
| | F1 | 0.867 | 0.722 | 0.451 | 0.867 | 0.801 |
| cmarkea/distilcamembert-base-ner | Precision | 0.862 | 0.700 | 0.864 | 0.867 | 0.832 |
| | Recall | 0.871 | 0.746 | 0.305 | 0.950 | 0.772 |
| | F1 | 0.867 | 0.722 | 0.451 | 0.907 | 0.800 |
| NERmembert-base-3entities | Precision | 0.948 | 0.900 | 0.893 | 0.979 | 0.942 |
| | Recall | 0.946 | 0.911 | 0.878 | 0.982 | 0.942 |
| | F1 | 0.947 | 0.906 | 0.886 | 0.980 | 0.942 |
| NERmembert-large-3entities (this model) | Precision | 0.958 | 0.917 | 0.897 | 0.980 | 0.948 |
| | Recall | 0.940 | 0.915 | 0.901 | 0.983 | 0.948 |
| | F1 | 0.949 | 0.912 | 0.899 | 0.983 | 0.948 |
| NERmembert-base-4entities | Precision | 0.895 | 0.727 | 0.903 | 0.766 | 0.794 |
| | Recall | 0.881 | 0.740 | 0.342 | 0.984 | 0.794 |
| | F1 | 0.888 | 0.733 | 0.496 | 0.861 | 0.794 |
| NERmembert-large-4entities | Precision | 0.922 | 0.738 | 0.923 | 0.766 | 0.802 |
| | Recall | 0.888 | 0.743 | 0.353 | 0.988 | 0.802 |
| | F1 | 0.905 | 0.741 | 0.511 | 0.863 | 0.802 |

Usage

Code

from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
# aggregation_strategy="simple" merges sub-word tokens into whole entities.
ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmembert-large-3entities",
    tokenizer="CATIE-AQ/NERmembert-large-3entities",
    aggregation_strategy="simple",
)

result = ner(
    "Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus par le designer Sylvain Boyer avec les agences Royalties & Ecobranding. Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique et Marianne, symbolisée par un visage de femme mais privée de son bonnet phrygien caractéristique. La typographie dessinée fait référence à l'Art déco, mouvement artistique des années 1920, décennie pendant laquelle ont eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la première fois, ce logo sera unique pour les Jeux olympiques et les Jeux paralympiques."
)

print(result)
# Example output:
# [{'entity_group': 'LOC', 'score': 0.96300715, 'word': 'Grand Rex', 'start': 74, 'end': 84},
#  {'entity_group': 'PER', 'score': 0.84991235, 'word': 'Sylvain Boyer', 'start': 164, 'end': 178},
#  {'entity_group': 'ORG', 'score': 0.63318396, 'word': 'Royalties & Ecobranding', 'start': 195, 'end': 219}]
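
The list returned by the pipeline is usually post-processed before use. The sketch below groups the entities from the example above by type and optionally drops low-confidence ones. The helper name `group_entities` and the 0.8 threshold are hypothetical choices for illustration, not part of the model or of Transformers.

```python
from collections import defaultdict

# Output copied from the example above (scores rounded for readability).
result = [
    {"entity_group": "LOC", "score": 0.963, "word": "Grand Rex", "start": 74, "end": 84},
    {"entity_group": "PER", "score": 0.850, "word": "Sylvain Boyer", "start": 164, "end": 178},
    {"entity_group": "ORG", "score": 0.633, "word": "Royalties & Ecobranding", "start": 195, "end": 219},
]

def group_entities(entities, min_score=0.0):
    """Group entity strings by type, dropping those scored below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)

print(group_entities(result))
print(group_entities(result, min_score=0.8))  # hypothetical threshold
```

With the threshold applied, the ORG entity from the example is filtered out because its score is below 0.8. A suitable threshold depends on the application and would need to be tuned on held-out data.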

Try it through Space

A Space has been created to test the model. It is available here.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
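
To illustrate what `lr_scheduler_type: linear` means with these values, the sketch below computes the learning rate at a few steps of a linear decay from 2e-05 to zero. The total of 130,950 steps is taken from the training results reported in this card. The warmup of zero steps is an assumption, since the card does not list a warmup setting.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` for a linear warmup then linear decay schedule."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 130,950 is the final step reported in the training results of this card.
TOTAL_STEPS = 130_950

for step in (0, 43_650, 87_300, 130_950):
    print(f"step {step:>7,}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the learning rate falls by one third of its initial value per epoch and reaches zero at the last step.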

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.0299 | 1.0 | 43650 | 0.0970 | 0.9837 | 0.9837 | 0.9837 | 0.9837 |
| 0.0164 | 2.0 | 87300 | 0.0835 | 0.9864 | 0.9864 | 0.9864 | 0.9864 |
| 0.0108 | 3.0 | 130950 | 0.0846 | 0.9874 | 0.9874 | 0.9874 | 0.9874 |
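
The three epochs can be compared programmatically. The short sketch below, using only the values from the table, shows that the lowest validation loss and the highest F1 do not fall on the same epoch. The card does not state which criterion, if any, was used to select the published checkpoint.

```python
# Rows copied from the training results table.
epochs = [
    {"epoch": 1, "train_loss": 0.0299, "val_loss": 0.0970, "f1": 0.9837},
    {"epoch": 2, "train_loss": 0.0164, "val_loss": 0.0835, "f1": 0.9864},
    {"epoch": 3, "train_loss": 0.0108, "val_loss": 0.0846, "f1": 0.9874},
]

best_by_loss = min(epochs, key=lambda row: row["val_loss"])
best_by_f1 = max(epochs, key=lambda row: row["f1"])

print("lowest validation loss: epoch", best_by_loss["epoch"])
print("highest F1:             epoch", best_by_f1["epoch"])
```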

Framework versions

  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.16.1
  • Tokenizers 0.15.0

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.

  • Hardware Type: A100 PCIe 40/80GB
  • Hours used: 4h31min
  • Cloud Provider: Private Infrastructure
  • Carbon Efficiency (kg/kWh): 0.077 (estimated from electricitymaps for the day of January 12, 2024.)
  • Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 0.09 kg eq. CO2
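
The formula in the last bullet can be worked through numerically. The sketch below uses the runtime and carbon efficiency listed above. The 250 W power draw is an assumption (a nominal figure for an A100 PCIe card), because the card does not state the power value that was used.

```python
# Assumption: 250 W is a nominal power draw for an A100 PCIe card;
# the model card does not state the value that was actually used.
POWER_KW = 0.250
HOURS = 4 + 31 / 60         # "4h31min" from the list above
CARBON_INTENSITY = 0.077    # kg CO2 eq. per kWh, from the list above

energy_kwh = POWER_KW * HOURS
emissions_kg = energy_kwh * CARBON_INTENSITY

print(f"energy:    {energy_kwh:.3f} kWh")
print(f"emissions: {emissions_kg:.3f} kg CO2 eq. ({emissions_kg * 1000:.0f} g)")
```

Under that assumption the estimate rounds to 0.09 kg CO2 eq., which is in line with the `co2_eq_emissions: 90` value in the metadata if that field is read as grams.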

Citations

NERmembert-large-3entities

@misc {NERmembert2024,
    author       = { {BOURDOIS, Loïck} },  
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { NERmembert-large-3entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmembert-large-3entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}

multiconer

@inproceedings{multiconer2-report,  
    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},  
    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},  
    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},  
    year={2023},  
    publisher={Association for Computational Linguistics}}

@article{multiconer2-data,  
    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},  
    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},  
    year={2023}}

multinerd

 @inproceedings{tedeschi-navigli-2022-multinerd,  
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",  
    author = "Tedeschi, Simone and  Navigli, Roberto",  
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",  
    month = jul,  
    year = "2022",  
    address = "Seattle, United States",  
    publisher = "Association for Computational Linguistics",  
    url = "https://aclanthology.org/2022.findings-naacl.60",  
    doi = "10.18653/v1/2022.findings-naacl.60",  
    pages = "801--812"}

pii-masking-200k

@misc {ai4privacy_2023,  
    author = { {ai4Privacy} },  
    title = { pii-masking-200k (Revision 1d4c0a1) },  
    year = 2023,  
    url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },  
    doi = { 10.57967/hf/1532 },  
    publisher = { Hugging Face }}

wikiann

@inproceedings{rahimi-etal-2019-massively,  
    title = "Massively Multilingual Transfer for {NER}",  
    author = "Rahimi, Afshin and Li, Yuan and Cohn, Trevor",  
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",  
    month = jul,  
    year = "2019",  
    address = "Florence, Italy",  
    publisher = "Association for Computational Linguistics",  
    url = "https://www.aclweb.org/anthology/P19-1015",  
    pages = "151--164"}

wikiner

@article{NOTHMAN2013151,  
    title = {Learning multilingual named entity recognition from Wikipedia},  
    journal = {Artificial Intelligence},  
    volume = {194},  
    pages = {151-175},  
    year = {2013},  
    note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},  
    issn = {0004-3702},  
    doi = {https://doi.org/10.1016/j.artint.2012.03.006},  
    url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},  
    author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}

frenchNER_3entities

@misc {frenchNER2024,  
    author       = { {BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { frenchNER_3entities },  
    year         = 2024,  
    url          = { https://huggingface.co/CATIE-AQ/frenchNER_3entities },  
    doi          = { 10.57967/hf/1751 },  
    publisher    = { Hugging Face }  
}

CamemBERT

@inproceedings{martin2020camembert,  
  title={CamemBERT: a Tasty French Language Model},  
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},  
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},  
  year={2020}}

License

cc-by-4.0