
license: cc-by-4.0
base_model: camembert/camembert-large
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: NERmembert-large-3entities
    results: []
datasets:
  - CATIE-AQ/frenchNER_3entities
language:
  - fr
widget:
  - text: >-
      Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au
      Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus
      par le designer Sylvain Boyer avec les agences Royalties & Ecobranding.
      Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique
      et Marianne, symbolisée par un visage de femme mais privée de son bonnet
      phrygien caractéristique. La typographie dessinée fait référence à l'Art
      déco, mouvement artistique des années 1920, décennie pendant laquelle ont
      eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la
      première fois, ce logo sera unique pour les Jeux olympiques et les Jeux
      paralympiques.
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 90

NERmembert-large-3entities

Model Description

We present NERmembert-large-3entities, a CamemBERT large model fine-tuned for Named Entity Recognition in French on five French NER datasets for 3 entities (LOC, PER, ORG).
All these datasets were concatenated and cleaned into a single dataset that we called frenchNER_3entities.
This represents a total of 420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing.
Our methodology is described in a blog post available in English or French.

Dataset

The dataset used is frenchNER_3entities, which represents ~420k sentences labeled in 4 categories (the 3 entity types below, plus O for everything else):

Label	Examples
PER	"La Bruyère", "Gaspard de Coligny", "Wittgenstein"
ORG	"UTBM", "American Airlines", "id Software"
LOC	"République du Cap-Vert", "Créteil", "Bordeaux"

The distribution of the entities is as follows:

Splits	O	PER	LOC	ORG
train	8,398,765	327,393	303,722	151,490
validation	592,815	34,127	30,279	18,743
test	773,871	43,634	39,195	21,391
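
The counts above show how strongly the O label dominates. As a rough illustration, the share of each label per split can be computed as below. This is a minimal sketch that assumes the table counts are per-token labels; the helper name `label_shares` is hypothetical and not part of the dataset tooling.

```python
# Label counts per split, copied from the distribution table above.
counts = {
    "train": {"O": 8_398_765, "PER": 327_393, "LOC": 303_722, "ORG": 151_490},
    "validation": {"O": 592_815, "PER": 34_127, "LOC": 30_279, "ORG": 18_743},
    "test": {"O": 773_871, "PER": 43_634, "LOC": 39_195, "ORG": 21_391},
}


def label_shares(split_counts):
    """Return each label's fraction of all labels in one split (hypothetical helper)."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


shares = {split: label_shares(c) for split, c in counts.items()}
for split, split_shares in shares.items():
    # Print the shares as percentages, one line per split.
    print(split, {label: f"{frac:.1%}" for label, frac in split_shares.items()})
```

In every split ORG is the rarest of the three entity labels, which is worth keeping in mind when reading the per-label scores.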

Evaluation results

The evaluation was carried out using the evaluate Python package.
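
The card does not say which metric from `evaluate` was used. Since the tables report a score for the O label, the sketch below assumes token-level scoring and shows, in plain Python, how per-label precision, recall and F1 could be derived from reference and predicted labels. It is an illustration only, not the `evaluate` API; `per_label_scores` and the toy label sequences are hypothetical.

```python
from collections import Counter


def per_label_scores(references, predictions):
    """Token-level precision, recall and F1 for every label (illustrative helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # the predicted label gains a false positive
            fn[ref] += 1   # the reference label gains a false negative

    scores = {}
    for label in sorted(set(references) | set(predictions)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example: one PER token is missed and one ORG token is tagged LOC.
refs = ["O", "PER", "PER", "O", "LOC", "ORG", "O", "O"]
preds = ["O", "PER", "O", "O", "LOC", "LOC", "O", "O"]
scores = per_label_scores(refs, preds)
for label, values in scores.items():
    print(label, {name: round(value, 3) for name, value in values.items()})
```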

frenchNER_3entities

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG
Jean-Baptiste/camembert-ner	0.941	0.883	0.658
cmarkea/distilcamembert-base-ner	0.942	0.882	0.647
NERmembert-base-3entities	0.966	0.940	0.876
NERmembert-large-3entities (this model)	0.969	0.947	0.890
NERmembert-base-4entities	0.951	0.894	0.671
NERmembert-large-4entities	0.958	0.901	0.685

Full results

Model	Metrics	PER	LOC	ORG	O	Overall
Jean-Baptiste/camembert-ner	Precision	0.918	0.860	0.831	0.992	0.974
	Recall	0.964	0.908	0.544	0.964	0.948
	F1	0.941	0.883	0.658	0.978	0.961
cmarkea/distilcamembert-base-ner	Precision	0.929	0.861	0.813	0.991	0.974
	Recall	0.956	0.905	0.956	0.965	0.948
	F1	0.942	0.882	0.647	0.978	0.961
NERmembert-base-3entities	Precision	0.961	0.935	0.877	0.995	0.986
	Recall	0.972	0.946	0.876	0.994	0.986
	F1	0.966	0.940	0.876	0.994	0.986
NERmembert-large-3entities (this model)	Precision	0.966	0.944	0.884	0.996	0.987
	Recall	0.972	0.950	0.896	0.994	0.987
	F1	0.969	0.947	0.890	0.995	0.987
NERmembert-base-4entities	Precision	0.946	0.884	0.859	0.993	0.971
	Recall	0.955	0.904	0.550	0.993	0.971
	F1	0.951	0.894	0.671	0.988	0.971
NERmembert-large-4entities	Precision	0.955	0.896	0.866	0.983	0.974
	Recall	0.960	0.906	0.567	0.994	0.974
	F1	0.958	0.901	0.685	0.988	0.974

In detail:

multiconer

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG
Jean-Baptiste/camembert-ner	0.940	0.761	0.723
cmarkea/distilcamembert-base-ner	0.921	0.748	0.694
NERmembert-base-3entities	0.960	0.887	0.876
NERmembert-large-3entities (this model)	0.965	0.902	0.896
NERmembert-base-4entities	0.960	0.890	0.867
NERmembert-large-4entities	0.969	0.919	0.904

Full results

Model	Metrics	PER	LOC	ORG	O	Overall
Jean-Baptiste/camembert-ner	Precision	0.908	0.717	0.753	0.987	0.947
	Recall	0.975	0.811	0.696	0.878	0.880
	F1	0.940	0.761	0.723	0.929	0.912
cmarkea/distilcamembert-base-ner	Precision	0.885	0.738	0.737	0.983	0.943
	Recall	0.960	0.759	0.655	0.882	0.877
	F1	0.921	0.748	0.694	0.930	0.909
NERmembert-base-3entities	Precision	0.957	0.894	0.876	0.986	0.972
	Recall	0.962	0.880	0.878	0.985	0.972
	F1	0.960	0.887	0.876	0.985	0.972
NERmembert-large-3entities (this model)	Precision	0.960	0.903	0.916	0.987	0.976
	Recall	0.969	0.900	0.877	0.987	0.976
	F1	0.965	0.902	0.896	0.987	0.976
NERmembert-base-4entities	Precision	0.954	0.893	0.851	0.988	0.972
	Recall	0.967	0.887	0.883	0.984	0.972
	F1	0.960	0.890	0.867	0.986	0.972
NERmembert-large-4entities	Precision	0.964	0.922	0.904	0.990	0.978
	Recall	0.975	0.917	0.904	0.988	0.978
	F1	0.969	0.919	0.904	0.989	0.978

multinerd

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG
Jean-Baptiste/camembert-ner	0.962	0.934	0.888
cmarkea/distilcamembert-base-ner	0.972	0.938	0.884
NERmembert-base-3entities	0.985	0.973	0.938
NERmembert-large-3entities (this model)	0.987	0.979	0.953
NERmembert-base-4entities	0.985	0.973	0.938
NERmembert-large-4entities	0.987	0.976	0.948

Full results

Model	Metrics	PER	LOC	ORG	O	Overall
Jean-Baptiste/camembert-ner	Precision	0.931	0.893	0.827	0.999	0.988
	Recall	0.994	0.980	0.959	0.973	0.974
	F1	0.962	0.934	0.888	0.986	0.981
cmarkea/distilcamembert-base-ner	Precision	0.954	0.908	0.817	0.999	0.990
	Recall	0.991	0.969	0.963	0.975	0.975
	F1	0.972	0.938	0.884	0.987	0.983
NERmembert-base-3entities	Precision	0.974	0.965	0.910	0.999	0.995
	Recall	0.995	0.981	0.968	0.996	0.995
	F1	0.985	0.973	0.938	0.998	0.995
NERmembert-large-3entities (this model)	Precision	0.979	0.970	0.927	0.999	0.996
	Recall	0.996	0.987	0.980	0.997	0.996
	F1	0.987	0.979	0.953	0.998	0.996
NERmembert-base-4entities	Precision	0.976	0.961	0.910	0.999	0.995
	Recall	0.994	0.985	0.967	0.996	0.995
	F1	0.985	0.973	0.938	0.998	0.995
NERmembert-large-4entities	Precision	0.979	0.967	0.922	0.999	0.996
	Recall	0.996	0.986	0.974	0.974	0.996
	F1	0.987	0.976	0.948	0.998	0.996

wikiner

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG
Jean-Baptiste/camembert-ner	0.986	0.966	0.938
cmarkea/distilcamembert-base-ner	0.983	0.964	0.925
NERmembert-base-3entities	0.969	0.945	0.878
NERmembert-large-3entities (this model)	0.972	0.950	0.893
NERmembert-base-4entities	0.970	0.945	0.876
NERmembert-large-4entities	0.975	0.953	0.896

Full results

Model	Metrics	PER	LOC	ORG	O	Overall
Jean-Baptiste/camembert-ner	Precision	0.986	0.962	0.925	0.999	0.994
	Recall	0.987	0.969	0.951	0.965	0.967
	F1	0.986	0.966	0.938	0.982	0.980
cmarkea/distilcamembert-base-ner	Precision	0.982	0.951	0.910	0.998	0.994
	Recall	0.985	0.963	0.940	0.966	0.967
	F1	0.983	0.964	0.925	0.982	0.980
NERmembert-base-3entities	Precision	0.971	0.947	0.866	0.994	0.989
	Recall	0.969	0.942	0.891	0.995	0.989
	F1	0.969	0.945	0.878	0.995	0.989
NERmembert-large-3entities (this model)	Precision	0.973	0.953	0.873	0.996	0.990
	Recall	0.990	0.948	0.913	0.995	0.990
	F1	0.972	0.950	0.893	0.996	0.990
NERmembert-base-4entities	Precision	0.970	0.944	0.872	0.955	0.988
	Recall	0.989	0.947	0.880	0.995	0.988
	F1	0.970	0.945	0.876	0.995	0.988
NERmembert-large-4entities	Precision	0.975	0.957	0.872	0.996	0.991
	Recall	0.975	0.949	0.922	0.996	0.991
	F1	0.975	0.953	0.896	0.996	0.991

wikiann

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG
Jean-Baptiste/camembert-ner	0.867	0.722	0.451
cmarkea/distilcamembert-base-ner	0.862	0.722	0.451
NERmembert-base-3entities	0.947	0.906	0.886
NERmembert-large-3entities (this model)	0.949	0.912	0.899
NERmembert-base-4entities	0.888	0.733	0.496
NERmembert-large-4entities	0.905	0.741	0.511

Full results

Model	Metrics	PER	LOC	ORG	O	Overall
Jean-Baptiste/camembert-ner	Precision	0.862	0.700	0.864	0.867	0.832
	Recall	0.871	0.746	0.305	0.950	0.772
	F1	0.867	0.722	0.451	0.867	0.801
cmarkea/distilcamembert-base-ner	Precision	0.862	0.700	0.864	0.867	0.832
	Recall	0.871	0.746	0.305	0.950	0.772
	F1	0.867	0.722	0.451	0.907	0.800
NERmembert-base-3entities	Precision	0.948	0.900	0.893	0.979	0.942
	Recall	0.946	0.911	0.878	0.982	0.942
	F1	0.947	0.906	0.886	0.980	0.942
NERmembert-large-3entities (this model)	Precision	0.958	0.917	0.897	0.980	0.948
	Recall	0.940	0.915	0.901	0.983	0.948
	F1	0.949	0.912	0.899	0.983	0.948
NERmembert-base-4entities	Precision	0.895	0.727	0.903	0.766	0.794
	Recall	0.881	0.740	0.342	0.984	0.794
	F1	0.888	0.733	0.496	0.861	0.794
NERmembert-large-4entities	Precision	0.922	0.738	0.923	0.766	0.802
	Recall	0.888	0.743	0.353	0.988	0.802
	F1	0.905	0.741	0.511	0.863	0.802

Usage

Code

from transformers import pipeline

ner = pipeline('token-classification', model='CATIE-AQ/NERmembert-large-3entities', tokenizer='CATIE-AQ/NERmembert-large-3entities', aggregation_strategy="simple")

result = ner(
"Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus par le designer Sylvain Boyer avec les agences Royalties & Ecobranding. Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique et Marianne, symbolisée par un visage de femme mais privée de son bonnet phrygien caractéristique. La typographie dessinée fait référence à l'Art déco, mouvement artistique des années 1920, décennie pendant laquelle ont eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la première fois, ce logo sera unique pour les Jeux olympiques et les Jeux paralympiques."
)

print(result)

[{'entity_group': 'LOC', 'score': 0.96300715, 'word': 'Grand Rex', 'start': 74, 'end': 84},
{'entity_group': 'PER', 'score': 0.84991235, 'word': 'Sylvain Boyer', 'start': 164, 'end': 178},
{'entity_group': 'ORG', 'score': 0.63318396, 'word': 'Royalties & Ecobranding', 'start': 195, 'end': 219}]
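
The pipeline returns a flat list of dictionaries, so a common next step is to group the detected spans by entity type and drop low-confidence ones. The sketch below works on the output printed above, without reloading the model. The helper `group_entities` and the 0.8 threshold are hypothetical choices for illustration, not recommendations from the authors.

```python
from collections import defaultdict

# Entities exactly as printed in the output above.
result = [
    {"entity_group": "LOC", "score": 0.96300715, "word": "Grand Rex", "start": 74, "end": 84},
    {"entity_group": "PER", "score": 0.84991235, "word": "Sylvain Boyer", "start": 164, "end": 178},
    {"entity_group": "ORG", "score": 0.63318396, "word": "Royalties & Ecobranding", "start": 195, "end": 219},
]


def group_entities(entities, min_score=0.0):
    """Group entity texts by label, keeping only those scored at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_entities(result))                 # every detected entity
print(group_entities(result, min_score=0.8))  # drops the lower-scored ORG span
```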

Try it through Space

A Space has been created to test the model. It is available here.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3
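
To make `lr_scheduler_type: linear` concrete, the sketch below computes the learning rate at a given optimizer step for a linear decay from 2e-05 to zero. It assumes no warmup (the card lists none) and takes the total of 130,950 steps from the last row of the training results; `linear_lr` is a hypothetical helper, not the Transformers scheduler itself.

```python
def linear_lr(step, base_lr=2e-05, total_steps=130_950, warmup_steps=0):
    """Learning rate under linear decay to zero, with optional linear warmup.

    warmup_steps=0 is an assumption: the card does not mention any warmup.
    """
    if warmup_steps and step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, at the end of each epoch (43,650 steps per epoch).
for step in (0, 43_650, 87_300, 130_950):
    print(step, linear_lr(step))
```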

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0299	1.0	43650	0.0970	0.9837	0.9837	0.9837	0.9837
0.0164	2.0	87300	0.0835	0.9864	0.9864	0.9864	0.9864
0.0108	3.0	130950	0.0846	0.9874	0.9874	0.9874	0.9874

Framework versions

Transformers 4.36.2
Pytorch 2.1.2
Datasets 2.16.1
Tokenizers 0.15.0

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.

Hardware Type: A100 PCIe 40/80GB
Hours used: 4h31min
Cloud Provider: Private Infrastructure
Carbon Efficiency (kg/kWh): 0.077 (estimated from electricitymaps for January 12, 2024)
Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 0.09 kg eq. CO2
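
The formula in the last item (power consumption x time x carbon intensity) can be checked with a few lines of Python. The runtime and carbon efficiency come from the list above; the 250 W power draw is an assumption (a typical rated power for an A100 PCIe card), since the card does not state the value used.

```python
def co2_kg(power_kw, hours, kg_per_kwh):
    """Estimated emissions in kg eq. CO2: power (kW) x time (h) x carbon intensity (kg/kWh)."""
    return power_kw * hours * kg_per_kwh


hours = 4 + 31 / 60   # "4h31min" from the list above
power_kw = 0.250      # ASSUMPTION: 250 W draw for one A100 PCIe, not stated in the card
kg_per_kwh = 0.077    # carbon efficiency from the list above

emitted = co2_kg(power_kw, hours, kg_per_kwh)
print(f"{emitted:.3f} kg eq. CO2 ({emitted * 1000:.0f} g)")
```

Under that assumption the estimate comes out at roughly 87 g, in line with the `co2_eq_emissions: 90` value in the metadata, which Hugging Face expresses in grams.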

Citations

NERmembert-large-3entities

@misc {NERmembert2024,
    author       = { {BOURDOIS, Loïck} },  
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { NERmembert-large-3entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmembert-large-3entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}

multiconer

@inproceedings{multiconer2-report,  
    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},  
    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},  
    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},  
    year={2023},  
    publisher={Association for Computational Linguistics}}

@article{multiconer2-data,  
    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},  
    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},  
    year={2023}}

multinerd

@inproceedings{tedeschi-navigli-2022-multinerd,  
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",  
    author = "Tedeschi, Simone and  Navigli, Roberto",  
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",  
    month = jul,  
    year = "2022",  
    address = "Seattle, United States",  
    publisher = "Association for Computational Linguistics",  
    url = "https://aclanthology.org/2022.findings-naacl.60",  
    doi = "10.18653/v1/2022.findings-naacl.60",  
    pages = "801--812"}

pii-masking-200k

@misc {ai4privacy_2023,  
    author = { {ai4Privacy} },  
    title = { pii-masking-200k (Revision 1d4c0a1) },  
    year = 2023,  
    url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },  
    doi = { 10.57967/hf/1532 },  
    publisher = { Hugging Face }}

wikiann

@inproceedings{rahimi-etal-2019-massively,  
    title = "Massively Multilingual Transfer for {NER}",  
    author = "Rahimi, Afshin and Li, Yuan and Cohn, Trevor",  
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",  
    month = jul,  
    year = "2019",  
    address = "Florence, Italy",  
    publisher = "Association for Computational Linguistics",  
    url = "https://www.aclweb.org/anthology/P19-1015",  
    pages = "151--164"}

wikiner

@article{NOTHMAN2013151,  
    title = {Learning multilingual named entity recognition from Wikipedia},  
    journal = {Artificial Intelligence},  
    volume = {194},  
    pages = {151-175},  
    year = {2013},  
    note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},  
    issn = {0004-3702},  
    doi = {https://doi.org/10.1016/j.artint.2012.03.006},  
    url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},  
    author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}

frenchNER_3entities

@misc {frenchNER2024,  
    author       = { {BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { frenchNER_3entities },  
    year         = 2024,  
    url          = { https://huggingface.co/CATIE-AQ/frenchNER_3entities },  
    doi          = { 10.57967/hf/1751 },  
    publisher    = { Hugging Face }  
}

CamemBERT

@inproceedings{martin2020camembert,  
  title={CamemBERT: a Tasty French Language Model},  
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},  
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},  
  year={2020}}

License