Files changed (1)
  1. README.md +67 -24
README.md CHANGED
@@ -7,15 +7,58 @@ datasets:
7
 
8
  # CamemBERT: a Tasty French Language Model
9
 
10
- ## Introduction
11
 
12
- [CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
13
 
14
- It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
15
 
16
- For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)
17
 
18
- ## Pre-trained models
19
 
20
  | Model | #params | Arch. | Training data |
21
  |--------------------------------|--------------------------------|-------|-----------------------------------|
@@ -26,7 +69,25 @@ For further information or requests, please go to [Camembert Website](https://ca
26
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
27
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
28
 
29
- ## How to use CamemBERT with HuggingFace
30
 
31
  ##### Load CamemBERT and its sub-word tokenizer :
32
  ```python
@@ -95,21 +156,3 @@ all_layer_embeddings[5]
95
  # ...,
96
  ```
97
 
98
-
99
- ## Authors
100
-
101
- CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
102
-
103
-
104
- ## Citation
105
- If you use our work, please cite:
106
-
107
- ```bibtex
108
- @inproceedings{martin2020camembert,
109
- title={CamemBERT: a Tasty French Language Model},
110
- author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
111
- booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
112
- year={2020}
113
- }
114
- ```
115
-
 
7
 
8
  # CamemBERT: a Tasty French Language Model
9
 
10
+ ## Table of Contents
11
+ - [Model Details](#model-details)
12
+ - [Uses](#uses)
13
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
14
+ - [Training](#training)
15
+ - [Evaluation](#evaluation)
16
+ - [Citation Information](#citation-information)
17
+ - [How to Get Started With the Model](#how-to-get-started-with-the-model)
18
+
19
+
20
+ ## Model Details
21
+ - **Model Description:**
22
+ CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
23
+ It is available on Hugging Face in 6 versions that differ in number of parameters, amount of pretraining data, and pretraining data source domains.
24
+ - **Developed by:** Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
25
+ - **Model Type:** Fill-Mask
26
+ - **Language(s):** French
27
+ - **License:** MIT
28
+ - **Parent Model:** See the [RoBERTa base model](https://huggingface.co/roberta-base) card for more information.
29
+ - **Resources for more information:**
30
+ - [Research Paper](https://arxiv.org/abs/1911.03894)
31
+ - [Camembert Website](https://camembert-model.fr/)
32
+
33
+
34
+ ## Uses
35
 
36
+ #### Direct Use
37
 
38
+ This model can be used for Fill-Mask tasks.
39
+
40
+
41
+ ## Risks, Limitations and Biases
42
+ **CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
43
+
44
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
45
+
46
+ This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:
47
+
48
+ > The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
49
+
50
+ > Constructed from Common Crawl, Personal and sensitive information might be present.
51
+
52
+
53
+
54
+ ## Training
55
 
 
56
 
57
+ #### Training Data
58
+ OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
59
+
60
+
61
+ #### Training Procedure
62
 
63
  | Model | #params | Arch. | Training data |
64
  |--------------------------------|--------------------------------|-------|-----------------------------------|
 
69
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
70
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
71
 
72
+ ## Evaluation
73
+
74
+
75
+ The model developers evaluated CamemBERT on four downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).
76
+
77
+
78
+
79
+ ## Citation Information
80
+
81
+ ```bibtex
82
+ @inproceedings{martin2020camembert,
83
+ title={CamemBERT: a Tasty French Language Model},
84
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
85
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
86
+ year={2020}
87
+ }
88
+ ```
89
+
90
+ ## How to Get Started With the Model
91
 
92
  ##### Load CamemBERT and its sub-word tokenizer :
93
  ```python
 
156
  # ...,
157
  ```
158
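
The new "Direct Use" section says the model can be used for Fill-Mask tasks but does not show a call. The following is a minimal sketch of what that could look like with the `transformers` fill-mask pipeline. The checkpoint name `camembert-base`, the helper names `fill_mask` and `top_predictions`, and the example sentence are assumptions made for illustration and are not part of this diff; any checkpoint from the table could presumably be substituted.

```python
def top_predictions(results, k=3):
    """Return (token_str, score) pairs for the k highest-scoring candidates.

    `results` is the list of dicts produced by a fill-mask pipeline for a
    sentence containing a single mask token.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 3)) for r in ranked[:k]]


def fill_mask(sentence, model_name="camembert-base"):
    """Run the fill-mask pipeline on `sentence` (hypothetical helper).

    The import is deferred so that this module can be loaded without
    transformers installed; the model is downloaded on the first call.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    return unmasker(sentence)


# Example call (requires network access on first use):
#   print(top_predictions(fill_mask("Le camembert est <mask> :)")))
```

The ranking helper is kept separate from the model call so that it can be checked on plain dictionaries without downloading any weights. The sketch assumes exactly one `<mask>` token per sentence; with several masks the pipeline returns one list per mask, and `top_predictions` would need to be applied to each list.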