Update README.md

README.md

Training data is provided by the Cour de Cassation (the original source being Jurinet data, but with pseudo-anonymisation applied). For training, we use a total of 159,836 parallel examples (each example is a sommaire-titrage pair). Our development data consists of 1,833 held-out examples.
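Purely as an illustrative sketch of the data layout (the field names and placeholder strings below are hypothetical, not taken from the actual Jurinet data), each parallel example can be thought of as a source *sommaire* paired with a target *titrage*:

```python
# Hypothetical illustration of one parallel training example; the strings are
# placeholders, not real Jurinet content.
example = {
    "sommaire": "<texte du sommaire>",  # source: the summary to be titled
    "titrage": "<titrage cible>",       # target: the titrage to be predicted
}

# For seq2seq training, the pair is simply used as (source, target) text.
source, target = example["sommaire"], example["titrage"]
```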
## Training procedure

### Preprocessing

We use SentencePiece with the BPE strategy and a joint vocabulary of 8000 tokens. This model was converted into the HuggingFace format and integrates a number of normalisation processes (e.g. removing doubled apostrophes and quotes, normalising different accent formats, and lowercasing).
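As a minimal, non-authoritative sketch of how such a joint subword model could be built with the SentencePiece Python bindings (the file names and the exact normalisation rules below are assumptions for illustration; the real preprocessing pipeline is the one described above and in the paper):

```python
import re
import unicodedata

import sentencepiece as spm


def normalise(text: str) -> str:
    """Illustrative normalisation in the spirit of the description above:
    unify accents, apostrophes and quotes, then lowercase (details assumed)."""
    text = unicodedata.normalize("NFKC", text)       # normalise accent encodings
    text = text.replace("’", "'").replace("‘", "'")  # normalise apostrophes
    text = re.sub(r"[«»“”]", '"', text)              # normalise quotation marks
    text = re.sub(r"''+", "'", text)                 # collapse doubled apostrophes
    return text.lower()


# Train a joint BPE vocabulary of 8000 tokens over source and target text
# ("sommaires.txt" and "titrages.txt" are hypothetical file names).
spm.SentencePieceTrainer.train(
    input="sommaires.txt,titrages.txt",
    model_prefix="joint_bpe",
    model_type="bpe",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
print(sp.encode(normalise("Exemple de sommaire"), out_type=str))
```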
### Training

The model was initially trained using Fairseq until convergence on the development set (according to our customised weighted accuracy measure; please see [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf) for more details). The model was then converted to the HuggingFace format, and training was continued to smooth out inconsistencies introduced during the conversion procedure (incompatibilities in the way the SentencePiece and NMT vocabularies are defined, linked to the fact that HuggingFace vocabularies must be the same as the tokeniser vocabulary, a constraint that is not imposed in Fairseq).
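For completeness, a hedged usage sketch with the converted HuggingFace checkpoint (the model identifier is a placeholder for this repository's name, and the generation settings are illustrative assumptions, not the configuration evaluated in the paper):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder identifier: replace with this repository's actual model name.
model_name = "<this-model-repo>"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Generate a titrage prediction from a sommaire (placeholder input text).
sommaire = "texte du sommaire à titrer"
inputs = tokenizer(sommaire, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Beam search is shown here only as a common default for NMT-style models; see the paper for the decoding setup actually used in our experiments.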
### Evaluation results

Full results for the initial Fairseq models can be found in [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf).

Results on this converted model coming soon!

## BibTex entry and citation info

<a name="cite"></a>