rbawden committed on
Commit 59b197b
1 Parent(s): 5899b32

Update README.md

Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -41,18 +41,22 @@ The models' predictions should not be taken as ground-truth *titrages* and shoul
 
 Training data is provided by the Cour de Cassation (the original source being Jurinet data, but with pseudo-anonymisation applied). For training, we use a total of 159,836 parallel examples (each example is a sommaire-titrage pair). Our development data consists of 1,833 held-out examples.
 
-## Training procedure
-
 
+## Training procedure
 
 ### Preprocessing
 
+We use SentencePiece with the BPE strategy and a joint vocabulary of 8,000 tokens. This model was converted into the HuggingFace format and integrates a number of normalisation processes (e.g. removing doubled apostrophes and quotes, normalisation of different accent formats, lowercasing).
 
 ### Training
 
+The model was initially trained using Fairseq until convergence on the development set (according to our customised weighted accuracy measure; please see [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf) for more details). The model was then converted to the HuggingFace format and training was continued to smooth out inconsistencies introduced during the conversion (these stem from incompatibilities in how the SentencePiece and NMT vocabularies are defined: HuggingFace requires the model vocabulary to be identical to the tokeniser vocabulary, a constraint that Fairseq does not impose).
+
 ### Evaluation results
 
-Coming soon
+Full results for the initial Fairseq models can be found in [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf).
+
+Results for this converted model are coming soon!
 
 ## BibTex entry and citation info
 <a name="cite"></a>
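
For reference, the preprocessing described in the new `### Preprocessing` paragraph (a joint 8,000-token SentencePiece BPE vocabulary plus light text normalisation) could be approximated as follows. This is a minimal sketch, not the exact pipeline used for the released model: the file names, the normalisation rules and all training options other than `vocab_size` and `model_type` are assumptions.

```python
# Minimal sketch (assumptions marked): train a joint 8,000-token BPE
# SentencePiece model over concatenated sommaire and titrage text,
# after a toy normalisation pass.
import re
import unicodedata

import sentencepiece as spm


def normalise(line: str) -> str:
    """Toy approximation of the card's normalisation: unify apostrophe and
    quote variants, collapse doubled ones, normalise accents, lowercase."""
    line = unicodedata.normalize("NFC", line)   # one canonical accent encoding
    line = re.sub(r"[’‘`´]", "'", line)         # apostrophe variants -> '
    line = re.sub(r"[«»“”]", '"', line)         # quote variants -> "
    line = re.sub(r"'{2,}", "'", line)          # collapse doubled apostrophes
    line = re.sub(r'"{2,}', '"', line)          # collapse doubled quotes
    return line.lower().strip()


# Joint vocabulary: both sides of the parallel data go into one training file.
# The file names below are placeholders.
with open("train.sommaire") as src, open("train.titrage") as tgt, \
        open("train.norm.txt", "w") as out:
    for line in list(src) + list(tgt):
        out.write(normalise(line) + "\n")

spm.SentencePieceTrainer.train(
    input="train.norm.txt",
    model_prefix="ccass_bpe",
    vocab_size=8000,      # joint vocabulary size stated in the card
    model_type="bpe",     # BPE segmentation strategy
)

sp = spm.SentencePieceProcessor(model_file="ccass_bpe.model")
print(sp.encode(normalise("Contrat de travail, rupture"), out_type=str))
```

Here "joint vocabulary" simply means the sommaire and titrage sides are concatenated before a single SentencePiece model is trained, so source and target share one token inventory.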
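
Once the Fairseq checkpoint has been converted, the model can be used through the standard `transformers` sequence-to-sequence API. The sketch below is illustrative only: the repository id, the input excerpt and the generation settings are assumptions, not values taken from this commit.

```python
# Illustrative usage of the converted checkpoint via the generic transformers
# seq2seq API. Repository id, input text and generation settings are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "rbawden/CCASS-auto-titrages-base"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Toy sommaire fragment (French legal summary); real inputs should be
# normalised and lowercased consistently with the tokeniser's preprocessing.
sommaire = (
    "la cour d'appel ayant relevé que le salarié avait refusé la modification "
    "de son contrat de travail"
)

inputs = tokenizer(sommaire, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```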