Commit 854b8b0 (qanastek)
1 parent: b452b80

Update README.md

Files changed (2)
  1. README.md +6 -4
  2. train.py +54 -0
README.md CHANGED
@@ -10,11 +10,11 @@ widget:
  
  # POET: A French Extended Part-of-Speech Tagger
  
- - Corpus: [UD_FRENCH_TREEBANKS](https://universaldependencies.org/treebanks/fr_gsd/index.html)
+ - Corpora: [UD_FRENCH_TREEBANKS](https://universaldependencies.org/treebanks/fr_gsd/index.html)
  - Model: [Flair](https://www.aclweb.org/anthology/C18-1139.pdf)
  - Embeddings: [FastText](https://fasttext.cc/)
- - Additionnel: [LSTM-CRF](https://arxiv.org/abs/1011.4088)
- - Nombre d'Epochs: 115
+ - Sequence Labelling: [LSTM-CRF](https://arxiv.org/abs/1011.4088)
+ - Number of Epochs: 115
  
  **People Involved**
  
@@ -54,12 +54,14 @@ Output:
  
  `UD_FRENCH_GSD_Plus` is a part-of-speech tagging corpus based on [UD_French-GSD](https://universaldependencies.org/treebanks/fr_gsd/index.html), which was originally created in 2015 and is based on the [universal dependency treebank v2.0](https://github.com/ryanmcd/uni-dep-tb).
  
- Originally, the corpus consisted of 400,399 words (16,341 sentences) and had 17 different classes. After applying our tag augmentation, we obtain 60 different classes, which add semantic information such as gender, number, mood, person, tense, or verb form, taken from the CoNLL-U fields of the original corpus.
+ Originally, the corpus consisted of 400,399 words (16,341 sentences) and had 17 different classes. After applying our tag augmentation, we obtain 60 different classes, which add linguistic and semantic information such as gender, number, mood, person, tense, or verb form, taken from the CoNLL-U fields of the original corpus.
  
  We based our tags on the level of detail given by the [LIA_TAGG](http://pageperso.lif.univ-mrs.fr/frederic.bechet/download.html) statistical POS tagger written by [Frédéric Béchet](http://pageperso.lif.univ-mrs.fr/frederic.bechet/index-english.html) in 2001.
  
  The corpus used for this model is available on [GitHub](https://github.com/qanastek/UD_FRENCH_GSD_PLUS) in the [CoNLL-U format](https://universaldependencies.org/format.html).
  
+ Training data is fed to the model as free text, without a normalization phase. As a result, the model is case- and punctuation-sensitive.
+
  ## Original Tags
  
  ```plain
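Reviewer note on the README hunk above: the tag augmentation it describes combines the coarse part-of-speech class with morphological features stored in the CoNLL-U columns. The sketch below shows one way such a combination could work, using only the standard library. The abbreviation tables and the resulting tag names (for example `NOUNMS`) are assumptions made for illustration; they are not the tag set actually used in `UD_FRENCH_GSD_Plus`.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS value such as 'Gender=Masc|Number=Sing'."""
    if feats == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical abbreviations, chosen only for this example.
GENDER = {"Masc": "M", "Fem": "F"}
NUMBER = {"Sing": "S", "Plur": "P"}


def extended_tag(conllu_line: str) -> str:
    """Build an illustrative extended tag from one CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U token line has 10 tab-separated fields")
    upos = cols[3]                # 4th column: universal POS tag
    feats = parse_feats(cols[5])  # 6th column: morphological features
    suffix = GENDER.get(feats.get("Gender", ""), "")
    suffix += NUMBER.get(feats.get("Number", ""), "")
    return upos + suffix


line = "2\tchat\tchat\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_"
print(extended_tag(line))  # prints "NOUNMS"
```

Tokens without gender or number features keep their original class, so a coordinating conjunction stays `CCONJ` under this scheme.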
train.py ADDED
@@ -0,0 +1,54 @@
+ import os
+ import argparse
+ from datetime import datetime
+
+ from flair.data import Corpus
+ from flair.models import SequenceTagger
+ from flair.trainers import ModelTrainer
+ from flair.datasets import UniversalDependenciesCorpus
+ from flair.embeddings import WordEmbeddings, StackedEmbeddings
+
+ parser = argparse.ArgumentParser(description='Flair Training Part-of-speech tagging')
+ parser.add_argument('-output', type=str, default="models/", help='The output directory')
+ parser.add_argument('-epochs', type=int, default=1, help='Number of Epochs')
+ args = parser.parse_args()
+
+ output = os.path.join(args.output, "UPOS_UD_FRENCH_PLUS_" + str(args.epochs) + "_" + datetime.today().strftime('%Y-%m-%d-%H:%M:%S'))
+ print(output)
+
+ # corpus: Corpus = UD_FRENCH()
+ corpus: Corpus = UniversalDependenciesCorpus(
+     data_folder='UD_FRENCH_PLUS',
+     train_file="fr_gsd-ud-train.conllu",
+     test_file="fr_gsd-ud-test.conllu",
+     dev_file="fr_gsd-ud-dev.conllu",
+ )
+ # print(corpus)
+
+ tag_type = 'upos'
+
+ tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
+ # print(tag_dictionary)
+
+ embedding_types = [
+     WordEmbeddings('fr'),
+ ]
+
+ embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
+
+ tagger: SequenceTagger = SequenceTagger(
+     hidden_size=256,
+     embeddings=embeddings,
+     tag_dictionary=tag_dictionary,
+     tag_type=tag_type,
+     use_crf=True
+ )
+
+ trainer: ModelTrainer = ModelTrainer(tagger, corpus)
+
+ trainer.train(
+     output,
+     learning_rate=0.1,
+     mini_batch_size=128,
+     max_epochs=args.epochs
+ )
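
Reviewer note on `train.py`: the script takes its settings from two single-dash options, `-output` and `-epochs`, and derives the run directory from them plus a timestamp. The sketch below isolates that part so it can be tried without Flair installed. The helper names `build_parser` and `run_dir` are hypothetical, the fixed date is arbitrary, and the timestamp uses dashes instead of colons as a suggested variation (colons are not allowed in Windows paths); the committed script keeps `%H:%M:%S`.

```python
import argparse
import os
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Recreate the two options from train.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Flair Training Part-of-speech tagging")
    parser.add_argument("-output", type=str, default="models/", help="The output directory")
    parser.add_argument("-epochs", type=int, default=1, help="Number of Epochs")
    return parser


def run_dir(output: str, epochs: int, now: datetime) -> str:
    """Compose the run directory name (hypothetical helper).

    Assumption: dashes replace the colons used in the committed script
    so that the name is also valid on Windows.
    """
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    return os.path.join(output, f"UPOS_UD_FRENCH_PLUS_{epochs}_{stamp}")


# Equivalent to running: python train.py -output models/ -epochs 115
args = build_parser().parse_args(["-output", "models/", "-epochs", "115"])
print(run_dir(args.output, args.epochs, datetime(2021, 1, 2, 3, 4, 5)))
```

Passing the timestamp in as an argument, rather than calling `datetime.today()` inside the helper, keeps the naming logic deterministic and easy to check.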