alanakbik committed
Commit 69150c0
1 parent: b12a7e7

initial model commit

Files changed (1)
  1. README.md +148 -0
README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ tags:
+ - flair
+ - token-classification
+ - sequence-tagger-model
+ language:
+ - en
+ - de
+ - nl
+ - es
+ datasets:
+ - conll2003
+ inference: false
+ ---
+
+ ## 4-Language NER in Flair (English, German, Dutch and Spanish)
+
+ This is the standard 4-class NER model for the four CoNLL-03 languages that ships with [Flair](https://github.com/flairNLP/flair/). It also works to some extent for related languages such as French.
+
+ F1-Score: **92.16** (CoNLL-03 English), **87.33** (CoNLL-03 German revised), **88.96** (CoNLL-03 Dutch), **86.65** (CoNLL-03 Spanish)
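
For readers unfamiliar with how such scores are usually computed, here is a minimal sketch of an exact-match, span-level F1. It assumes the scores above follow the common CoNLL convention, in which a predicted entity counts as correct only if both its boundaries and its label match the gold annotation; the README itself does not state the evaluation protocol. The function `span_f1`, the `Entity` tuple layout and the toy data are hypothetical and not part of Flair.

```python
from typing import Set, Tuple

# Hypothetical representation: (start token, end token, label)
Entity = Tuple[int, int, str]


def span_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1 over entity spans, returned as a percentage."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)


# Toy data: the second entity has the right boundaries but the wrong label,
# so it counts as both a false positive and a false negative.
gold = {(1, 2, "PER"), (5, 5, "LOC"), (8, 9, "ORG")}
predicted = {(1, 2, "PER"), (5, 5, "ORG"), (8, 9, "ORG")}
print(round(span_f1(gold, predicted), 2))  # prints 66.67
```

Under this convention a span with a wrong label or a shifted boundary earns no partial credit.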
+
+ Predicts 4 tags:
+
+ | **tag** | **meaning** |
+ |---------|-------------|
+ | PER | person name |
+ | LOC | location name |
+ | ORG | organization name |
+ | MISC | other name |
+
+ Based on [Flair embeddings](https://www.aclweb.org/anthology/C18-1139/) and LSTM-CRF.
+
+ ---
+
+ ### Demo: How to use in Flair
+
+ Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)
+
+ ```python
+ from flair.data import Sentence
+ from flair.models import SequenceTagger
+
+ # load tagger
+ tagger = SequenceTagger.load("flair/ner-multi")
+
+ # make example sentence in any of the four languages
+ sentence = Sentence("George Washington ging nach Washington")
+
+ # predict NER tags
+ tagger.predict(sentence)
+
+ # print sentence
+ print(sentence)
+
+ # print predicted NER spans
+ print('The following NER tags are found:')
+ # iterate over entities and print
+ for entity in sentence.get_spans('ner'):
+     print(entity)
+ ```
+
+ For the predicted spans, this yields the following output:
+ ```
+ Span [1,2]: "George Washington" [− Labels: PER (0.9977)]
+ Span [5]: "Washington" [− Labels: LOC (0.9895)]
+ ```
+
+ So, the entities "*George Washington*" (labeled as a **person**) and "*Washington*" (labeled as a **location**) are found in the sentence "*George Washington ging nach Washington*".
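
To illustrate how per-token predictions turn into spans like the two above, here is a minimal sketch that groups token-level tags into labeled spans with 1-based token positions. It assumes a plain BIO tagging scheme; Flair performs this grouping internally and may use a different scheme. The function `bio_to_spans` and the tag sequence are hypothetical and not part of the Flair API.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (1-based token positions, text, label)
Span = Tuple[List[int], str, Optional[str]]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group token-level BIO tags into labeled entity spans."""
    spans: List[Span] = []
    positions: List[int] = []      # positions of the span being built
    label: Optional[str] = None    # label of the span being built

    def flush() -> None:
        # Close the span under construction, if there is one.
        if positions:
            text = " ".join(tokens[p - 1] for p in positions)
            spans.append((list(positions), text, label))
            positions.clear()

    for i, tag in enumerate(tags, start=1):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A new entity starts here (a stray I- tag is treated as a start).
            flush()
            positions.append(i)
            label = tag_label
        elif prefix == "I":
            # The current entity continues.
            positions.append(i)
        else:
            # "O": this token is outside any entity.
            flush()
            label = None
    flush()
    return spans


tokens = ["George", "Washington", "ging", "nach", "Washington"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]  # assumed tags for illustration
for positions, text, label in bio_to_spans(tokens, tags):
    print(positions, text, label)
# prints:
# [1, 2] George Washington PER
# [5] Washington LOC
```

The same surface form "Washington" ends up in two different spans because the tag of each token depends on its context in the sentence, not on the word alone.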
+
+ ---
+
+ ### Training: Script to train this model
+
+ The following Flair script was used to train this model:
+
+ ```python
+ from flair.data import Corpus, MultiCorpus
+ from flair.datasets import CONLL_03, CONLL_03_GERMAN, CONLL_03_DUTCH, CONLL_03_SPANISH
+ from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
+
+ # 1. get the multi-language corpus
+ corpus: Corpus = MultiCorpus([
+     CONLL_03(),          # English corpus
+     CONLL_03_GERMAN(),   # German corpus
+     CONLL_03_DUTCH(),    # Dutch corpus
+     CONLL_03_SPANISH(),  # Spanish corpus
+ ])
+
+ # 2. what tag do we want to predict?
+ tag_type = 'ner'
+
+ # 3. make the tag dictionary from the corpus
+ tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
+
+ # 4. initialize each embedding we use
+ embedding_types = [
+
+     # GloVe embeddings
+     WordEmbeddings('glove'),
+
+     # FastText embeddings (German)
+     WordEmbeddings('de'),
+
+     # contextual string embeddings, forward
+     FlairEmbeddings('multi-forward'),
+
+     # contextual string embeddings, backward
+     FlairEmbeddings('multi-backward'),
+ ]
+
+ # embedding stack consists of GloVe, FastText and Flair embeddings
+ embeddings = StackedEmbeddings(embeddings=embedding_types)
+
+ # 5. initialize sequence tagger
+ from flair.models import SequenceTagger
+
+ tagger = SequenceTagger(hidden_size=256,
+                         embeddings=embeddings,
+                         tag_dictionary=tag_dictionary,
+                         tag_type=tag_type)
+
+ # 6. initialize trainer
+ from flair.trainers import ModelTrainer
+
+ trainer = ModelTrainer(tagger, corpus)
+
+ # 7. run training
+ trainer.train('resources/taggers/ner-multi',
+               train_with_dev=True,
+               max_epochs=150)
+ ```
+
+ ---
+
+ ### Cite
+
+ Please cite the following paper when using this model.
+
+ ```
+ @inproceedings{akbik2018coling,
+   title={Contextual String Embeddings for Sequence Labeling},
+   author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
+   booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
+   pages = {1638--1649},
+   year = {2018}
+ }
+ ```