stefan-it committed on
Commit 86d2ddf
1 Parent(s): 947455b

readme: add initial version

Files changed (1): README.md +75 -0

---
license: other
language:
- en
- de
- fr
- fi
- sv
- nl
- nb
- nn
- 'no'
---

# hmTEAMS

[![🤗](https://github.com/stefan-it/hmTEAMS/raw/main/logo.jpeg "🤗")](https://github.com/stefan-it/hmTEAMS)

Historic Multilingual and Monolingual [TEAMS](https://aclanthology.org/2021.findings-acl.219/) Models.
The following languages are covered:

* English (British Library Corpus - Books)
* German (Europeana Newspaper)
* French (Europeana Newspaper)
* Finnish (Europeana Newspaper, Digilib)
* Swedish (Europeana Newspaper, Digilib)
* Dutch (Delpher Corpus)
* Norwegian (NCC Corpus)

# Architecture

We pretrain a "Training ELECTRA Augmented with Multi-word Selection"
([TEAMS](https://aclanthology.org/2021.findings-acl.219/)) model:

![hmTEAMS Overview](https://github.com/stefan-it/hmTEAMS/raw/main/hmteams_overview.svg)
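
TEAMS follows the ELECTRA recipe: a small generator fills in masked positions and a discriminator
learns to detect which tokens were replaced, with an additional multi-word selection task trained on top.
The following toy sketch shows only the underlying replaced-token-detection step, using public ELECTRA
checkpoints purely for illustration; it is not the hmTEAMS pretraining code and omits the multi-word
selection head.

```python
# Toy sketch of ELECTRA-style replaced-token detection (the objective TEAMS
# extends). Checkpoints below are illustrative stand-ins, not hmTEAMS code.
import torch
from transformers import AutoTokenizer, ElectraForMaskedLM, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
input_ids = inputs["input_ids"]

# Mask ~15% of the non-special tokens for the generator.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True)
).bool()
mask = (torch.rand(input_ids.shape[1]) < 0.15) & ~special
masked_ids = input_ids.clone()
masked_ids[0, mask] = tokenizer.mask_token_id

# The generator fills the masks; its predictions replace the original tokens.
with torch.no_grad():
    gen_logits = generator(masked_ids, attention_mask=inputs["attention_mask"]).logits
corrupted = input_ids.clone()
corrupted[0, mask] = gen_logits.argmax(dim=-1)[0, mask]

# The discriminator is trained to flag every replaced token.
labels = (corrupted != input_ids).long()
loss = discriminator(corrupted, attention_mask=inputs["attention_mask"], labels=labels).loss
print(f"replaced-token-detection loss: {loss.item():.4f}")
```

In the actual pretraining, generator and discriminator are trained jointly; hmTEAMS releases both as
separate checkpoints (see the Release section below).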

# Experiments

We perform experiments on various historic NER datasets, such as HIPE-2022 or ICDAR Europeana.
All details, incl. hyper-parameters, can be found [here](https://github.com/stefan-it/hmTEAMS/tree/main/bench).

## Small Benchmark

We test our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana.
The following table shows an overview of the datasets used; an example fine-tuning setup is sketched below.

| Language | Dataset | Additional Dataset |
|----------|---------|--------------------|
| English  | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md) | - |
| German   | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md) | - |
| French   | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md) | [ICDAR-Europeana](https://github.com/stefan-it/historic-domain-adaptation-icdar) |
| Finnish  | [NewsEye](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-newseye.md) | - |
| Swedish  | [NewsEye](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-newseye.md) | - |
| Dutch    | [ICDAR-Europeana](https://github.com/stefan-it/historic-domain-adaptation-icdar) | - |
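
As a rough orientation only, a fine-tuning run on one of these datasets could look like the sketch
below, which assumes [Flair](https://github.com/flairNLP/flair); the dataset identifier, hyper-parameters
and output path are illustrative, not the settings from our benchmark, which are documented in the
repository linked above.

```python
# Hedged fine-tuning sketch with Flair. Dataset name, hyper-parameters and
# output path are illustrative; see the benchmark repo for the real settings.
from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = NER_HIPE_2022(dataset_name="ajmc", language="en")
label_dict = corpus.make_label_dictionary(label_type="ner")

# Requires prior access to the gated repo (e.g. via `huggingface-cli login`).
embeddings = TransformerWordEmbeddings(
    "hmteams/teams-base-historic-multilingual-discriminator",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("resources/hmteams-ajmc-en", learning_rate=5e-5, mini_batch_size=16)
```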

# Results

| Model | English AjMC | German AjMC | French AjMC | Finnish NewsEye | Swedish NewsEye | Dutch ICDAR | French ICDAR | Avg. |
|-------|--------------|-------------|-------------|-----------------|-----------------|-------------|--------------|------|
| hmBERT (32k) [Schweter et al.](https://ceur-ws.org/Vol-3180/paper-87.pdf) | 85.36 ± 0.94 | 89.08 ± 0.09 | 85.10 ± 0.60 | 77.28 ± 0.37 | 82.85 ± 0.83 | 82.11 ± 0.61 | 77.21 ± 0.16 | 82.71 |
| hmTEAMS (Ours) | 86.41 ± 0.36 | 88.64 ± 0.42 | 85.41 ± 0.67 | 79.27 ± 1.88 | 82.78 ± 0.60 | 88.21 ± 0.39 | 78.03 ± 0.39 | **84.11** |

# Release

Our pretrained hmTEAMS models can be obtained from the Hugging Face Model Hub. Because of complicated
license issues (that still need to be figured out), the models are only available by requesting access
on the Model Hub:

* [hmTEAMS Discriminator (**this model**)](https://huggingface.co/hmteams/teams-base-historic-multilingual-discriminator)
* [hmTEAMS Generator](https://huggingface.co/hmteams/teams-base-historic-multilingual-generator)
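
Once access has been granted and you are authenticated (e.g. via `huggingface-cli login`), the
discriminator loads like any ELECTRA-style checkpoint. A minimal sketch:

```python
# Minimal loading sketch; requires that access to the gated repo was granted.
from transformers import AutoModel, AutoTokenizer

model_name = "hmteams/teams-base-historic-multilingual-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # add token="hf_..." if not logged in
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Im Jahre 1865 erschien die erste Ausgabe.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual token embeddings
```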

# Acknowledgements

We thank [Luisa März](https://github.com/LuisaMaerz), [Katharina Schmid](https://github.com/schmika) and
[Erion Çano](https://github.com/erionc) for their fruitful discussions about Historic Language Models.

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs ❤️