umarbutler committed
Commit a498abe
1 Parent(s): 494cf32

Expanded benchmarks.

Files changed (1)
  1. README.md +13 -8
README.md CHANGED
@@ -23,6 +23,7 @@ widget:
  - text: >-
      A <mask> of trade is valid to the extent to which it is not against public
      policy, whether it is in severable terms or not.
+ - text: Norfolk Island is an Australian <mask>.
  - text: >-
      In Mabo v <mask> (No 2) (1992) 175 CLR 1, the Court found that the doctrine
      of terra nullius was not applicable to Australia at the time of British
@@ -52,7 +53,8 @@ pipeline_tag: fill-mask

 # EmuBert
 <img src="https://huggingface.co/umarbutler/emubert/resolve/main/logo.png" width="100" align="left" />
- EmuBert is the largest open-source masked language model for Australian law.
+
+ EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.

 Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition** and **question answering**. It can also be used as-is for **semantic similarity**, **vector search** and general **sentence embedding**.

@@ -147,7 +149,7 @@ Whereas the final block of the training set would have been dropped if it did no

 The final training set comprised 2,885,839 blocks totalling 1,477,549,568 tokens, the validation set comprised 155,563 blocks totalling 79,648,256 tokens, and the test set comprised 155,696 blocks totalling 79,716,352 tokens.

- Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary.
+ Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary, as described by Umar Butler in his blog post, [*How to reuse model weights when training with a new tokeniser*](https://umarbutler.com/how-to-reuse-model-weights-when-training-with-a-new-tokeniser/).

 In order to reduce training time, [Better Transformer](https://huggingface.co/docs/optimum/en/bettertransformer/overview) was used to enable fast path execution and scaled dot-product attention, alongside automatic mixed 16-bit precision and [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/reference/optim/adamw#bitsandbytes.optim.AdamW8bit)' 8-bit implementation of AdamW, all of which have been shown to have little to no detrimental effect on performance.

@@ -177,12 +179,15 @@ The code used to create EmuBert may be found [here](https://github.com/umarbutle
 ## Benchmarks 📊
 EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-main.240) of 2.05 against [version 2.0.0](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa/tree/b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae) of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, outperforming all known state-of-the-art masked language models, as shown below:

- | Model          | Perplexity |
- | -------------- | ---------- |
- | **EmuBert**    | **2.05**   |
- | Roberta        | 2.38       |
- | Bert (cased)   | 2.18       |
- | Bert (uncased) | 2.41       |
+ | Model                   | Perplexity |
+ | ----------------------- | ---------- |
+ | **EmuBert**             | **2.05**   |
+ | Bert (cased)            | 2.18       |
+ | Legal-Bert              | 2.33       |
+ | Roberta                 | 2.38       |
+ | Bert (uncased)          | 2.41       |
+ | Legalbert (casehold)    | 3.08       |
+ | Legalbert (pile-of-law) | 4.41       |

 ## Limitations 🚧
 Although informal testing has not revealed any racial, sexual, gender or other social biases, given that Roberta's weights were reused, it is possible that there may be some biases that have been transferred over to EmuBert. It is also possible that there are social biases present in the Corpus that may have been introduced via training. More rigorous testing is necessary to determine the true extent of any biases present in EmuBert.
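
The widget prompts added in the first hunk can be reproduced locally with the standard `transformers` fill-mask pipeline. The snippet below is a minimal sketch and assumes the model is published under the `umarbutler/emubert` repository id implied by the logo URL in the diff.

```python
# Minimal sketch: querying EmuBert with the fill-mask pipeline, using the new
# widget prompt from the diff above. Assumes the `umarbutler/emubert` repo id.
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='umarbutler/emubert')

for prediction in fill_mask('Norfolk Island is an Australian <mask>.'):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```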
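The embedding-remapping step described in the third hunk (reusing Roberta's weights while swapping in EmuBert's tokeniser) could be implemented roughly as follows. This is an illustrative sketch rather than the actual training code, and the checkpoint and tokeniser names are stand-ins.

```python
# Illustrative sketch (not the actual training code): reuse Roberta's weights
# with a new tokeniser by moving shared tokens' embeddings to their new ids
# and initialising genuinely new tokens with the average Roberta embedding.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tokeniser = AutoTokenizer.from_pretrained('roberta-base')
new_tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')  # stand-in for the new tokeniser
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

old_vocab = old_tokeniser.get_vocab()
old_embeddings = model.get_input_embeddings().weight.detach().clone()
average_embedding = old_embeddings.mean(dim=0)

model.resize_token_embeddings(len(new_tokeniser))

with torch.no_grad():
    new_embeddings = model.get_input_embeddings().weight

    for token, new_id in new_tokeniser.get_vocab().items():
        if token in old_vocab:
            # Shared token: move its Roberta embedding to its new position.
            new_embeddings[new_id] = old_embeddings[old_vocab[token]]
        else:
            # Unshared token: fall back to the average token embedding.
            new_embeddings[new_id] = average_embedding
```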
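The training-time optimisations named in the same hunk's context (Better Transformer fast-path execution, automatic mixed 16-bit precision and bitsandbytes' 8-bit AdamW) are typically wired together along roughly the following lines. The learning rate, the shape of `batch` and the loop itself are placeholders, not the values actually used.

```python
# Rough sketch of the optimisations named in the README: Better Transformer,
# 16-bit automatic mixed precision and bitsandbytes' 8-bit AdamW.
import torch
import bitsandbytes as bnb
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('roberta-base').cuda()
model = BetterTransformer.transform(model)  # fast-path execution + scaled dot-product attention

optimiser = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)  # placeholder learning rate
scaler = torch.cuda.amp.GradScaler()

def training_step(batch: dict) -> float:
    """Run one mixed-precision masked-language-modelling step."""
    optimiser.zero_grad()

    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch).loss  # batch holds input_ids, attention_mask and labels

    scaler.scale(loss).backward()
    scaler.step(optimiser)
    scaler.update()

    return loss.item()
```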
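The benchmark figure in the final hunk is a pseudo-perplexity in the sense of Salazar et al. (2020), the paper linked in the diff: each token is masked in turn, its log-probability under the model is recorded, and the exponentiated negative mean gives the score. A minimal sketch, again assuming the `umarbutler/emubert` model id and with an illustrative input sentence:

```python
# Minimal sketch of (pseudo-)perplexity: mask each token in turn, record its
# log-probability under the model, and exponentiate the negative mean.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')
model = AutoModelForMaskedLM.from_pretrained('umarbutler/emubert').eval()

def pseudo_perplexity(text: str) -> float:
    input_ids = tokeniser(text, return_tensors='pt')['input_ids'][0]
    token_log_probs = []

    for i in range(1, len(input_ids) - 1):  # skip the <s> and </s> special tokens
        masked = input_ids.clone()
        masked[i] = tokeniser.mask_token_id

        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]

        token_log_probs.append(torch.log_softmax(logits, dim=-1)[input_ids[i]])

    return torch.exp(-torch.stack(token_log_probs).mean()).item()

print(pseudo_perplexity('Norfolk Island is an Australian territory.'))
```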