umarbutler committed a498abe (parent: 494cf32): Expanded benchmarks.

README.md CHANGED
```diff
@@ -23,6 +23,7 @@ widget:
 - text: >-
     A <mask> of trade is valid to the extent to which it is not against public
     policy, whether it is in severable terms or not.
+- text: Norfolk Island is an Australian <mask>.
 - text: >-
     In Mabo v <mask> (No 2) (1992) 175 CLR 1, the Court found that the doctrine
     of terra nullius was not applicable to Australia at the time of British
```
```diff
@@ -52,7 +53,8 @@ pipeline_tag: fill-mask
 
 # EmuBert
 <img src="https://huggingface.co/umarbutler/emubert/resolve/main/logo.png" width="100" align="left" />
-
+
+EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.
 
 Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition** and **question answering**. It can also be used as-is for **semantic similarity**, **vector search** and general **sentence embedding**.
 
```
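The description in this hunk, together with the `pipeline_tag: fill-mask` context, points to masked-token prediction as the model's primary entry point. A minimal usage sketch, assuming only the `umarbutler/emubert` model id that already appears in the card's URLs and reusing the widget prompt added in the first hunk:

```python
from transformers import pipeline

# Load EmuBert from the Hugging Face Hub for masked-token prediction.
fill_mask = pipeline('fill-mask', model='umarbutler/emubert')

# One of the fill-mask widget prompts added in this commit.
predictions = fill_mask('Norfolk Island is an Australian <mask>.')

for prediction in predictions:
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

Each returned dictionary also carries the fully completed sentence under the `sequence` key.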
```diff
@@ -147,7 +149,7 @@ Whereas the final block of the training set would have been dropped if it did no
 
 The final training set comprised 2,885,839 blocks totalling 1,477,549,568 tokens, the validation set comprised 155,563 blocks totalling 79,648,256 tokens, and the test set comprised 155,696 blocks totalling 79,716,352 tokens.
 
-Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary.
+Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary, as described by Umar Butler in his blog post, [*How to reuse model weights when training with a new tokeniser*](https://umarbutler.com/how-to-reuse-model-weights-when-training-with-a-new-tokeniser/).
 
 In order to reduce training time, [Better Transformer](https://huggingface.co/docs/optimum/en/bettertransformer/overview) was used to enable fast path execution and scaled dot-product attention, alongside automatic mixed 16-bit precision and [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/reference/optim/adamw#bitsandbytes.optim.AdamW8bit)' 8-bit implementation of AdamW, all of which have been shown to have little to no detrimental effect on performance.
 
```
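The sentence added in this hunk links to the blog post that documents how Roberta's weights were carried over to EmuBert's new vocabulary. The sketch below only illustrates that general technique, not the actual training code: tokens shared between the two vocabularies keep their old embeddings at their new positions, and every other token falls back to the average embedding. The tokeniser path `./emubert-tokeniser` is a hypothetical stand-in.

```python
from transformers import AutoTokenizer, RobertaForMaskedLM

# Donor model and tokeniser (Roberta), plus a new tokeniser trained on the Corpus.
donor = RobertaForMaskedLM.from_pretrained('roberta-base')
donor_tokeniser = AutoTokenizer.from_pretrained('roberta-base')
new_tokeniser = AutoTokenizer.from_pretrained('./emubert-tokeniser')  # hypothetical path

old_embeddings = donor.get_input_embeddings().weight.detach()
mean_embedding = old_embeddings.mean(dim=0)

# Start every token from the average embedding, then move the embeddings of
# shared tokens to their new positions in the new vocabulary.
new_embeddings = mean_embedding.repeat(len(new_tokeniser), 1)
donor_vocab = donor_tokeniser.get_vocab()

for token, new_id in new_tokeniser.get_vocab().items():
    old_id = donor_vocab.get(token)
    if old_id is not None:
        new_embeddings[new_id] = old_embeddings[old_id]

# Swap the rebuilt embedding matrix into the model (the tied output head follows).
donor.resize_token_embeddings(len(new_tokeniser))
donor.get_input_embeddings().weight.data.copy_(new_embeddings)
```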
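For the optimisations named in the final paragraph of this hunk, a rough sketch of how such a training step might be wired together is given below. It uses stand-ins throughout: roberta-base in place of the re-initialised model, a single toy block in place of the real corpus, an illustrative learning rate, and an assumed GPU; the block and token counts above imply 512-token blocks, which is reflected in the truncation length.

```python
import bitsandbytes as bnb
import torch
from optimum.bettertransformer import BetterTransformer
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, RobertaForMaskedLM

# Stand-ins for the real setup: roberta-base instead of the re-initialised model,
# and one toy block instead of the 2,885,839 training blocks of 512 tokens.
tokeniser = AutoTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')
collator = DataCollatorForLanguageModeling(tokeniser, mlm_probability=0.15)
blocks = [tokeniser('A restraint of trade is valid to the extent to which it is not against public policy.',
                    truncation=True, max_length=512)]
train_loader = DataLoader(blocks, batch_size=1, collate_fn=collator)

# Better Transformer for fast path execution and scaled dot-product attention.
model = BetterTransformer.transform(model).to('cuda').train()

optimiser = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)  # illustrative learning rate
scaler = torch.cuda.amp.GradScaler()

for batch in train_loader:
    batch = {key: value.to('cuda') for key, value in batch.items()}
    optimiser.zero_grad()

    # Automatic mixed 16-bit precision for the forward pass and loss.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch).loss

    # Scale the loss so the 16-bit gradients do not underflow, then step.
    scaler.scale(loss).backward()
    scaler.step(optimiser)
    scaler.update()
```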
```diff
@@ -177,12 +179,15 @@ The code used to create EmuBert may be found [here](https://github.com/umarbutle
 ## Benchmarks 📊
 EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-main.240) of 2.05 against [version 2.0.0](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa/tree/b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae) of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, outperforming all known state-of-the-art masked language models, as shown below:
 
-| Model
-
-| **EmuBert**
-
-| Bert
-
+| Model                   | Perplexity |
+| ----------------------- | ---------- |
+| **EmuBert**             | **2.05**   |
+| Bert (cased)            | 2.18       |
+| Legal-Bert              | 2.33       |
+| Roberta                 | 2.38       |
+| Bert (uncased)          | 2.41       |
+| Legalbert (casehold)    | 3.08       |
+| Legalbert (pile-of-law) | 4.41       |
 
 ## Limitations 🚧
 Although informal testing has not revealed any racial, sexual, gender or other social biases, given that Roberta's weights were reused, it is possible that there may be some biases that have been transferred over to EmuBert. It is also possible that there are social biases present in the Corpus that may have been introduced via training. More rigorous testing is necessary to determine the true extent of any biases present in EmuBert.
```
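For context on the metric cited in this hunk, the linked (pseudo-)perplexity of Salazar et al. (2020) masks one token at a time and exponentiates the negative mean log-probability that the model assigns to the original tokens; lower is better, and 2.05 corresponds roughly to the uncertainty of a uniform choice between two candidate tokens. A minimal sketch of that computation, not the benchmarking script actually used, with an illustrative sample sentence:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('umarbutler/emubert').eval()
tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and score the original token, as in Salazar et al. (2020)."""

    input_ids = tokeniser(text, return_tensors='pt')['input_ids'][0]
    log_probs = []

    for i in range(1, len(input_ids) - 1):  # skip the special start and end tokens
        masked = input_ids.clone()
        masked[i] = tokeniser.mask_token_id

        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]

        log_probs.append(torch.log_softmax(logits, dim=-1)[input_ids[i]])

    # Pseudo-perplexity is the exponentiated negative mean pseudo-log-likelihood.
    return torch.exp(-torch.stack(log_probs).mean()).item()

print(pseudo_perplexity('Norfolk Island is an Australian territory.'))
```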