umarbutler committed
Commit 47c02e2
Parent(s): 74f9878
Noted that the quality of EmuBert's embeddings may not be
README.md
CHANGED
````diff
@@ -9,7 +9,6 @@ tags:
 - legal
 - australia
 - generated_from_trainer
-- sentence-similarity
 - feature-extraction
 - fill-mask
 datasets:
@@ -63,14 +62,14 @@ co2_eq_emissions:
 
 EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.
 
-Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.
+Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.
 
 To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
 ## Usage 👩‍💻
 Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
-It is also possible to generate embeddings from the model which can be
+It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
 import math
 import torch
````
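The embeddings snippet referenced in the diff is cut off here after its imports. For context on what such a snippet typically does: the usual way to derive a sentence embedding from a Roberta-style encoder like EmuBert is mean pooling over the final hidden states, weighted by the attention mask so padding tokens are ignored, after which embeddings can be compared with cosine similarity. Below is a minimal, dependency-free sketch of just those two steps; the vectors are illustrative dummy values, not actual EmuBert outputs, and this is not the model card's original snippet.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padding positions.

    hidden_states: seq_len x dim token embeddings for one sequence.
    attention_mask: seq_len flags, 1 for real tokens, 0 for padding.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [total / count for total in totals]

def cosine(a, b):
    """Cosine similarity between two pooled embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Illustrative dummy values: two real tokens followed by one padding token.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = mean_pool(states, mask)
print(embedding)                     # [2.0, 3.0]
print(cosine(embedding, embedding))  # ≈ 1.0 (a vector is maximally similar to itself)
```

With a real model, `hidden_states` would come from the encoder's last hidden state for each tokenised input, and `attention_mask` from the tokeniser; this masked averaging is the same pooling that Sentence Transformers applies by default when wrapping a plain transformer.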