umarbutler committed
Commit 47c02e2 • 1 Parent(s): 74f9878

Noted that the quality of EmuBert's embeddings may not be as high as that of specially trained sentence embedding models unless it is finetuned

Files changed (1): README.md (+2 -3)
README.md CHANGED
@@ -9,7 +9,6 @@ tags:
 - legal
 - australia
 - generated_from_trainer
-- sentence-similarity
 - feature-extraction
 - fill-mask
 datasets:
@@ -63,14 +62,14 @@ co2_eq_emissions:
 
 EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.
 
-Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition** and **question answering**. It can also be used as-is for **text similarity**, **clustering** and general **sentence embedding**.
+Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.
 
 To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
 ## Usage 👩‍💻
 Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
-It is also possible to generate embeddings from the model which can be directly used for tasks like semantic similarity and clustering or for the training of downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
+It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
 import math
 import torch
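
For reference, the Sentence Transformers route mentioned in the revised paragraph above can be expanded into a minimal working sketch. Only the model id and the `SentenceTransformer`/`encode` calls come from the README; the example sentences and the cosine similarity check are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load EmuBert; as no Sentence Transformers pooling config ships with the
# model, a default mean pooling head is created on top of it.
model = SentenceTransformer('umarbutler/emubert')

# Illustrative inputs (assumptions, not drawn from the README).
sentences = [
    'The applicant sought judicial review of the decision.',
    'Judicial review of the decision was sought by the applicant.',
]

embeddings = model.encode(sentences)

# Compare the two embeddings by cosine similarity.
print(cos_sim(embeddings[0], embeddings[1]))
```

As the revised README cautions, such raw embeddings are unlikely to rival purpose-built sentence embedding models unless EmuBert is first finetuned.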
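
The faster snippet itself is cut off by the hunk boundary after its first two imports, so the code below is not the author's original. It is only a plausible sketch of the technique the paragraph describes, assuming batched tokenisation and mean pooling of the final hidden states over non-padding tokens:

```python
import math

import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical reconstruction, not the README's original snippet.
tokenizer = AutoTokenizer.from_pretrained('umarbutler/emubert')
model = AutoModel.from_pretrained('umarbutler/emubert').eval()

def embed(texts: list[str], batch_size: int = 32) -> torch.Tensor:
    """Mean-pool the final hidden states into one embedding per text."""
    embeddings = []

    with torch.inference_mode():
        for i in range(math.ceil(len(texts) / batch_size)):
            batch = texts[i * batch_size:(i + 1) * batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
            hidden = model(**inputs).last_hidden_state  # (batch, tokens, dim)
            mask = inputs['attention_mask'].unsqueeze(-1)  # zeroes out padding
            embeddings.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))

    return torch.cat(embeddings)
```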