umarbutler committed
Commit 47c02e2 • 1 Parent(s): 74f9878

Noted that the quality of EmuBert's embeddings may not be as high as that of specially trained sentence embedding models unless it is finetuned

Files changed (1): README.md (+2 -3)
README.md CHANGED
@@ -9,7 +9,6 @@ tags:
 - legal
 - australia
 - generated_from_trainer
-- sentence-similarity
 - feature-extraction
 - fill-mask
 datasets:
@@ -63,14 +62,14 @@ co2_eq_emissions:
 
 EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.
 
-Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition** and **question answering**. It can also be used as-is for **text similarity**, **clustering** and general **sentence embedding**.
+Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.
 
 To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
 ## Usage 👩‍💻
 Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
-It is also possible to generate embeddings from the model which can be directly used for tasks like semantic similarity and clustering or for the training of downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
+It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
 import math
 import torch
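
For reference, the Sentence Transformers route mentioned in the revised paragraph above can be expanded into a minimal working sketch. Only the model id and the `SentenceTransformer`/`encode` calls come from the README; the example sentences and the cosine similarity check are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load EmuBert; as no Sentence Transformers pooling config ships with the
# model, a default mean pooling head is created on top of it.
model = SentenceTransformer('umarbutler/emubert')

# Illustrative inputs (assumptions, not drawn from the README).
sentences = [
    'The applicant sought judicial review of the decision.',
    'Judicial review of the decision was sought by the applicant.',
]

embeddings = model.encode(sentences)

# Compare the two embeddings by cosine similarity.
print(cos_sim(embeddings[0], embeddings[1]))
```

As the revised README cautions, such raw embeddings are unlikely to rival purpose-built sentence embedding models unless EmuBert is first finetuned.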
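
The faster snippet itself is cut off by the hunk boundary after its first two imports, so the code below is not the author's original. It is only a plausible sketch of the technique the paragraph describes, assuming batched tokenisation and mean pooling of the final hidden states over non-padding tokens:

```python
import math

import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical reconstruction, not the README's original snippet.
tokenizer = AutoTokenizer.from_pretrained('umarbutler/emubert')
model = AutoModel.from_pretrained('umarbutler/emubert').eval()

def embed(texts: list[str], batch_size: int = 32) -> torch.Tensor:
    """Mean-pool the final hidden states into one embedding per text."""
    embeddings = []

    with torch.inference_mode():
        for i in range(math.ceil(len(texts) / batch_size)):
            batch = texts[i * batch_size:(i + 1) * batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
            hidden = model(**inputs).last_hidden_state  # (batch, tokens, dim)
            mask = inputs['attention_mask'].unsqueeze(-1)  # zeroes out padding
            embeddings.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))

    return torch.cat(embeddings)
```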