
AusLaw Embedding Model v1.0

This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is a fine-tune of BAAI/bge-small-en on the High Court of Australia (HCA) case law in the Open Australian Legal Corpus by Umar Butler. Cases sourced from PDF/OCR were not used.

The cases were split into context chunks of fewer than 512 tokens using the bge-small-en tokeniser and semchunk.
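A minimal sketch of that chunking step, assuming semchunk's chunk(text, chunk_size, token_counter) interface; the input text is illustrative:

import semchunk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en')

def count_tokens(text: str) -> int:
    # Token count under the bge-small-en tokeniser (special tokens included).
    return len(tokenizer.encode(text))

case_text = "The appeal is allowed. The orders of the Full Court are set aside ..."
chunks = semchunk.chunk(case_text, chunk_size=512, token_counter=count_tokens)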

mistralai/Mixtral-8x7B-Instruct-v0.1 was used to generate a legal question for each context chunk.
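As an illustration of that step (the prompt wording and generation settings are assumptions, not the actual pipeline used to build the dataset):

from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1")

chunk = "The appellant contends that the primary judge erred in ..."
# Hypothetical prompt, using Mixtral's [INST] ... [/INST] instruction format.
prompt = (
    "[INST] Write one legal question that the following passage from an "
    f"Australian court judgment answers:\n\n{chunk} [/INST]"
)
out = generator(prompt, max_new_tokens=64, return_full_text=False)
question = out[0]["generated_text"].strip()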

129,137 context-question pairs were used for training.

14,348 context-question pairs were used for evaluation (see the table below for results).

On a 10% subset of the validation dataset, the model reached the following hit-rate performance, compared against the base model and OpenAI's default ada embedding model.

Model                       Avg. hit-rate
BAAI/bge-small-en           89%
OpenAI                      92%
adlumal/auslaw-embed-v1.0   97%
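Hit-rate here is typically the fraction of questions whose own source chunk appears in the top-k retrieved chunks. A minimal sketch of such a check, using toy data; the cut-off k behind the table above is not stated:

import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus: questions[i] was generated from chunks[i], so chunk i is the
# single relevant document for question i.
chunks = [
    "The High Court held that the appeal should be dismissed with costs.",
    "Section 92 of the Constitution guarantees free trade among the States.",
]
questions = [
    "What did the High Court decide about the appeal?",
    "Which section of the Constitution protects interstate trade?",
]

model = SentenceTransformer('adlumal/auslaw-embed-v1.0')
chunk_emb = model.encode(chunks, normalize_embeddings=True)
q_emb = model.encode(questions, normalize_embeddings=True)

k = 2  # retrieval cut-off (an assumption)
scores = q_emb @ chunk_emb.T  # cosine similarity, as embeddings are normalised
top_k = np.argsort(-scores, axis=1)[:, :k]
hit_rate = np.mean([i in top_k[i] for i in range(len(questions))])
print(f"hit-rate@{k}: {hit_rate:.0%}")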

Usage (Sentence-Transformers)

Using this model is easy once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('adlumal/auslaw-embed-v1.0')
embeddings = model.encode(sentences)
print(embeddings)
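For retrieval, embed a query and candidate passages and rank by cosine similarity. A small illustrative example (the texts are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('adlumal/auslaw-embed-v1.0')
query = "What is the test for negligence?"
passages = [
    "To establish negligence, the plaintiff must show a duty of care, breach, and damage.",
    "The contract was frustrated by the outbreak of war.",
]
# Score the query against each passage and pick the best match.
scores = util.cos_sim(model.encode(query), model.encode(passages))[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))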

Evaluation Results

The model was evaluated on 10% of the available data. The automated eval results for the final training step are presented below.

Metric          cos_sim        dot_score
Accuracy@1      0.730206301    0.730136604
Accuracy@3      0.859562308    0.859562308
Accuracy@5      0.892737664    0.892737664
Accuracy@10     0.928352384    0.928352384
Precision@1     0.730206301    0.730136604
Recall@1        0.730206301    0.730136604
Precision@3     0.286520769    0.286520769
Recall@3        0.859562308    0.859562308
Precision@5     0.178547533    0.178547533
Recall@5        0.892737664    0.892737664
Precision@10    0.092835238    0.092835238
Recall@10       0.928352384    0.928352384
MRR@10          0.801075782    0.801040934
NDCG@10         0.832189447    0.832163724
MAP@100         0.803593645    0.803558796

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 2583 with parameters:

{'batch_size': 50, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}

Parameters of the fit()-Method:

{
    "epochs": 2,
    "evaluation_steps": 50,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 516,
    "weight_decay": 0.01
}
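Putting the above together, the run roughly corresponds to the following sketch; the pair-loading code and variable names are assumptions, not the author's script:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-small-en')

# (question, context) pairs; MultipleNegativesRankingLoss treats the other
# contexts in each batch of 50 as in-batch negatives.
pairs = [("What did the Court hold?", "The Court held that ...")]  # placeholder data
train_examples = [InputExample(texts=[q, ctx]) for q, ctx in pairs]
train_dataloader = DataLoader(train_examples, batch_size=50)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=516,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)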

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
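Because the final Normalize() module L2-normalises every output embedding, dot product and cosine similarity coincide for this model, which is why the dot_score and cos_sim rows in the evaluation table above are nearly identical.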

Citing & Authors

@misc{malec-2024-auslaw-embed-v1,
    author = {Malec, Adrian Lucas},
    year = {2024},
    title = {AusLaw Embedding v1.0},
    publisher = {Hugging Face},
    version = {1.0},
    url = {https://huggingface.co/adlumal/auslaw-embed-v1.0}
}