mhaseeb1604/bge-m3-law
This model is a fine-tuned version of BAAI/bge-m3, specialized for sentence similarity on legal texts from an Arabic law corpus, in both Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering and semantic search.
Model Overview
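- Architecture: A sentence-transformers model built on an XLM-RoBERTa transformer, followed by CLS pooling and normalization (see Full Model Architecture).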
- Training Data: Trained on a large Arabic law dataset, containing bilingual data in Arabic and English.
- Embedding Size: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text.
- Applications: Ideal for legal applications, such as semantic similarity comparisons, document clustering, and retrieval in a bilingual Arabic-English legal context.
Installation
To use this model, you need the sentence-transformers library installed. You can install it with pip:
pip install -U sentence-transformers
Usage
You can easily load and use this model in Python with the following code:
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')
# Sample sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
# Generate embeddings
embeddings = model.encode(sentences)
# Output embeddings
print(embeddings)
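Because the model ends with a normalization layer, embeddings can be compared directly with cosine similarity. The following is a minimal sketch of ranking passages against a query. The helper names `cosine_similarity`, `rank_by_similarity`, and `search` are hypothetical and not part of the model or the library, and `search` assumes sentence-transformers is installed and the model can be downloaded.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def search(query, documents):
    """Hypothetical helper: embed the query and documents, then rank the documents."""
    # Imported lazily so the pure-Python helpers above work without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mhaseeb1604/bge-m3-law")
    vectors = model.encode([query] + documents)
    ranking = rank_by_similarity(vectors[0], vectors[1:])
    return [(documents[i], score) for i, score in ranking]
```

Calling `search("termination of an employment contract", clauses)` with a list of clause strings would return the clauses ordered by similarity to the query. For large collections, a vectorized or indexed search is likely a better fit than this pure-Python loop.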
Model Training
The model was fine-tuned on Arabic and English legal texts using the following configurations:
- DataLoader:
  - Batch size: 4
  - Sampler: SequentialSampler
- Loss Function: MultipleNegativesRankingLoss with cosine similarity.
- Optimizer: AdamW with learning rate 2e-05.
- Training Parameters:
  - Epochs: 2
  - Warmup Steps: 20
  - Weight Decay: 0.01
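To illustrate what MultipleNegativesRankingLoss optimizes, here is a simplified pure-Python sketch, not the library's implementation. Each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive, so with a batch size of 4 each anchor sees 3 in-batch negatives. The `scale=20.0` value is an assumption (it is the sentence-transformers default; this card does not state the value used), and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

The loss approaches zero when each anchor is most similar to its own positive, and grows when another item in the batch scores higher.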
Full Model Architecture
This model consists of three main components:
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) - XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False})
(2): Normalize()
)
- Transformer Layer: Uses the XLM-RoBERTa model with a maximum sequence length of 8192 tokens.
- Pooling Layer: Uses CLS token pooling, taking the first token's 1024-dimensional output as the sentence embedding.
- Normalization Layer: L2-normalizes the output vectors to unit length, so the dot product of two embeddings equals their cosine similarity.
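The pooling and normalization steps can be sketched on toy data as follows. This is an illustration under simplifying assumptions rather than the library's code: the function names are hypothetical, and the 4-dimensional vectors stand in for the model's 1024-dimensional token outputs.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy "transformer output": 3 tokens with 4-dimensional vectors
# (the real model produces 1024-dimensional vectors).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # CLS token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
```

Only the first token's vector contributes to the result; the remaining token vectors are ignored by CLS pooling.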
Citing & Authors
If you find this repository useful, please consider giving it a star and citing it:
@misc {muhammad_haseeb_2024,
author = { {Muhammad Haseeb} },
title = { bge-m3-law (Revision 2fc0289) },
year = 2024,
url = { https://huggingface.co/mhaseeb1604/bge-m3-law },
doi = { 10.57967/hf/3217 },
publisher = { Hugging Face }
}