---
license: mit
datasets:
- sujet-ai/Sujet-Financial-RAG-EN-Dataset
language:
- en
metrics:
- accuracy
pipeline_tag: sentence-similarity
tags:
- finance
- embedding
- embedding model
- financial qa
- bge
- sentence transformers
- financial rag
---
# Marsilia-Embeddings-EN-Base 🚀

<img src="eval_en.jpg" width="1500" height="1000">


## Introduction 🌟

**Marsilia-Embeddings-EN-Base** is an English-language embedding model designed for financial-domain tasks. It serves as a proof of concept, demonstrating how much fine-tuning an embedding model for a specific task matters in Retrieval-Augmented Generation (RAG) applications.

By focusing on the financial domain, Marsilia-Embeddings-EN-Base achieves performance that surpasses even closed-source models such as OpenAI's embeddings, while offering a more cost-effective solution. This shows how targeted fine-tuning can make open-source models competitive with, or even superior to, proprietary alternatives in specialized domains.

## Model Details 📊

- **Model Type:** Sentence Transformer
- **Language:** English 🇬🇧
- **Base Model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity
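
To make the similarity function listed above concrete, here is a minimal pure-Python sketch of cosine similarity. The toy 3-dimensional vectors are illustrative stand-ins for the model's 768-dimensional embeddings, and the helper name `cosine_similarity` is an assumption of this example, not part of the model's API.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical), standing in for real 768-dimensional embeddings.
query = [1.0, 0.0, 1.0]
passage = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, passage))    # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: no shared direction
```

Because cosine similarity depends only on direction, the length of an embedding vector does not affect the score.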

## Usage 💻

To use this model with the Sentence Transformers library:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sujet-ai/Marsilia-Embeddings-EN-Base")

# Run inference
sentences = [
    'What are the key factors affecting the performance of corporate bonds in the current market?',
    "The corporate bond market has been influenced by several factors in recent months. Interest rates set by central banks have a significant impact, as rising rates tend to decrease bond prices and increase yields. Economic indicators such as GDP growth, inflation rates, and employment figures also play a role in shaping investor sentiment and corporate financial health. Industry-specific trends and individual company performance are crucial, with factors like earnings reports, credit ratings, and debt levels affecting bond valuations. Global events, including geopolitical tensions and trade policies, can create market volatility. Liquidity in the bond market and overall investor risk appetite are additional considerations. It's important for investors to monitor these various factors when assessing corporate bond performance.",
    'CORPORATE BOND HOLDINGS (Continued) Principal Amount (000) Coupon Rate Maturity Date Market Value ($000) Vanguard Short-Term Corporate Bond ETF Bank of America Corp. 2,285 5.015% 1/22/24 2,285 JPMorgan Chase & Co. 2,250 3.875% 2/1/24 2,249 Goldman Sachs Group Inc. 2,200 3.750% 2/25/24 2,197 Morgan Stanley 2,190 3.875% 1/27/24 2,189 Citigroup Inc. 2,145 3.875% 3/26/24 2,141 Wells Fargo & Co. 2,100 3.750% 1/24/24 2,099 Bank of America Corp. 2,050 4.000% 4/1/24 2,047 Truist Bank 2,000 3.800% 10/30/23 2,000 PNC Bank NA 1,950 3.800% 7/25/23 1,950 U.S. Bancorp 1,900 3.375% 2/5/24 1,896 Bank of America Corp. 1,850 4.125% 1/22/24 1,850 Morgan Stanley 1,800 3.737% 4/24/24 1,795 Citigroup Inc. 1,750 3.668% 7/24/24 1,740 Goldman Sachs Group Inc. 1,700 3.625% 1/22/23 1,700 Wells Fargo & Co. 1,650 3.550% 8/14/23 1,650 JPMorgan Chase & Co. 1,600 3.875% 9/10/24 1,593'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

## Intended Use 🎯

This model is designed for generating sentence embeddings for English text, particularly in the financial domain. It can be used for various natural language processing tasks such as semantic search, clustering, and information retrieval.
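
As an illustration of the semantic search use case, the sketch below ranks a small corpus against a query by cosine similarity and returns the top-k matches. It uses hand-written toy vectors in place of real embeddings. The function names `normalize` and `top_k` are assumptions of this example, not functions provided by the model or by Sentence Transformers.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the k most similar corpus vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings; in practice these would come from model.encode(...).
query_vec = [0.9, 0.1, 0.0]
corpus_vecs = [
    [1.0, 0.0, 0.0],  # nearly the same direction as the query
    [0.0, 1.0, 0.0],  # mostly unrelated
    [0.7, 0.7, 0.0],  # partially related
]

hits = top_k(query_vec, corpus_vecs, k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

For large corpora, a vector index would normally replace this linear scan; the scan is used here only to keep the example self-contained.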

## Training Data 📚

The model was fine-tuned on the [sujet-ai/Sujet-Financial-RAG-EN-Dataset](https://huggingface.co/datasets/sujet-ai/Sujet-Financial-RAG-EN-Dataset). This dataset consists of question-context pairs in English, focusing on financial topics.

## Training Procedure 🛠️

### Training Hyperparameters

- **Loss Function:** MultipleNegativesRankingLoss
  - Scale: 20.0
  - Similarity Function: Cosine Similarity
- **Evaluation Strategy:** Steps
- **Per Device Train Batch Size:** 200
- **Per Device Eval Batch Size:** 200
- **Number of Train Epochs:** 10
- **Batch Sampler:** no_duplicates
- **Multi Dataset Batch Sampler:** round_robin
- **Scheduler:** Warmup cosine
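
To show how the loss settings above interact, here is a simplified pure-Python sketch of a multiple-negatives ranking objective. Each question is scored against every context in the batch with cosine similarity, the scores are multiplied by the scale (20.0), and cross-entropy is taken with the matching context as the target. It illustrates the idea under these assumptions and is not the Sentence Transformers implementation; the names `cosine` and `mnr_loss` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    if not anchors or len(anchors) != len(positives):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the scaled similarities.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Hypothetical toy batch: two question embeddings and two context embeddings.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_contexts = [[1.0, 0.0], [0.0, 1.0]]  # each question matches its own context
swapped_contexts = [[0.0, 1.0], [1.0, 0.0]]  # each question matches the other context

print(mnr_loss(questions, aligned_contexts))  # near zero: correct pairs score highest
print(mnr_loss(questions, swapped_contexts))  # large: correct pairs score lowest
```

Since every other context in the batch serves as a negative, a larger batch gives each question more negatives to be ranked against.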

### Framework Versions

- Python: 3.10.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.3
- PyTorch: 2.5.0.dev20240704+cu124
- Accelerate: 0.32.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1

## Evaluation 📈

The model was evaluated using the `InformationRetrievalEvaluator` on the test split of the [sujet-ai/Sujet-Financial-RAG-EN-Dataset](https://huggingface.co/datasets/sujet-ai/Sujet-Financial-RAG-EN-Dataset).
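
For readers unfamiliar with retrieval metrics, the sketch below computes two measures of the kind an information-retrieval evaluation reports: recall@k and mean reciprocal rank (MRR@k). It is a standalone illustration on made-up rankings, not the evaluator's code. The document IDs, query IDs, and function names are hypothetical, and the numbers are not results for this model.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Made-up retrieval output: query id -> document ids ordered by similarity.
rankings: Dict[str, List[str]] = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}
# Made-up ground truth: query id -> set of relevant document ids.
relevant_docs: Dict[str, Set[str]] = {
    "q1": {"d1"},
    "q2": {"d4", "d5"},
}

k = 3
mean_recall = sum(recall_at_k(rankings[q], relevant_docs[q], k) for q in rankings) / len(rankings)
mean_mrr = sum(mrr_at_k(rankings[q], relevant_docs[q], k) for q in rankings) / len(rankings)
print(mean_recall, mean_mrr)
```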

## Limitations ⚠️

The model is specifically trained on English financial texts and may not perform optimally on other domains or languages. Users should be aware of potential biases present in the training data.

## Citation 📄

If you use this model in your research or applications, please cite:

```bibtex
@software{Marsilia-Embeddings-EN-Base,
  author = {{Sujet AI} and Allaa Boutaleb and Hamed Rahimi},
  title = {Marsilia-Embeddings-EN-Base: A fine-tuned English embedding model for financial texts},
  year = {2024},
  url = {https://huggingface.co/sujet-ai/Marsilia-Embeddings-EN-Base}
}
```

## Contact Information 📧

For questions, feedback, or collaborations, please reach out to us on [LinkedIn](https://www.linkedin.com/company/sujet-ai/) or visit our website [https://sujet.ai](https://sujet.ai).