Jack Morris committed
Commit • 66e6e6c
Parent(s): 53aaec6
README formatting
README.md CHANGED
@@ -8648,10 +8648,14 @@ model-index:
 Our new model that naturally integrates "context tokens" into the embedding process. As of October 1st, 2024, `cde-small-v1` is the best small model (under 400M params) on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for text embedding models, with an average score of 65.00.
 
 <b><a href="https://colab.research.google.com/drive/1r8xwbp7_ySL9lP-ve4XMJAHjidB9UkbL?usp=sharing">Try on Colab</a></b>
+<br>
 <b><a href="https://arxiv.org/abs/2410.02525">Contextual Document Embeddings (ArXiv)</a></b>
 
 ![CDE Overview Figure](https://i.imgur.com/LyXJZjM.png)
 
+<br>
+<hr>
+
 # How to use `cde-small-v1`
 
 Our embedding model needs to be used in *two stages*. The first stage is to gather some dataset information by embedding a subset of the corpus using our "first-stage" model. The second stage is to actually embed queries and documents, conditioning on the corpus information from the first stage. Note that we can run the first stage offline and only use the second-stage weights at inference time.
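
For concreteness, here is a minimal sketch of the two-stage flow the README describes. Only `AutoModel.from_pretrained(..., trust_remote_code=True)` and the tokenizer call are standard `transformers` API; the hub id `jxm/cde-small-v1`, the `first_stage_model` / `second_stage_model` attributes, the tokenizer source, and the `dataset_embeddings` keyword are assumptions for illustration, not the model's confirmed interface.

```python
# Sketch of the two-stage embedding flow. Assumed names are marked below;
# they are illustrative guesses, not a confirmed API.
import torch
import transformers

model = transformers.AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("jxm/cde-small-v1")  # assumed tokenizer source

corpus = [
    "Contextual document embeddings condition on corpus statistics.",
    "Transformers are commonly used as text encoders.",
    "MTEB benchmarks text embedding models across many tasks.",
]

# Stage 1 (can run offline): embed a subset of the corpus to obtain the
# "context" information that the second stage conditions on.
minicorpus = tokenizer(corpus, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    dataset_embeddings = model.first_stage_model(**minicorpus)  # assumed attribute

# Stage 2 (at inference time): embed documents, conditioning on the
# first-stage corpus information; queries would be embedded the same way.
docs = tokenizer(corpus, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    doc_embeddings = model.second_stage_model(
        input_ids=docs["input_ids"],
        attention_mask=docs["attention_mask"],
        dataset_embeddings=dataset_embeddings,  # assumed keyword
    )
```

Because stage one depends only on the corpus, its output can be computed once and cached; inference then needs only the second-stage weights, as the README notes.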