avsolatorio committed 7831200 (parent: 9b911f6): Update README.md
---
<h1 align="center">GIST Large Embedding v0</h1>

*GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*

The model is fine-tuned on top of the [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.
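
For instance, unlike some instruction-tuned encoders such as the base BGE model (which recommends prepending a retrieval instruction to queries), text is encoded as-is. The snippet below is a minimal sketch; the query and passage strings are made up for illustration:

```Python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# The query is encoded directly, with no instruction prefix.
query_embedding = model.encode("How are text embedding models fine-tuned?")
passage_embedding = model.encode("Text embedding models are often fine-tuned with a contrastive objective on mined triplets.")
```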

Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)

# Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, are available:

- Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`).
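
As an illustration, the MTEB-derived portion of the data can be selected with the 🤗 `datasets` library. This is a minimal sketch, assuming the package is installed and the split is named `train`:

```Python
from datasets import load_dataset

# Load the exact revision used to train the model.
dataset = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",  # assumed split name
)

# Keep only the mined MTEB classification triplets.
mteb_subset = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(mteb_subset))
```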

The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).

The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably: it yields significant improvements on certain tasks while degrading performance on others.

The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that thematic coverage of the fine-tuning data can affect downstream performance.

# Usage

The model can be easily loaded using the Sentence Transformers library.

```Python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
```
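
A rough sketch of how the loaded model can then be used to score texts, reusing the `torch.nn.functional` import above; the example sentences are made up for illustration:

```Python
texts = [
    "Illustrative sentence about fine-tuning text embedding models.",
    "A passage describing contrastive learning with mined triplets.",
    "An unrelated sentence about cooking pasta.",
]

# Encode directly; no instructions are required.
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity of the first sentence against the other two.
scores = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:], dim=-1)
print(scores)
```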

Training parameters include:

```
Checkpoint step = 171000
Contrastive loss temperature = 0.01
```
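
For reference, the temperature above scales the similarity scores inside a softmax-style contrastive objective. A generic temperature-scaled (InfoNCE-style) form is sketched below; the exact GISTEmbed objective, including the guided in-sample selection of negatives, is specified in the paper:

```latex
% Generic temperature-scaled contrastive loss (not the exact GISTEmbed objective)
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\left(\operatorname{sim}(q_i, p_i) / \tau\right)}
            {\sum_{j=1}^{N} \exp\left(\operatorname{sim}(q_i, p_j) / \tau\right)},
  \qquad \tau = 0.01
```

Here \(q_i\) is an anchor, \(p_i\) its positive, \(\operatorname{sim}\) is cosine similarity, and the denominator runs over the in-batch candidates.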

# Evaluation

The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
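
A minimal sketch of running part of that evaluation with the `mteb` Python package is shown below. The single task chosen is an arbitrary example, and the classic `MTEB` entry point is assumed; newer releases of the package expose a slightly different API:

```Python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Run a single MTEB task as an example; the full benchmark spans many more tasks.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")
print(results)
```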

# Citation

Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗

```
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

# Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.