avsolatorio committed on
Commit 7831200
1 Parent(s): 9b911f6

Update README.md

Files changed (1)
  1. README.md +24 -7
README.md CHANGED
@@ -2608,28 +2608,29 @@ model-index:
---
<h1 align="center">GIST Large Embedding v0</h1>

- *GIST Embedding: Guided In-sample Selection of Training Negatives for Text Embedding*
+ *GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*

The model is fine-tuned on top of the [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.

- Technical details of the model will be published shortly.
+ Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)
+
# Data

- The dataset used is a compilation of the MEDI dataset and the MTEB Classification training dataset. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:
+ The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:

  - Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
  - Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

- The dataset contains a `task_type` key which can be used to select only the mteb classification tasks (prefixed with `mteb_`).
+ The dataset contains a `task_type` key, which can be used to select only the mteb classification tasks (prefixed with `mteb_`).
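
For illustration, a minimal sketch of loading the compiled dataset at the pinned revision and selecting the mined MTEB classification triplets via `task_type`; the `train` split name is an assumption, everything else comes from the dataset and revision listed above:

```python
from datasets import load_dataset

# Load the compiled MEDI + MTEB triplets at the exact revision used for training.
data = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    split="train",  # assumed split name
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
)

# Keep only the mined MTEB classification triplets (task_type prefixed with "mteb_").
mteb_triplets = data.filter(lambda row: row["task_type"].startswith("mteb_"))
print(len(mteb_triplets))
```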

The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).

The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, resulting in significant improvements on certain tasks while degrading performance on others.

- The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID, which could have caused the observed performance degradation. Further work is currently being undertaken to validate this hypothesis.
+ The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that thematic coverage of the fine-tuning data can affect downstream performance.

# Usage

@@ -2639,7 +2640,7 @@ The model can be easily loaded using the Sentence Transformers library.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

- revision = None  # Replace with the specific revision to ensure reproducibility in case the model is updated.
+ revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0", revision=revision)
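
The hunk above shows only the changed comment line; for context, a minimal sketch continuing the snippet, encoding a few placeholder texts and scoring them by cosine similarity (the sample sentences are illustrative, not from the README):

```python
texts = [
    "Illustrative example of a document.",
    "A second, related document.",
    "An unrelated sentence.",
]

# Encode and keep the embeddings as torch tensors.
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity = dot product of L2-normalized vectors (F imported above).
normalized = F.normalize(embeddings, p=2, dim=1)
scores = normalized @ normalized.T
print(scores)
```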
@@ -2671,13 +2672,29 @@ Checkpoint step = 171000
Contrastive loss temperature = 0.01
```
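
To make the `Contrastive loss temperature = 0.01` setting concrete, a generic sketch of a temperature-scaled contrastive (InfoNCE-style) loss over in-batch negatives; this only illustrates where the temperature enters and is not the exact GISTEmbed objective, which additionally uses a guide model to select negatives (see the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Cosine similarity between every query and every positive in the batch.
    q = F.normalize(query_emb, p=2, dim=1)
    p = F.normalize(pos_emb, p=2, dim=1)
    logits = q @ p.T / temperature  # a low temperature sharpens the softmax
    # Matching pairs sit on the diagonal; other rows act as in-batch negatives.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```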

- Specific training details and strategies will be published shortly.

# Evaluation

The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
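
A hedged sketch of reproducing a single benchmark task with the `mteb` package; the task chosen and the exact `mteb` API version are assumptions, not part of this commit:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# Run one illustrative MTEB task; pass a longer list of task names to evaluate more.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/GIST-large-Embedding-v0")
```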

+ # Citation
+
+ Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗
+
+ ```
+ @article{solatorio2024gistembed,
+ title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
+ author={Aivin V. Solatorio},
+ journal={arXiv preprint arXiv:2402.16829},
+ year={2024},
+ url={https://arxiv.org/abs/2402.16829},
+ eprint={2402.16829},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ ```
+
# Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.
 