avsolatorio committed
Commit bf6b2e5
Parent: 71128f9

Update README.md

Files changed (1): README.md (+24 -7)

README.md CHANGED

@@ -4580,28 +4580,29 @@ model-index:
  ---
  <h1 align="center">GIST Embedding v0</h1>
 
- *GIST Embedding: Guided In-sample Selection of Training Negatives for Text Embedding*
+ *GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*
 
  The model is fine-tuned on top of the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).
 
  The model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.
 
- Technical details of the model will be published shortly.
+ Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)
+
 
  # Data
 
- The dataset used is a compilation of the MEDI dataset and the MTEB Classification training dataset. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:
+ The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:
 
  - Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
  - Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb
 
- The dataset contains a `task_type` key which can be used to select only the mteb classification tasks (prefixed with `mteb_`).
+ The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`).
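A minimal sketch of that selection using the 🤗 `datasets` library (the `train` split name is an assumption, not something stated in this diff):

```python
from datasets import load_dataset

# Pin the revision quoted in the Data section for reproducibility.
ds = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",  # assumed split name
)

# Keep only the mined MTEB classification triplets via the `task_type` key.
mteb_triplets = ds.filter(lambda row: row["task_type"].startswith("mteb_"))
print(len(mteb_triplets))
```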
 
  The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).
 
  The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, resulting in significant improvements on certain tasks while degrading performance on others.
 
- The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID, which could have caused the observed performance degradation. Further work is currently being undertaken to validate this hypothesis.
+ The retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that thematic coverage of the fine-tuning data can affect downstream performance.
 
  # Usage
 
@@ -4611,7 +4612,7 @@ The model can be easily loaded using the Sentence Transformers library.
  import torch.nn.functional as F
  from sentence_transformers import SentenceTransformer
 
- revision = None  # Replace with the specific revision to ensure reproducibility in case the model is updated.
+ revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.
 
  model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
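The hunk above shows only the lines surrounding the changed comment. As a minimal, self-contained sketch of the full flow (the example texts are hypothetical, and cosine similarity is just one way to compare the embeddings):

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # pin a specific model revision for reproducibility
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)

# Hypothetical inputs: a query followed by two candidate passages.
texts = [
    "Illustrative query about text embeddings.",
    "A passage that discusses text embedding models.",
    "A passage about an unrelated topic.",
]

# No instruction prefix is needed; encode the raw texts directly.
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity of the query against each candidate passage.
scores = F.cosine_similarity(embeddings[:1], embeddings[1:], dim=-1)
print(scores)
```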
 
@@ -4643,13 +4644,29 @@ Checkpoint step = 103500
  Contrastive loss temperature = 0.01
  ```
 
- Specific training details and strategies will be published shortly.
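For context on the `Contrastive loss temperature = 0.01` entry above: a temperature-scaled contrastive objective divides the similarity logits by the temperature before the softmax, so a small value such as 0.01 sharpens the distribution over in-batch candidates. The sketch below shows only this generic form; it is not the GISTEmbed training loss, whose guided negative selection is detailed in the paper:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(queries: torch.Tensor, positives: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    """Generic temperature-scaled contrastive loss with in-batch negatives.

    queries, positives: (batch, dim) tensors; row i of `positives` is the
    matching pair for row i of `queries`, and all other rows act as negatives.
    """
    q = F.normalize(queries, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.T / temperature  # cosine similarities, sharpened by the temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy check with random embeddings.
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```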
 
  # Evaluation
 
  The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
4652
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4653
  # Acknowledgements
4654
 
4655
  This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.
 