Jack Morris committed
Commit 9792ab1 • Parent(s): 6ab272b
reorg readme

README.md (CHANGED)
@@ -8662,6 +8662,111 @@ Our new model that naturally integrates "context tokens" into the embedding proc

Our embedding model needs to be used in *two stages*. The first stage gathers dataset information by embedding a subset of the corpus with our "first-stage" model. The second stage embeds the actual queries and documents, conditioning on the corpus information from the first stage. Note that the first stage can be done offline, so only the second-stage weights are needed at inference time.

</details>

## With Transformers

<details>
<summary>Click to learn how to use cde-small-v1 with Transformers</summary>

### Loading the model

Our model can be loaded using `transformers` out of the box with "trust remote code" enabled. We use the default BERT uncased tokenizer:

```python
import transformers

model = transformers.AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
```
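
Optionally, you can move the model to a GPU and switch it to eval mode before embedding. This is a minimal, optional setup snippet; the device choice is an assumption about your hardware:

```python
import torch

# Assumption: use a GPU if one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # disable dropout so embeddings are deterministic
```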

#### Note on prefixes

*Nota bene*: Like all state-of-the-art embedding models, our model was trained with task-specific prefixes. To do retrieval, you can prepend the following strings to queries & documents:

```python
query_prefix = "search_query: "
document_prefix = "search_document: "
```
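
For example, the strings actually passed to the tokenizer look like this (the query and document text below are hypothetical placeholders):

```python
# Hypothetical example inputs; only the prefixes are prescribed by the model.
query = query_prefix + "who created foster's home for imaginary friends?"
document = document_prefix + "Foster's Home for Imaginary Friends McCracken conceived the series ..."
```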

### First stage

```python
import torch
from tqdm.autonotebook import tqdm

minicorpus_size = model.config.transductive_corpus_size
minicorpus_docs = [ ... ]  # Put some strings here that are representative of your corpus, for example by calling random.sample(corpus, k=minicorpus_size)
assert len(minicorpus_docs) == minicorpus_size  # You must use exactly this many documents in the minicorpus. You can oversample if your corpus is smaller.

minicorpus_docs = tokenizer(
    [document_prefix + doc for doc in minicorpus_docs],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
).to(model.device)

batch_size = 32

dataset_embeddings = []
for i in tqdm(range(0, len(minicorpus_docs["input_ids"]), batch_size)):
    minicorpus_docs_batch = {k: v[i:i+batch_size] for k, v in minicorpus_docs.items()}
    with torch.no_grad():
        dataset_embeddings.append(
            model.first_stage_model(**minicorpus_docs_batch)
        )

dataset_embeddings = torch.cat(dataset_embeddings)
```
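
The first stage only needs to run once per corpus, so you can compute `dataset_embeddings` offline and reuse them at inference time. A minimal caching sketch; the file name is an arbitrary choice:

```python
# Save once, offline.
torch.save(dataset_embeddings, "dataset_embeddings.pt")

# Later, in the inference process (only the second-stage weights are needed from here on):
dataset_embeddings = torch.load("dataset_embeddings.pt", map_location=model.device)
```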

### Running the second stage

Now that we have obtained "dataset embeddings", we can embed documents and queries as usual. Remember to use the document prefix for documents:

```python
docs = tokenizer(
    [document_prefix + doc for doc in docs],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    doc_embeddings = model.second_stage_model(
        input_ids=docs["input_ids"],
        attention_mask=docs["attention_mask"],
        dataset_embeddings=dataset_embeddings,
    )
doc_embeddings /= doc_embeddings.norm(p=2, dim=1, keepdim=True)
```

and the query prefix for queries:

```python
queries = queries.select(range(16))["text"]  # e.g. take the text of the first 16 queries from a Hugging Face dataset
queries = tokenizer(
    [query_prefix + query for query in queries],
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    query_embeddings = model.second_stage_model(
        input_ids=queries["input_ids"],
        attention_mask=queries["attention_mask"],
        dataset_embeddings=dataset_embeddings,
    )
query_embeddings /= query_embeddings.norm(p=2, dim=1, keepdim=True)
```

These embeddings can be compared using the dot product, since they're normalized.
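
For example, to score every query against every document and pick the best match for each query (a small sketch using the tensors computed above):

```python
# Because both sides are L2-normalized, the dot product equals cosine similarity.
scores = query_embeddings @ doc_embeddings.T   # shape: (num_queries, num_docs)
best_doc_per_query = scores.argmax(dim=1)
```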

</details>

### What if I don't know what my corpus will be ahead of time?

If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings. Our model will still work in this case, just not quite as well: without corpus information, its performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v1/resolve/main/random_strings.txt) that worked well for us and can be used as a substitute for corpus sampling.
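
A minimal sketch of that fallback, reusing the first-stage code above (downloading via `huggingface_hub` is one convenient option; any way of reading the file works):

```python
from huggingface_hub import hf_hub_download

# Download the provided random strings and use them as the minicorpus.
path = hf_hub_download(repo_id="jxm/cde-small-v1", filename="random_strings.txt")
with open(path) as f:
    random_strings = [line.strip() for line in f if line.strip()]

# Assumes the file has at least `minicorpus_size` lines; oversample if it doesn't.
minicorpus_docs = random_strings[:minicorpus_size]
# ...then tokenize and embed these exactly as in the "First stage" section
# to produce `dataset_embeddings`.
```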

## With Sentence Transformers

<details open="">

@@ -8832,110 +8937,6 @@ Top Document: Foster's Home for Imaginary Friends McCracken conceived the series

</details>

### Colab demo

We've set up a short demo in a Colab notebook showing how you might use our model: