Create README.md

Browse files

Files changed (1) hide show

README.md +102 -0

README.md ADDED Viewed

	@@ -0,0 +1,102 @@

+---
+datasets:
+- sentence-transformers/embedding-training-data
+- flax-sentence-embeddings/stackexchange_xml
+- snli
+- eli5
+- search_qa
+- multi_nli
+- wikihow
+- natural_questions
+- trivia_qa
+- ms_marco
+- gooaq
+- yahoo_answers_topics
+language:
+- en
+inference: false
+pipeline_tag: sentence-similarity
+task_categories:
+  - sentence-similarity
+  - feature-extraction
+  - text-retrieval
+tags:
+  - information retrieval
+  - ir
+  - documents retrieval
+  - passage retrieval
+  - beir
+  - benchmark
+  - sts
+  - semantic search
+  - sentence-transformers
+  - feature-extraction
+  - sentence-similarity
+  - transformers
+---
+# bert-base-1024-biencoder-64M-pairs
+A long context biencoder based on [MosaicML's BERT pretrained on 1024 sequence length](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024). This model maps sentences & paragraphs to a 768 dimensional dense vector space
+and can be used for tasks like clustering or semantic search.
+## Usage
+### Download the model and related scripts
+```git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs```
+### Inference
+```python
+import torch
+from torch import nn
+from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline, AutoModel
+from mosaic_bert import BertModel
+# pip install triton==2.0.0.dev20221202 --no-deps if using Pytorch 2.0
+class AutoModelForSentenceEmbedding(nn.Module):
+    def __init__(self, model, tokenizer, normalize=True):
+        super(AutoModelForSentenceEmbedding, self).__init__()
+        self.model = model.to("cuda")
+        self.normalize = normalize
+        self.tokenizer = tokenizer
+    def forward(self, **kwargs):
+        model_output = self.model(**kwargs)
+        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
+        if self.normalize:
+            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
+        return embeddings
+    def mean_pooling(self, model_output, attention_mask):
+        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
+model = AutoModelForSentenceEmbedding(model, tokenizer)
+tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+sentences = ["This is an example sentence", "Each sentence is converted"]
+encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
+embeddings = model(**encoded_input)
+print(embeddings)
+print(embeddings.shape)
+```
+## Other details
+### Training
+This model has been trained on 64M randomly sampled pairs of sentences/paragraphs from the same training set that Sentence Transformers models use. Details of the
+training set [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#training-data).
+The training (along with hyperparameters), inference and data loading scripts can all be found in [this Github repository](https://github.com/shreyansh26/Long-Context-Biencoder).
+### Evaluations
+We ran the model on a few retrieval based benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval) and the results are [here](https://github.com/shreyansh26/Long-Context-Biencoder/tree/master/models/results/64M_results).