datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- UniHGKR
widget: []
UniHGKR-base-beir
Our paper: UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers.
The UniHGKR-base-beir model is derived from the UniHGKR-base model and further fine-tuned on MS MARCO for evaluation on the BEIR benchmark. We recommend using the sentence-transformers package to load the model and to embed sentences and paragraphs.
It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Evaluation on BEIR
The evaluation code can be found at https://github.com/ZhishanQ/UniHGKR.
Model Details
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
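To make the last two modules concrete, here is a minimal pure-Python sketch of what CLS pooling followed by L2 normalization does to the per-token outputs of the Transformer. The function name `cls_pool_and_normalize` and the toy 4-dimensional vectors are illustrative assumptions and are not part of the library; the real model produces 768-dimensional vectors.

```python
import math

def cls_pool_and_normalize(token_embeddings):
    """Hypothetical helper: keep the first ([CLS]) token vector and scale it to unit L2 norm."""
    cls_vector = token_embeddings[0]  # CLS pooling uses only the first token
    norm = math.sqrt(sum(x * x for x in cls_vector))
    if norm == 0.0:
        # Avoid division by zero; an all-zero vector is returned unchanged
        return list(cls_vector)
    return [x / norm for x in cls_vector]

# Toy input: 3 tokens with 4-dimensional embeddings (made-up numbers)
toy_tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]
sentence_vector = cls_pool_and_normalize(toy_tokens)
print(sentence_vector)
```

Because the output has unit length, the dot product of two such vectors is equal to their cosine similarity.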
Usage
For best performance, prepend one of the following instructions to each input text:
general_ins = "Given a question, retrieve relevant evidence that can answer the question from all knowledge sources:"
single_source_inst = "Given a question, retrieve relevant evidence that can answer the question from Text sources:"
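As a small sketch of how an instruction can be attached to the input, the helper below joins the instruction and the text with a single space, which is the format used in the usage example on this card. The helper name `add_instruction` and the sample question are hypothetical and only for illustration.

```python
GENERAL_INS = (
    "Given a question, retrieve relevant evidence that can answer "
    "the question from all knowledge sources:"
)
SINGLE_SOURCE_INST = (
    "Given a question, retrieve relevant evidence that can answer "
    "the question from Text sources:"
)

def add_instruction(texts, instruction=SINGLE_SOURCE_INST):
    """Hypothetical helper: prefix every text with the instruction and one space."""
    return [f"{instruction} {text}" for text in texts]

# Made-up example question
questions = ["Who wrote Hamlet?"]
with_text_inst = add_instruction(questions)
with_general_inst = add_instruction(questions, instruction=GENERAL_INS)
print(with_text_inst[0])
```

The resulting strings can be passed to `model.encode` in place of the raw texts.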
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("ZhishanQ/UniHGKR-base-beir")
# Run inference
general_ins = "Given a question, retrieve relevant evidence that can answer the question from all knowledge sources:"
single_source_inst = "Given a question, retrieve relevant evidence that can answer the question from Text sources:"
sentences = [
'The weather is lovely today.',
"It's so sunny outside!",
'He drove to the stadium.',
]
# Prepend each sentence with the instruction
updated_sentences = [f"{single_source_inst} {sentence}" for sentence in sentences]
embeddings = model.encode(updated_sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
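For semantic search, similarity scores are typically used to rank candidate passages against a query. The sketch below shows that ranking step on made-up, already-normalized 2-dimensional vectors, so it runs without downloading the model; `rank_by_similarity` is a hypothetical helper, and with real embeddings the vectors would come from `model.encode`.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query_vec, doc_vecs):
    """Hypothetical helper: return (doc_index, score) pairs, highest score first."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy unit-length vectors standing in for model embeddings
query = [1.0, 0.0]
docs = [
    [0.0, 1.0],  # orthogonal to the query
    [0.6, 0.8],  # partially aligned
    [1.0, 0.0],  # identical direction
]
ranking = rank_by_similarity(query, docs)
print([index for index, _ in ranking])
# [2, 1, 0]
```

Since the model's final module normalizes its outputs, a plain dot product gives the same ordering as cosine similarity.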
Training Details
Framework Versions
- Python: 3.8.10
- Sentence Transformers: 3.0.1
- Transformers: 4.44.2
- PyTorch: 2.0.0+cu118
- Accelerate: 0.34.0
- Datasets: 2.21.0
- Tokenizers: 0.19.1
Sentence Transformers Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Citation
If you find this resource useful in your research, please consider liking the model and citing our paper.
@article{min2024unihgkr,
title={UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers},
author={Min, Dehai and Xu, Zhiyang and Qi, Guilin and Huang, Lifu and You, Chenyu},
journal={arXiv preprint arXiv:2410.20163},
year={2024}
}