---
language:
- zh
- en
base_model: openbmb/MiniCPM-2B-dpo-bf16
---
## RankCPM-R
**RankCPM-R** is a bilingual and cross-lingual text re-ranking model developed by ModelBest Inc. and the Tsinghua University Natural Language Processing Lab (THUNLP), featuring:

- Exceptional Chinese and English re-ranking capabilities.
- Outstanding cross-lingual re-ranking capabilities between Chinese and English.

RankCPM-R is trained from [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and uses bidirectional attention in its architecture. The model underwent multi-stage training on approximately 6 million training examples drawn from open-source, synthetic, and proprietary data.

We also invite you to explore the RAG toolkit series:

- Retrieval Model: [RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
- Re-ranking Model: [RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
- LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
## Model Information
- Model Size: 2.4B
- Max Input Tokens: 1024
## Usage
### Input Format
RankCPM-R supports instructions in the following format:
```
<s>Instruction: {{ instruction }} Query: {{ query }}</s>{{ document }}
```
For example:
```
<s>Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?</s>(文档省略)
```
```
<s>Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.</s>(document omitted)
```
RankCPM-R also works without an instruction, using the following format:
```
<s>Query: {{ query }}</s>{{ document }}
```
For evaluation on BEIR and C-MTEB/Retrieval, we use the instructions in `instructions.json`. Other evaluations do not use instructions.
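
The two templates above can be assembled with a small helper. This is only a minimal sketch: `build_query_text` is a hypothetical name that does not appear in the released code, and it assumes (as the demo script does) that `<s>` and `</s>` are added as token IDs at tokenization time rather than typed into the string.

```python
from typing import Optional


def build_query_text(query: str, instruction: Optional[str] = None) -> str:
    """Return the query-side text that sits between <s> and </s>.

    Hypothetical helper: follows the templates documented above, with a
    single space between the instruction and the "Query:" marker.
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


# Instruction-free mode.
print(build_query_text("What causes throat cancer?"))

# Instruction mode (the instruction text here is illustrative only).
print(build_query_text(
    "What causes throat cancer?",
    instruction="Retrieve relevant answers for this medical question.",
))
```

The returned string can be passed as the `query` argument wherever the query side is tokenized; the document is tokenized separately and appended after `</s>`.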
### Requirements
```
transformers==4.37.2
flash-attn>2.3.5
```
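
Before loading the model, it may help to confirm that the installed packages match the pins above. The following is a rough sketch, not part of the released code: `version_tuple` and `check_environment` are hypothetical helpers, and the comparison is deliberately simplified (it looks only at the leading numeric components and ignores suffixes such as `.post1` or `.dev0`).

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '2.5.9.post1' into (2, 5, 9); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment() -> dict:
    """Map each pinned package to (installed_version, satisfies_pin)."""
    pins = (
        ("transformers", lambda v: version_tuple(v) == (4, 37, 2)),
        ("flash-attn", lambda v: version_tuple(v) > (2, 3, 5)),
    )
    report = {}
    for package, satisfies in pins:
        try:
            installed = metadata.version(package)
            report[package] = (installed, satisfies(installed))
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[package] = (None, False)
    return report


print(check_environment())
```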
### Demo
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

model_name = "openbmb/RankCPM-R"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")
model.eval()

# Token budgets for the query side and the document side.
max_len_q, max_len_d = 512, 512


def tokenize_our(query, doc):
    # Tokenize query and document separately so each is truncated to its own budget.
    input_id_query = tokenizer.encode(query, add_special_tokens=False, max_length=max_len_q, truncation=True)
    input_id_doc = tokenizer.encode(doc, add_special_tokens=False, max_length=max_len_d, truncation=True)
    # Layout: <s> query </s> document
    pad_input = {"input_ids": [tokenizer.bos_token_id] + input_id_query + [tokenizer.eos_token_id] + input_id_doc}
    return tokenizer.pad(
        pad_input,
        padding="max_length",
        max_length=max_len_q + max_len_d + 2,  # +2 for <s> and </s>
        return_tensors="pt",
    )


@torch.no_grad()
def rerank(input_query, input_docs):
    # Build one padded (query, document) input per candidate document.
    tokenized_inputs = [tokenize_our(input_query, input_doc).to("cuda") for input_doc in input_docs]
    input_ids = {
        "input_ids": [tokenized_input["input_ids"] for tokenized_input in tokenized_inputs],
        "attention_mask": [tokenized_input["attention_mask"] for tokenized_input in tokenized_inputs],
    }
    # Stack the per-document tensors into a single batch.
    for k in input_ids:
        input_ids[k] = torch.stack(input_ids[k]).to("cuda")
    outputs = model(**input_ids)
    score = outputs.logits
    return score.float().detach().cpu().numpy()


queries = ["中国的首都是哪里?"]  # "What is the capital of China?"
passages = [["beijing", "shanghai"]]

# Instruction-free mode: only the "Query: " prefix is prepended.
INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]

scores = []
for i in range(len(queries)):
    print(queries[i])
    scores.append(rerank(queries[i], passages[i]))

print(np.array(scores))  # [[[-4.7421875][-8.8515625]]]
```
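
The demo returns one raw score per passage; to obtain an actual ranking, the passages still need to be sorted by that score. The sketch below shows one way to do this. `rank_passages` is a hypothetical helper, and it assumes that a higher logit means a more relevant passage, which is consistent with the demo output (`beijing` scores higher than `shanghai`).

```python
def rank_passages(passages, scores):
    """Pair each passage with its score and sort from most to least relevant.

    `scores` may be a flat sequence of numbers or a sequence of length-1 rows,
    such as the (n, 1) array returned by `rerank` in the demo.
    """
    flat = [float(s[0]) if hasattr(s, "__len__") else float(s) for s in scores]
    if len(flat) != len(passages):
        raise ValueError("expected exactly one score per passage")
    # Assumption: higher score = more relevant, so sort in descending order.
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    return [(passages[i], flat[i]) for i in order]


# Scores copied from the demo output above.
example_scores = [[-4.7421875], [-8.8515625]]
for passage, score in rank_passages(["beijing", "shanghai"], example_scores):
    print(f"{score:.4f}\t{passage}")
```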
## Evaluation Results
### Chinese and English Re-ranking Results
We re-rank the top-100 documents retrieved by `bge-large-zh-v1.5` on C-MTEB/Retrieval and by `bge-large-en-v1.5` on BEIR.

| Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|----------------------------|-------------------|---------------|
| bge-large-zh-v1.5 (Retriever for Chinese) | 70.46 | - |
| bge-large-en-v1.5 (Retriever for English) | - | 54.29 |
| bge-reranker-v2-m3 | 71.82 | 55.36 |
| bge-reranker-v2-minicpm-28 | 73.51 | 59.86 |
| bge-reranker-v2-gemma | 71.74 | 60.71 |
| bge-reranker-v2.5-gemma2 | - | **63.67** |
| RankCPM-R | **76.79** | 61.32 |
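
For readers unfamiliar with the metric in the table, NDCG@10 can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script behind the reported numbers: it uses linear gain with a log2 rank discount, which is one common convention, and the function names and toy relevance judgments are hypothetical. The table values appear to be scaled by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document IDs in the order produced by the re-ranker.
    qrels: dict mapping document ID -> graded relevance (missing IDs count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: two relevant documents out of three candidates.
qrels = {"d1": 1, "d3": 1}
print(ndcg_at_k(["d1", "d3", "d2"], qrels))  # both relevant documents ranked first
print(ndcg_at_k(["d2", "d3", "d1"], qrels))  # relevant documents pushed down
```

A benchmark score is then the mean of this per-query value over all queries in the dataset.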
### Chinese-English Cross-lingual Re-ranking Results
We re-rank the top-100 documents retrieved by `bge-m3` (Dense).

| Model | MKQA EN-CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|------------------------------------|--------------------|--------------------|--------------------|
| bge-m3 (Dense) (Retriever) | 66.4 | 30.49 | 41.09 |
| jina-reranker-v2-base-multilingual | 69.33 | 36.66 | 50.03 |
| bge-reranker-v2-m3 | 69.75 | 40.98 | 49.67 |
| gte-multilingual-reranker-base | 68.51 | 38.74 | 45.3 |
| RankCPM-R | **71.73** | **43.65** | **50.59** |
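
Recall@20, used in the MKQA column, can be illustrated with the standard set-based definition below. This is a hedged sketch: `recall_at_k` is a hypothetical helper, and the MKQA evaluation behind the table may define recall differently (for example, by checking whether the top-k passages contain an answer string), so treat it as an explanation of the general idea rather than a reproduction of the reported setup.

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(ranked_doc_ids[:k]))
    return hits / len(relevant)


# Toy example: one of the two relevant documents is inside the top 2.
print(recall_at_k(["a", "b", "c"], ["a", "c"], k=2))
```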
## License
- The code in this repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
- Use of the RankCPM-R model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
- The RankCPM-R models and weights are completely free for academic research. After filling out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, the RankCPM-R weights are also available for free commercial use.