---
tags:
- mteb
- qihoo360
- 奇虎360
- RAG-retrieval
model-index:
- name: 360Zhinao_search
  results:
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv1-reranking
      name: MTEB CMedQAv1
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 87.004722953844
    - type: mrr
      value: 89.34686507936507
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv2-reranking
      name: MTEB CMedQAv2
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 88.48306990136507
    - type: mrr
      value: 90.57761904761904
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/Mmarco-reranking
      name: MTEB MMarcoReranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 32.40909999537645
    - type: mrr
      value: 31.48690476190476
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/T2Reranking
      name: MTEB T2Reranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 67.80300509862872
    - type: mrr
      value: 78.14543234355354
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CmedqaRetrieval
      name: MTEB CmedqaRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 27.171
    - type: map_at_10
      value: 40.109
    - type: map_at_100
      value: 41.937999999999995
    - type: map_at_1000
      value: 42.051
    - type: map_at_3
      value: 35.882999999999996
    - type: map_at_5
      value: 38.22
    - type: mrr_at_1
      value: 41.285
    - type: mrr_at_10
      value: 49.247
    - type: mrr_at_100
      value: 50.199000000000005
    - type: mrr_at_1000
      value: 50.245
    - type: mrr_at_3
      value: 46.837
    - type: mrr_at_5
      value: 48.223
    - type: ndcg_at_1
      value: 41.285
    - type: ndcg_at_10
      value: 46.727000000000004
    - type: ndcg_at_100
      value: 53.791
    - type: ndcg_at_1000
      value: 55.706
    - type: ndcg_at_3
      value: 41.613
    - type: ndcg_at_5
      value: 43.702999999999996
    - type: precision_at_1
      value: 41.285
    - type: precision_at_10
      value: 10.34
    - type: precision_at_100
      value: 1.6019999999999999
    - type: precision_at_1000
      value: 0.184
    - type: precision_at_3
      value: 23.423
    - type: precision_at_5
      value: 16.914
    - type: recall_at_1
      value: 27.171
    - type: recall_at_10
      value: 57.04900000000001
    - type: recall_at_100
      value: 86.271
    - type: recall_at_1000
      value: 99.02300000000001
    - type: recall_at_3
      value: 41.528
    - type: recall_at_5
      value: 48.162
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CovidRetrieval
      name: MTEB CovidRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 73.762
    - type: map_at_10
      value: 81.663
    - type: map_at_100
      value: 81.87100000000001
    - type: map_at_1000
      value: 81.877
    - type: map_at_3
      value: 80.10199999999999
    - type: map_at_5
      value: 81.162
    - type: mrr_at_1
      value: 74.078
    - type: mrr_at_10
      value: 81.745
    - type: mrr_at_100
      value: 81.953
    - type: mrr_at_1000
      value: 81.959
    - type: mrr_at_3
      value: 80.25999999999999
    - type: mrr_at_5
      value: 81.266
    - type: ndcg_at_1
      value: 73.973
    - type: ndcg_at_10
      value: 85.021
    - type: ndcg_at_100
      value: 85.884
    - type: ndcg_at_1000
      value: 86.02300000000001
    - type: ndcg_at_3
      value: 82.03399999999999
    - type: ndcg_at_5
      value: 83.905
    - type: precision_at_1
      value: 73.973
    - type: precision_at_10
      value: 9.631
    - type: precision_at_100
      value: 1
    - type: precision_at_1000
      value: 0.101
    - type: precision_at_3
      value: 29.329
    - type: precision_at_5
      value: 18.546000000000003
    - type: recall_at_1
      value: 73.762
    - type: recall_at_10
      value: 95.258
    - type: recall_at_100
      value: 98.946
    - type: recall_at_1000
      value: 100
    - type: recall_at_3
      value: 87.46000000000001
    - type: recall_at_5
      value: 91.93900000000001
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/DuRetrieval
      name: MTEB DuRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 25.967000000000002
    - type: map_at_10
      value: 79.928
    - type: map_at_100
      value: 82.76400000000001
    - type: map_at_1000
      value: 82.794
    - type: map_at_3
      value: 54.432
    - type: map_at_5
      value: 69.246
    - type: mrr_at_1
      value: 89
    - type: mrr_at_10
      value: 92.81
    - type: mrr_at_100
      value: 92.857
    - type: mrr_at_1000
      value: 92.86
    - type: mrr_at_3
      value: 92.467
    - type: mrr_at_5
      value: 92.67699999999999
    - type: ndcg_at_1
      value: 89
    - type: ndcg_at_10
      value: 87.57000000000001
    - type: ndcg_at_100
      value: 90.135
    - type: ndcg_at_1000
      value: 90.427
    - type: ndcg_at_3
      value: 84.88900000000001
    - type: ndcg_at_5
      value: 84.607
    - type: precision_at_1
      value: 89
    - type: precision_at_10
      value: 42.245
    - type: precision_at_100
      value: 4.8340000000000005
    - type: precision_at_1000
      value: 0.49
    - type: precision_at_3
      value: 75.883
    - type: precision_at_5
      value: 64.88000000000001
    - type: recall_at_1
      value: 25.967000000000002
    - type: recall_at_10
      value: 89.79599999999999
    - type: recall_at_100
      value: 98.042
    - type: recall_at_1000
      value: 99.61
    - type: recall_at_3
      value: 57.084
    - type: recall_at_5
      value: 74.763
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/EcomRetrieval
      name: MTEB EcomRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 53.6
    - type: map_at_10
      value: 63.94800000000001
    - type: map_at_100
      value: 64.37899999999999
    - type: map_at_1000
      value: 64.39200000000001
    - type: map_at_3
      value: 61.683
    - type: map_at_5
      value: 63.078
    - type: mrr_at_1
      value: 53.6
    - type: mrr_at_10
      value: 63.94800000000001
    - type: mrr_at_100
      value: 64.37899999999999
    - type: mrr_at_1000
      value: 64.39200000000001
    - type: mrr_at_3
      value: 61.683
    - type: mrr_at_5
      value: 63.078
    - type: ndcg_at_1
      value: 53.6
    - type: ndcg_at_10
      value: 68.904
    - type: ndcg_at_100
      value: 71.019
    - type: ndcg_at_1000
      value: 71.345
    - type: ndcg_at_3
      value: 64.30799999999999
    - type: ndcg_at_5
      value: 66.8
    - type: precision_at_1
      value: 53.6
    - type: precision_at_10
      value: 8.44
    - type: precision_at_100
      value: 0.943
    - type: precision_at_1000
      value: 0.097
    - type: precision_at_3
      value: 23.967
    - type: precision_at_5
      value: 15.58
    - type: recall_at_1
      value: 53.6
    - type: recall_at_10
      value: 84.39999999999999
    - type: recall_at_100
      value: 94.3
    - type: recall_at_1000
      value: 96.8
    - type: recall_at_3
      value: 71.89999999999999
    - type: recall_at_5
      value: 77.9
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MMarcoRetrieval
      name: MTEB MMarcoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 71.375
    - type: map_at_10
      value: 80.05600000000001
    - type: map_at_100
      value: 80.28699999999999
    - type: map_at_1000
      value: 80.294
    - type: map_at_3
      value: 78.479
    - type: map_at_5
      value: 79.51899999999999
    - type: mrr_at_1
      value: 73.739
    - type: mrr_at_10
      value: 80.535
    - type: mrr_at_100
      value: 80.735
    - type: mrr_at_1000
      value: 80.742
    - type: mrr_at_3
      value: 79.212
    - type: mrr_at_5
      value: 80.059
    - type: ndcg_at_1
      value: 73.739
    - type: ndcg_at_10
      value: 83.321
    - type: ndcg_at_100
      value: 84.35000000000001
    - type: ndcg_at_1000
      value: 84.542
    - type: ndcg_at_3
      value: 80.401
    - type: ndcg_at_5
      value: 82.107
    - type: precision_at_1
      value: 73.739
    - type: precision_at_10
      value: 9.878
    - type: precision_at_100
      value: 1.039
    - type: precision_at_1000
      value: 0.106
    - type: precision_at_3
      value: 30.053
    - type: precision_at_5
      value: 18.953999999999997
    - type: recall_at_1
      value: 71.375
    - type: recall_at_10
      value: 92.84599999999999
    - type: recall_at_100
      value: 97.49799999999999
    - type: recall_at_1000
      value: 98.992
    - type: recall_at_3
      value: 85.199
    - type: recall_at_5
      value: 89.22
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MedicalRetrieval
      name: MTEB MedicalRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 55.60000000000001
    - type: map_at_10
      value: 61.035
    - type: map_at_100
      value: 61.541999999999994
    - type: map_at_1000
      value: 61.598
    - type: map_at_3
      value: 59.683
    - type: map_at_5
      value: 60.478
    - type: mrr_at_1
      value: 55.60000000000001
    - type: mrr_at_10
      value: 61.035
    - type: mrr_at_100
      value: 61.541999999999994
    - type: mrr_at_1000
      value: 61.598
    - type: mrr_at_3
      value: 59.683
    - type: mrr_at_5
      value: 60.478
    - type: ndcg_at_1
      value: 55.60000000000001
    - type: ndcg_at_10
      value: 63.686
    - type: ndcg_at_100
      value: 66.417
    - type: ndcg_at_1000
      value: 67.92399999999999
    - type: ndcg_at_3
      value: 60.951
    - type: ndcg_at_5
      value: 62.388
    - type: precision_at_1
      value: 55.60000000000001
    - type: precision_at_10
      value: 7.199999999999999
    - type: precision_at_100
      value: 0.8540000000000001
    - type: precision_at_1000
      value: 0.097
    - type: precision_at_3
      value: 21.532999999999998
    - type: precision_at_5
      value: 13.62
    - type: recall_at_1
      value: 55.60000000000001
    - type: recall_at_10
      value: 72
    - type: recall_at_100
      value: 85.39999999999999
    - type: recall_at_1000
      value: 97.3
    - type: recall_at_3
      value: 64.60000000000001
    - type: recall_at_5
      value: 68.10000000000001
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/T2Retrieval
      name: MTEB T2Retrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 28.314
    - type: map_at_10
      value: 80.268
    - type: map_at_100
      value: 83.75399999999999
    - type: map_at_1000
      value: 83.80499999999999
    - type: map_at_3
      value: 56.313
    - type: map_at_5
      value: 69.336
    - type: mrr_at_1
      value: 91.96
    - type: mrr_at_10
      value: 93.926
    - type: mrr_at_100
      value: 94
    - type: mrr_at_1000
      value: 94.003
    - type: mrr_at_3
      value: 93.587
    - type: mrr_at_5
      value: 93.804
    - type: ndcg_at_1
      value: 91.96
    - type: ndcg_at_10
      value: 87.12299999999999
    - type: ndcg_at_100
      value: 90.238
    - type: ndcg_at_1000
      value: 90.723
    - type: ndcg_at_3
      value: 88.347
    - type: ndcg_at_5
      value: 87.095
    - type: precision_at_1
      value: 91.96
    - type: precision_at_10
      value: 43.257
    - type: precision_at_100
      value: 5.064
    - type: precision_at_1000
      value: 0.517
    - type: precision_at_3
      value: 77.269
    - type: precision_at_5
      value: 64.89
    - type: recall_at_1
      value: 28.314
    - type: recall_at_10
      value: 85.917
    - type: recall_at_100
      value: 96.297
    - type: recall_at_1000
      value: 98.802
    - type: recall_at_3
      value: 57.75900000000001
    - type: recall_at_5
      value: 72.287
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/VideoRetrieval
      name: MTEB VideoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 65.60000000000001
    - type: map_at_10
      value: 74.502
    - type: map_at_100
      value: 74.864
    - type: map_at_1000
      value: 74.875
    - type: map_at_3
      value: 73.3
    - type: map_at_5
      value: 74.07000000000001
    - type: mrr_at_1
      value: 65.60000000000001
    - type: mrr_at_10
      value: 74.502
    - type: mrr_at_100
      value: 74.864
    - type: mrr_at_1000
      value: 74.875
    - type: mrr_at_3
      value: 73.3
    - type: mrr_at_5
      value: 74.07000000000001
    - type: ndcg_at_1
      value: 65.60000000000001
    - type: ndcg_at_10
      value: 78.091
    - type: ndcg_at_100
      value: 79.838
    - type: ndcg_at_1000
      value: 80.10199999999999
    - type: ndcg_at_3
      value: 75.697
    - type: ndcg_at_5
      value: 77.07000000000001
    - type: precision_at_1
      value: 65.60000000000001
    - type: precision_at_10
      value: 8.9
    - type: precision_at_100
      value: 0.971
    - type: precision_at_1000
      value: 0.099
    - type: precision_at_3
      value: 27.533
    - type: precision_at_5
      value: 17.18
    - type: recall_at_1
      value: 65.60000000000001
    - type: recall_at_10
      value: 89
    - type: recall_at_100
      value: 97.1
    - type: recall_at_1000
      value: 99.1
    - type: recall_at_3
      value: 82.6
    - type: recall_at_5
      value: 85.9
license: apache-2.0
library_name: transformers
---

# Model Introduction

360Zhinao-search is built by multi-task fine-tuning of a self-developed BERT model. It achieves an average score of 75.05 on the Retrieval task of the C-MTEB benchmark, currently ranking first.

The [C-MTEB-Retrieval leaderboard](https://huggingface.co/spaces/mteb/leaderboard) covers 8 [query, passage] similarity retrieval subtasks across different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) as the evaluation metric.

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| **360Zhinao-search** | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | **75.05** |
| AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
| OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
| piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
| stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |

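
For readers unfamiliar with the metric, the sketch below shows how NDCG@10 can be computed for a single query. It is a minimal illustration using the common linear-gain formulation, not the leaderboard's evaluation code; the function names (`dcg_at_k`, `ndcg_at_k`) and the toy relevance labels are assumptions made for this example. Scores like those in the table correspond to this value averaged over all queries and multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (rank 1 comes first)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance label of each retrieved passage, in ranked order.
    all_relevances: relevance labels of every judged-relevant passage for the
        query, used to build the ideal ranking (hypothetical input format).
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy query with two relevant passages, retrieved at ranks 1 and 3.
print(round(ndcg_at_k([1, 0, 1, 0, 0], [1, 1]), 4))  # 0.9197
```

Evaluation toolkits can differ in details such as the gain function or tie handling, so this sketch is meant to convey the idea rather than reproduce reported numbers exactly.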

## Optimization points

1. Data filtering: strictly prevent leakage of the C-MTEB-Retrieval test data by removing all test-set queries and passages from the training data.
2. Data source enhancement: use open-source data and LLM-synthesized data to improve data diversity.
3. Negative example mining: use multiple methods to mine hard negatives (negatives that are difficult to distinguish from positives) and so increase information gain.
4. Training efficiency: multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.

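
The model card does not say which negative-mining methods were used. As an illustration of the general idea only, the sketch below shows one common approach: score candidate passages with an existing retriever and keep the highest-scoring passages that are not labeled positive. The function name `mine_hard_negatives`, its parameters, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Set


def mine_hard_negatives(
    scores: Dict[str, float],
    positive_ids: Set[str],
    num_negatives: int = 2,
    skip_top: int = 0,
) -> List[str]:
    """Pick the highest-scoring non-positive passages as hard negatives.

    scores: passage id -> similarity to the query from some existing retriever.
    skip_top: number of top-ranked non-positive candidates to skip, a simple
        guard against unlabeled positives (false negatives).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    candidates = [pid for pid in ranked if pid not in positive_ids]
    return candidates[skip_top:skip_top + num_negatives]


# Toy similarity scores for one query; "p1" is the labeled positive.
scores = {"p1": 0.92, "p2": 0.88, "p3": 0.85, "p4": 0.40, "p5": 0.10}
print(mine_hard_negatives(scores, positive_ids={"p1"}))  # ['p2', 'p3']
```

Passages that score close to the positive are the ones the model finds hardest to separate, which is why they tend to carry more training signal than random negatives.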

## Usage

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the embedding model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')

# A query and a candidate passage ("What color is the sky" / "The sky is blue").
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":
    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
        # Use the first ([CLS]) token as the sentence embedding, then L2-normalize it.
        embeddings = last_hidden_state[:, 0]
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        embeddings = embeddings.cpu().numpy()

    print("embeddings:")
    print(embeddings)

    # The embeddings are unit-length, so their dot product is the cosine similarity.
    cos_sim = np.dot(embeddings[0], embeddings[1])
    print("cos_sim:", cos_sim)
```

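
In a retrieval setting, the same normalized embeddings are typically used to rank many passages against one query. The sketch below illustrates that step with NumPy on toy vectors, so it runs without downloading the model; `rank_passages` is a hypothetical helper written for this example and is not part of the model's API.

```python
from typing import List, Tuple

import numpy as np


def rank_passages(
    query_emb: np.ndarray, passage_embs: np.ndarray, top_k: int = 3
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k passages.

    Assumes all embeddings are already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = passage_embs @ query_emb
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional unit vectors standing in for real model embeddings.
query = np.array([1.0, 0.0])
passages = np.array([[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]])
results = rank_passages(query, passages, top_k=2)
print(results)
```

With real data, `query` would be the embedding of the query sentence and `passages` the stacked embeddings of the candidate passages, produced as in the usage example above.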

## Reference

[bge fine-tuning code](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)

[C-MTEB official test script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)


## License

The source code of this repository is released under the Apache 2.0 open-source license.

360Zhinao open-source models support commercial use. If you wish to use these models, or continue training them, for commercial purposes, please contact us via email ([email protected]) to apply. For the specific license agreement, please see the 360 Zhinao Open-Source Model License.