xiaowenbin committed on
Commit
74e5847
1 Parent(s): 9e4ab53

Upload README_zh.md

Files changed (1)
  1. README_zh.md +1332 -0
README_zh.md ADDED
@@ -0,0 +1,1332 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - mteb
+ model-index:
+ - name: Dmeta-embedding
+   results:
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/AFQMC
+       name: MTEB AFQMC
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 65.60825224706932
+     - type: cos_sim_spearman
+       value: 71.12862586297193
+     - type: euclidean_pearson
+       value: 70.18130275750404
+     - type: euclidean_spearman
+       value: 71.12862586297193
+     - type: manhattan_pearson
+       value: 70.14470398075396
+     - type: manhattan_spearman
+       value: 71.05226975911737
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/ATEC
+       name: MTEB ATEC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 65.52386345655479
+     - type: cos_sim_spearman
+       value: 64.64245253181382
+     - type: euclidean_pearson
+       value: 73.20157662981914
+     - type: euclidean_spearman
+       value: 64.64245253178956
+     - type: manhattan_pearson
+       value: 73.22837571756348
+     - type: manhattan_spearman
+       value: 64.62632334391418
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_reviews_multi
+       name: MTEB AmazonReviewsClassification (zh)
+       config: zh
+       split: test
+       revision: 1399c76144fd37290681b995c656ef9b2e06e26d
+     metrics:
+     - type: accuracy
+       value: 44.925999999999995
+     - type: f1
+       value: 42.82555191308971
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/BQ
+       name: MTEB BQ
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 71.35236446393156
+     - type: cos_sim_spearman
+       value: 72.29629643702184
+     - type: euclidean_pearson
+       value: 70.94570179874498
+     - type: euclidean_spearman
+       value: 72.29629297226953
+     - type: manhattan_pearson
+       value: 70.84463025501125
+     - type: manhattan_spearman
+       value: 72.24527021975821
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/CLSClusteringP2P
+       name: MTEB CLSClusteringP2P
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 40.24232916894152
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/CLSClusteringS2S
+       name: MTEB CLSClusteringS2S
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 39.167806226929706
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/CMedQAv1-reranking
+       name: MTEB CMedQAv1
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: map
+       value: 88.48837920106357
+     - type: mrr
+       value: 90.36861111111111
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/CMedQAv2-reranking
+       name: MTEB CMedQAv2
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: map
+       value: 89.17878171657071
+     - type: mrr
+       value: 91.35805555555555
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/CmedqaRetrieval
+       name: MTEB CmedqaRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 25.751
+     - type: map_at_10
+       value: 38.946
+     - type: map_at_100
+       value: 40.855000000000004
+     - type: map_at_1000
+       value: 40.953
+     - type: map_at_3
+       value: 34.533
+     - type: map_at_5
+       value: 36.905
+     - type: mrr_at_1
+       value: 39.235
+     - type: mrr_at_10
+       value: 47.713
+     - type: mrr_at_100
+       value: 48.71
+     - type: mrr_at_1000
+       value: 48.747
+     - type: mrr_at_3
+       value: 45.086
+     - type: mrr_at_5
+       value: 46.498
+     - type: ndcg_at_1
+       value: 39.235
+     - type: ndcg_at_10
+       value: 45.831
+     - type: ndcg_at_100
+       value: 53.162
+     - type: ndcg_at_1000
+       value: 54.800000000000004
+     - type: ndcg_at_3
+       value: 40.188
+     - type: ndcg_at_5
+       value: 42.387
+     - type: precision_at_1
+       value: 39.235
+     - type: precision_at_10
+       value: 10.273
+     - type: precision_at_100
+       value: 1.627
+     - type: precision_at_1000
+       value: 0.183
+     - type: precision_at_3
+       value: 22.772000000000002
+     - type: precision_at_5
+       value: 16.524
+     - type: recall_at_1
+       value: 25.751
+     - type: recall_at_10
+       value: 57.411
+     - type: recall_at_100
+       value: 87.44
+     - type: recall_at_1000
+       value: 98.386
+     - type: recall_at_3
+       value: 40.416000000000004
+     - type: recall_at_5
+       value: 47.238
+   - task:
+       type: PairClassification
+     dataset:
+       type: C-MTEB/CMNLI
+       name: MTEB Cmnli
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_accuracy
+       value: 83.59591100420926
+     - type: cos_sim_ap
+       value: 90.65538153970263
+     - type: cos_sim_f1
+       value: 84.76466651795673
+     - type: cos_sim_precision
+       value: 81.04073363190446
+     - type: cos_sim_recall
+       value: 88.84732288987608
+     - type: dot_accuracy
+       value: 83.59591100420926
+     - type: dot_ap
+       value: 90.64355541781003
+     - type: dot_f1
+       value: 84.76466651795673
+     - type: dot_precision
+       value: 81.04073363190446
+     - type: dot_recall
+       value: 88.84732288987608
+     - type: euclidean_accuracy
+       value: 83.59591100420926
+     - type: euclidean_ap
+       value: 90.6547878194287
+     - type: euclidean_f1
+       value: 84.76466651795673
+     - type: euclidean_precision
+       value: 81.04073363190446
+     - type: euclidean_recall
+       value: 88.84732288987608
+     - type: manhattan_accuracy
+       value: 83.51172579675286
+     - type: manhattan_ap
+       value: 90.59941589844144
+     - type: manhattan_f1
+       value: 84.51827242524917
+     - type: manhattan_precision
+       value: 80.28613507258574
+     - type: manhattan_recall
+       value: 89.22141688099134
+     - type: max_accuracy
+       value: 83.59591100420926
+     - type: max_ap
+       value: 90.65538153970263
+     - type: max_f1
+       value: 84.76466651795673
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/CovidRetrieval
+       name: MTEB CovidRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 63.251000000000005
+     - type: map_at_10
+       value: 72.442
+     - type: map_at_100
+       value: 72.79299999999999
+     - type: map_at_1000
+       value: 72.80499999999999
+     - type: map_at_3
+       value: 70.293
+     - type: map_at_5
+       value: 71.571
+     - type: mrr_at_1
+       value: 63.541000000000004
+     - type: mrr_at_10
+       value: 72.502
+     - type: mrr_at_100
+       value: 72.846
+     - type: mrr_at_1000
+       value: 72.858
+     - type: mrr_at_3
+       value: 70.39
+     - type: mrr_at_5
+       value: 71.654
+     - type: ndcg_at_1
+       value: 63.541000000000004
+     - type: ndcg_at_10
+       value: 76.774
+     - type: ndcg_at_100
+       value: 78.389
+     - type: ndcg_at_1000
+       value: 78.678
+     - type: ndcg_at_3
+       value: 72.47
+     - type: ndcg_at_5
+       value: 74.748
+     - type: precision_at_1
+       value: 63.541000000000004
+     - type: precision_at_10
+       value: 9.115
+     - type: precision_at_100
+       value: 0.9860000000000001
+     - type: precision_at_1000
+       value: 0.101
+     - type: precision_at_3
+       value: 26.379
+     - type: precision_at_5
+       value: 16.965
+     - type: recall_at_1
+       value: 63.251000000000005
+     - type: recall_at_10
+       value: 90.253
+     - type: recall_at_100
+       value: 97.576
+     - type: recall_at_1000
+       value: 99.789
+     - type: recall_at_3
+       value: 78.635
+     - type: recall_at_5
+       value: 84.141
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/DuRetrieval
+       name: MTEB DuRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 23.597
+     - type: map_at_10
+       value: 72.411
+     - type: map_at_100
+       value: 75.58500000000001
+     - type: map_at_1000
+       value: 75.64800000000001
+     - type: map_at_3
+       value: 49.61
+     - type: map_at_5
+       value: 62.527
+     - type: mrr_at_1
+       value: 84.65
+     - type: mrr_at_10
+       value: 89.43900000000001
+     - type: mrr_at_100
+       value: 89.525
+     - type: mrr_at_1000
+       value: 89.529
+     - type: mrr_at_3
+       value: 89
+     - type: mrr_at_5
+       value: 89.297
+     - type: ndcg_at_1
+       value: 84.65
+     - type: ndcg_at_10
+       value: 81.47
+     - type: ndcg_at_100
+       value: 85.198
+     - type: ndcg_at_1000
+       value: 85.828
+     - type: ndcg_at_3
+       value: 79.809
+     - type: ndcg_at_5
+       value: 78.55
+     - type: precision_at_1
+       value: 84.65
+     - type: precision_at_10
+       value: 39.595
+     - type: precision_at_100
+       value: 4.707
+     - type: precision_at_1000
+       value: 0.485
+     - type: precision_at_3
+       value: 71.61699999999999
+     - type: precision_at_5
+       value: 60.45
+     - type: recall_at_1
+       value: 23.597
+     - type: recall_at_10
+       value: 83.34
+     - type: recall_at_100
+       value: 95.19800000000001
+     - type: recall_at_1000
+       value: 98.509
+     - type: recall_at_3
+       value: 52.744
+     - type: recall_at_5
+       value: 68.411
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/EcomRetrieval
+       name: MTEB EcomRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 53.1
+     - type: map_at_10
+       value: 63.359
+     - type: map_at_100
+       value: 63.9
+     - type: map_at_1000
+       value: 63.909000000000006
+     - type: map_at_3
+       value: 60.95
+     - type: map_at_5
+       value: 62.305
+     - type: mrr_at_1
+       value: 53.1
+     - type: mrr_at_10
+       value: 63.359
+     - type: mrr_at_100
+       value: 63.9
+     - type: mrr_at_1000
+       value: 63.909000000000006
+     - type: mrr_at_3
+       value: 60.95
+     - type: mrr_at_5
+       value: 62.305
+     - type: ndcg_at_1
+       value: 53.1
+     - type: ndcg_at_10
+       value: 68.418
+     - type: ndcg_at_100
+       value: 70.88499999999999
+     - type: ndcg_at_1000
+       value: 71.135
+     - type: ndcg_at_3
+       value: 63.50599999999999
+     - type: ndcg_at_5
+       value: 65.92
+     - type: precision_at_1
+       value: 53.1
+     - type: precision_at_10
+       value: 8.43
+     - type: precision_at_100
+       value: 0.955
+     - type: precision_at_1000
+       value: 0.098
+     - type: precision_at_3
+       value: 23.633000000000003
+     - type: precision_at_5
+       value: 15.340000000000002
+     - type: recall_at_1
+       value: 53.1
+     - type: recall_at_10
+       value: 84.3
+     - type: recall_at_100
+       value: 95.5
+     - type: recall_at_1000
+       value: 97.5
+     - type: recall_at_3
+       value: 70.89999999999999
+     - type: recall_at_5
+       value: 76.7
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/IFlyTek-classification
+       name: MTEB IFlyTek
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 48.303193535975375
+     - type: f1
+       value: 35.96559358693866
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/JDReview-classification
+       name: MTEB JDReview
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 85.06566604127579
+     - type: ap
+       value: 52.0596483757231
+     - type: f1
+       value: 79.5196835127668
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/LCQMC
+       name: MTEB LCQMC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 74.48499423626059
+     - type: cos_sim_spearman
+       value: 78.75806756061169
+     - type: euclidean_pearson
+       value: 78.47917601852879
+     - type: euclidean_spearman
+       value: 78.75807199272622
+     - type: manhattan_pearson
+       value: 78.40207586289772
+     - type: manhattan_spearman
+       value: 78.6911776964119
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/Mmarco-reranking
+       name: MTEB MMarcoReranking
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map
+       value: 24.75987466552363
+     - type: mrr
+       value: 23.40515873015873
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/MMarcoRetrieval
+       name: MTEB MMarcoRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 58.026999999999994
+     - type: map_at_10
+       value: 67.50699999999999
+     - type: map_at_100
+       value: 67.946
+     - type: map_at_1000
+       value: 67.96600000000001
+     - type: map_at_3
+       value: 65.503
+     - type: map_at_5
+       value: 66.649
+     - type: mrr_at_1
+       value: 60.20100000000001
+     - type: mrr_at_10
+       value: 68.271
+     - type: mrr_at_100
+       value: 68.664
+     - type: mrr_at_1000
+       value: 68.682
+     - type: mrr_at_3
+       value: 66.47800000000001
+     - type: mrr_at_5
+       value: 67.499
+     - type: ndcg_at_1
+       value: 60.20100000000001
+     - type: ndcg_at_10
+       value: 71.697
+     - type: ndcg_at_100
+       value: 73.736
+     - type: ndcg_at_1000
+       value: 74.259
+     - type: ndcg_at_3
+       value: 67.768
+     - type: ndcg_at_5
+       value: 69.72
+     - type: precision_at_1
+       value: 60.20100000000001
+     - type: precision_at_10
+       value: 8.927999999999999
+     - type: precision_at_100
+       value: 0.9950000000000001
+     - type: precision_at_1000
+       value: 0.104
+     - type: precision_at_3
+       value: 25.883
+     - type: precision_at_5
+       value: 16.55
+     - type: recall_at_1
+       value: 58.026999999999994
+     - type: recall_at_10
+       value: 83.966
+     - type: recall_at_100
+       value: 93.313
+     - type: recall_at_1000
+       value: 97.426
+     - type: recall_at_3
+       value: 73.342
+     - type: recall_at_5
+       value: 77.997
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_massive_intent
+       name: MTEB MassiveIntentClassification (zh-CN)
+       config: zh-CN
+       split: test
+       revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
+     metrics:
+     - type: accuracy
+       value: 71.1600537995965
+     - type: f1
+       value: 68.8126216609964
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_massive_scenario
+       name: MTEB MassiveScenarioClassification (zh-CN)
+       config: zh-CN
+       split: test
+       revision: 7d571f92784cd94a019292a1f45445077d0ef634
+     metrics:
+     - type: accuracy
+       value: 73.54068594485541
+     - type: f1
+       value: 73.46845879869848
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/MedicalRetrieval
+       name: MTEB MedicalRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 54.900000000000006
+     - type: map_at_10
+       value: 61.363
+     - type: map_at_100
+       value: 61.924
+     - type: map_at_1000
+       value: 61.967000000000006
+     - type: map_at_3
+       value: 59.767
+     - type: map_at_5
+       value: 60.802
+     - type: mrr_at_1
+       value: 55.1
+     - type: mrr_at_10
+       value: 61.454
+     - type: mrr_at_100
+       value: 62.016000000000005
+     - type: mrr_at_1000
+       value: 62.059
+     - type: mrr_at_3
+       value: 59.882999999999996
+     - type: mrr_at_5
+       value: 60.893
+     - type: ndcg_at_1
+       value: 54.900000000000006
+     - type: ndcg_at_10
+       value: 64.423
+     - type: ndcg_at_100
+       value: 67.35900000000001
+     - type: ndcg_at_1000
+       value: 68.512
+     - type: ndcg_at_3
+       value: 61.224000000000004
+     - type: ndcg_at_5
+       value: 63.083
+     - type: precision_at_1
+       value: 54.900000000000006
+     - type: precision_at_10
+       value: 7.3999999999999995
+     - type: precision_at_100
+       value: 0.882
+     - type: precision_at_1000
+       value: 0.097
+     - type: precision_at_3
+       value: 21.8
+     - type: precision_at_5
+       value: 13.98
+     - type: recall_at_1
+       value: 54.900000000000006
+     - type: recall_at_10
+       value: 74
+     - type: recall_at_100
+       value: 88.2
+     - type: recall_at_1000
+       value: 97.3
+     - type: recall_at_3
+       value: 65.4
+     - type: recall_at_5
+       value: 69.89999999999999
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/MultilingualSentiment-classification
+       name: MTEB MultilingualSentiment
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 75.15666666666667
+     - type: f1
+       value: 74.8306375354435
+   - task:
+       type: PairClassification
+     dataset:
+       type: C-MTEB/OCNLI
+       name: MTEB Ocnli
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_accuracy
+       value: 83.10774228478614
+     - type: cos_sim_ap
+       value: 87.17679348388666
+     - type: cos_sim_f1
+       value: 84.59302325581395
+     - type: cos_sim_precision
+       value: 78.15577439570276
+     - type: cos_sim_recall
+       value: 92.18585005279832
+     - type: dot_accuracy
+       value: 83.10774228478614
+     - type: dot_ap
+       value: 87.17679348388666
+     - type: dot_f1
+       value: 84.59302325581395
+     - type: dot_precision
+       value: 78.15577439570276
+     - type: dot_recall
+       value: 92.18585005279832
+     - type: euclidean_accuracy
+       value: 83.10774228478614
+     - type: euclidean_ap
+       value: 87.17679348388666
+     - type: euclidean_f1
+       value: 84.59302325581395
+     - type: euclidean_precision
+       value: 78.15577439570276
+     - type: euclidean_recall
+       value: 92.18585005279832
+     - type: manhattan_accuracy
+       value: 82.67460747157553
+     - type: manhattan_ap
+       value: 86.94296334435238
+     - type: manhattan_f1
+       value: 84.32327166504382
+     - type: manhattan_precision
+       value: 78.22944896115628
+     - type: manhattan_recall
+       value: 91.4466737064414
+     - type: max_accuracy
+       value: 83.10774228478614
+     - type: max_ap
+       value: 87.17679348388666
+     - type: max_f1
+       value: 84.59302325581395
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/OnlineShopping-classification
+       name: MTEB OnlineShopping
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 93.24999999999999
+     - type: ap
+       value: 90.98617641063584
+     - type: f1
+       value: 93.23447883650289
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/PAWSX
+       name: MTEB PAWSX
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 41.071417937737856
+     - type: cos_sim_spearman
+       value: 45.049199344455424
+     - type: euclidean_pearson
+       value: 44.913450096830786
+     - type: euclidean_spearman
+       value: 45.05733424275291
+     - type: manhattan_pearson
+       value: 44.881623825912065
+     - type: manhattan_spearman
+       value: 44.989923561416596
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/QBQTC
+       name: MTEB QBQTC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 41.38238052689359
+     - type: cos_sim_spearman
+       value: 42.61949690594399
+     - type: euclidean_pearson
+       value: 40.61261500356766
+     - type: euclidean_spearman
+       value: 42.619626605620724
+     - type: manhattan_pearson
+       value: 40.8886109204474
+     - type: manhattan_spearman
+       value: 42.75791523010463
+   - task:
+       type: STS
+     dataset:
+       type: mteb/sts22-crosslingual-sts
+       name: MTEB STS22 (zh)
+       config: zh
+       split: test
+       revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
+     metrics:
+     - type: cos_sim_pearson
+       value: 62.10977863727196
+     - type: cos_sim_spearman
+       value: 63.843727112473225
+     - type: euclidean_pearson
+       value: 63.25133487817196
+     - type: euclidean_spearman
+       value: 63.843727112473225
+     - type: manhattan_pearson
+       value: 63.58749018644103
+     - type: manhattan_spearman
+       value: 63.83820575456674
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/STSB
+       name: MTEB STSB
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 79.30616496720054
+     - type: cos_sim_spearman
+       value: 80.767935782436
+     - type: euclidean_pearson
+       value: 80.4160642670106
+     - type: euclidean_spearman
+       value: 80.76820284024356
+     - type: manhattan_pearson
+       value: 80.27318714580251
+     - type: manhattan_spearman
+       value: 80.61030164164964
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/T2Reranking
+       name: MTEB T2Reranking
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map
+       value: 66.26242871142425
+     - type: mrr
+       value: 76.20689863623174
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/T2Retrieval
+       name: MTEB T2Retrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 26.240999999999996
+     - type: map_at_10
+       value: 73.009
+     - type: map_at_100
+       value: 76.893
+     - type: map_at_1000
+       value: 76.973
+     - type: map_at_3
+       value: 51.339
+     - type: map_at_5
+       value: 63.003
+     - type: mrr_at_1
+       value: 87.458
+     - type: mrr_at_10
+       value: 90.44
+     - type: mrr_at_100
+       value: 90.558
+     - type: mrr_at_1000
+       value: 90.562
+     - type: mrr_at_3
+       value: 89.89
+     - type: mrr_at_5
+       value: 90.231
+     - type: ndcg_at_1
+       value: 87.458
+     - type: ndcg_at_10
+       value: 81.325
+     - type: ndcg_at_100
+       value: 85.61999999999999
+     - type: ndcg_at_1000
+       value: 86.394
+     - type: ndcg_at_3
+       value: 82.796
+     - type: ndcg_at_5
+       value: 81.219
+     - type: precision_at_1
+       value: 87.458
+     - type: precision_at_10
+       value: 40.534
+     - type: precision_at_100
+       value: 4.96
+     - type: precision_at_1000
+       value: 0.514
+     - type: precision_at_3
+       value: 72.444
+     - type: precision_at_5
+       value: 60.601000000000006
+     - type: recall_at_1
+       value: 26.240999999999996
+     - type: recall_at_10
+       value: 80.42
+     - type: recall_at_100
+       value: 94.118
+     - type: recall_at_1000
+       value: 98.02199999999999
+     - type: recall_at_3
+       value: 53.174
+     - type: recall_at_5
+       value: 66.739
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/TNews-classification
+       name: MTEB TNews
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 52.40899999999999
+     - type: f1
+       value: 50.68532128056062
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/ThuNewsClusteringP2P
+       name: MTEB ThuNewsClusteringP2P
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 65.57616085176686
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/ThuNewsClusteringS2S
+       name: MTEB ThuNewsClusteringS2S
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 58.844999922904925
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/VideoRetrieval
+       name: MTEB VideoRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 58.4
+     - type: map_at_10
+       value: 68.64
+     - type: map_at_100
+       value: 69.062
+     - type: map_at_1000
+       value: 69.073
+     - type: map_at_3
+       value: 66.567
+     - type: map_at_5
+       value: 67.89699999999999
+     - type: mrr_at_1
+       value: 58.4
+     - type: mrr_at_10
+       value: 68.64
+     - type: mrr_at_100
+       value: 69.062
+     - type: mrr_at_1000
+       value: 69.073
+     - type: mrr_at_3
+       value: 66.567
+     - type: mrr_at_5
+       value: 67.89699999999999
+     - type: ndcg_at_1
+       value: 58.4
+     - type: ndcg_at_10
+       value: 73.30600000000001
+     - type: ndcg_at_100
+       value: 75.276
+     - type: ndcg_at_1000
+       value: 75.553
+     - type: ndcg_at_3
+       value: 69.126
+     - type: ndcg_at_5
+       value: 71.519
+     - type: precision_at_1
+       value: 58.4
+     - type: precision_at_10
+       value: 8.780000000000001
+     - type: precision_at_100
+       value: 0.968
+     - type: precision_at_1000
+       value: 0.099
+     - type: precision_at_3
+       value: 25.5
+     - type: precision_at_5
+       value: 16.46
+     - type: recall_at_1
+       value: 58.4
+     - type: recall_at_10
+       value: 87.8
+     - type: recall_at_100
+       value: 96.8
+     - type: recall_at_1000
+       value: 99
+     - type: recall_at_3
+       value: 76.5
+     - type: recall_at_5
+       value: 82.3
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/waimai-classification
+       name: MTEB Waimai
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 86.21000000000001
+     - type: ap
+       value: 69.17460264576461
+     - type: f1
+       value: 84.68032984659226
+ license: apache-2.0
+ language:
+ - zh
+ - en
+ ---
+
+ <div align="center">
+     <img src="logo.png" alt="icon" width="100px"/>
+ </div>
+
+ <h1 align="center">Dmeta-embedding</h1>
+ <h4 align="center">
+     <p>
+         <a href="README.md">English</a> |
+         <a href="README_zh.md">中文</a>
+     </p>
+     <p>
+         <a href="#usage">Usage</a> |
+         <a href="#evaluation">Evaluation (reproducible)</a> |
+         <a href="#faq">FAQ</a> |
+         <a href="#contact">Contact</a> |
+         <a href="#license">License (free for commercial use)</a>
+     </p>
+ </h4>
+
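+ **Major update:**
+
+ - **2024.02.07**: We released an **Embedding API** product built on the Dmeta-embedding model. It is now in closed beta; [apply here](https://dmetasoul.feishu.cn/share/base/form/shrcnu7mN1BDwKFfgGXG9Rb1yDf) to receive a free quota of **400 million tokens**, enough to encode roughly GB-scale Chinese text.
+
+ - Our motivation. We want to open-source strong technology and also see it used in real business. Technology that gets used is good technology, and technology that can be deployed to create value is worth long-term investment. We aim to remove the last-mile obstacles to deployment so that you can adopt embedding technology at low cost, focus on your own business and product, and leave the complex technical parts to us.
+ - Applying and using. [Apply here](https://dmetasoul.feishu.cn/share/base/form/shrcnu7mN1BDwKFfgGXG9Rb1yDf) by filling in a form. We will reply by email from <[email protected]> within 48 hours. To stay compatible with the LLM ecosystem, the Embedding API is used in the same way as OpenAI's; the reply email explains the details.
+ - Join the community. We will keep investing in LLM/AIGC and related directions to bring valuable, low-barrier technology to the community. [Click the image](https://huggingface.co/DMetaSoul/Dmeta-embedding/resolve/main/weixin.jpeg) and scan the QR code to join our WeChat group, and let's push forward together in AIGC!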
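As a rough sanity check of the "GB-scale" figure, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: it assumes about one token per Chinese character and three bytes per character in UTF-8. Neither assumption comes from the README, and the real tokenizer ratio may differ.

```python
# Back-of-the-envelope estimate of how much Chinese text a token quota covers.
# ASSUMPTIONS (not from the README): roughly 1 token per Chinese character,
# and 3 bytes per character when the text is stored as UTF-8.
QUOTA_TOKENS = 400_000_000                   # the 400 million token quota
TOKENS_PER_CHAR = 1.0                        # assumed tokenizer ratio
BYTES_PER_CHAR = len("汉".encode("utf-8"))   # 3 bytes for a common CJK character


def estimated_text_bytes(quota_tokens: int) -> float:
    """Return the approximate UTF-8 size of the text that fits in the quota."""
    chars = quota_tokens / TOKENS_PER_CHAR
    return chars * BYTES_PER_CHAR


size_gb = estimated_text_bytes(QUOTA_TOKENS) / 1e9
print(f"~{size_gb:.1f} GB of UTF-8 Chinese text")
```

Under these assumptions the quota corresponds to a little over one gigabyte of raw text, which is consistent with the "roughly GB-scale" wording.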
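+
+
+ Dmeta-embedding is a cross-domain, cross-task, out-of-the-box Chinese embedding model suited to search, question answering, intelligent customer service, LLM+RAG, and other business scenarios. It can be loaded for inference with tools such as Transformers, Sentence-Transformers, and Langchain.
+
+ Its main advantages:
+
+ - Strong generalization across tasks and scenarios; it currently ranks **second on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) Chinese leaderboard** (as of 2024.01.25)
+ - The model weights are only **400MB**, which greatly reduces inference cost compared with models whose weights exceed a gigabyte
+ - Supports a context window of up to **1024** tokens, a better fit for long-text retrieval, RAG, and similar scenarios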
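For documents longer than the context window, one common approach is to split the text into overlapping windows and embed each window separately. The sketch below is illustrative only: `split_into_windows` is a hypothetical helper, not part of this model's tooling, and it counts characters as a stand-in for tokens. A real pipeline would measure length with the model's own tokenizer.

```python
from typing import List


def split_into_windows(text: str, max_len: int = 1024, overlap: int = 128) -> List[str]:
    """Split text into windows of at most max_len characters.

    Consecutive windows share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of them. Characters approximate tokens
    here (an assumption, not the model's real tokenization).
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the current window already reaches the end of the text
    return windows


long_text = "字" * 2500
print([len(chunk) for chunk in split_into_windows(long_text)])
```

Each window can then be passed to the encoder like any other text, and the best-scoring window can stand in for the whole document at retrieval time.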
+
+ ## Usage
+
+ The model currently supports inference through mainstream frameworks such as [Sentence-Transformers](#sentence-transformers), [Langchain](#langchain), and [Huggingface Transformers](#huggingface-transformers). See each framework's example for details.
+
+ ### Sentence-Transformers
+
+ The Dmeta-embedding model can be loaded for inference with [sentence-transformers](https://www.SBERT.net):
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Queries: "What should I do if my beard grows too fast?", "Where is a good place to buy a watch in Hong Kong"
+ texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
+ # Candidates: beard grows fast / make a beard less thick / buy a watch in Hong Kong / buy a phone in Hangzhou
+ texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
+
+ model = SentenceTransformer('DMetaSoul/Dmeta-embedding')
+ embs1 = model.encode(texts1, normalize_embeddings=True)
+ embs2 = model.encode(texts2, normalize_embeddings=True)
+
+ # Compute pairwise similarity
+ similarity = embs1 @ embs2.T
+ print(similarity)
+
+ # For each texts1[i], list the texts2[j] entries from most to least similar
+ for i in range(len(texts1)):
+     scores = []
+     for j in range(len(texts2)):
+         scores.append([texts2[j], similarity[i][j]])
+     scores = sorted(scores, key=lambda x:x[1], reverse=True)
+
+     print(f"Query text: {texts1[i]}")
+     for text2, score in scores:
+         print(f"Similar text: {text2}, score: {score}")
+     print()
+ ```
+
+ Example output:
+
+ ```
+ Query text: 胡子长得太快怎么办?
+ Similar text: 胡子长得快怎么办?, score: 0.9535336494445801
+ Similar text: 怎样使胡子不浓密!, score: 0.6776421070098877
+ Similar text: 香港买手表哪里好, score: 0.2297907918691635
+ Similar text: 在杭州手机到哪里买, score: 0.11386542022228241
+
+ Query text: 在香港哪里买手表好
+ Similar text: 香港买手表哪里好, score: 0.9843372106552124
+ Similar text: 在杭州手机到哪里买, score: 0.45211508870124817
+ Similar text: 胡子长得快怎么办?, score: 0.19985519349575043
+ Similar text: 怎样使胡子不浓密!, score: 0.18558596074581146
+ ```
1156
+
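The example above passes `normalize_embeddings=True` so that the matrix product `embs1 @ embs2.T` yields cosine similarities. A minimal sketch of why that holds, using tiny hand-written vectors in place of real 1024-dimensional model outputs (the vectors and helper names are illustrative assumptions, not part of any library API):

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Textbook cosine similarity on unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy "embeddings", chosen so the arithmetic is easy to follow.
query = [3.0, 4.0, 0.0]
candidates = {
    "near": [6.0, 8.0, 0.0],  # same direction as the query
    "mid": [3.0, 0.0, 4.0],   # partially aligned
    "far": [0.0, 0.0, 5.0],   # orthogonal to the query
}

# After normalization, a dot product is all that is needed.
q = l2_normalize(query)
scores = {name: dot(q, l2_normalize(vec)) for name, vec in candidates.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Because the vectors are normalized once up front, ranking many candidates reduces to a single matrix multiplication, which is what the example script does.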
1157
+ ### Langchain
1158
+
1159
+ The Dmeta-embedding model can be loaded for inference with the LLM tooling framework [langchain](https://www.langchain.com/):
1160
+
1161
+ ```
1162
+ pip install -U langchain
1163
+ ```
1164
+
1165
+ ```python
1166
+ import torch
1167
+ import numpy as np
1168
+ from langchain.embeddings import HuggingFaceEmbeddings  # newer Langchain releases may expose this class from langchain_community.embeddings instead
1169
+
1170
+ model_name = "DMetaSoul/Dmeta-embedding"
1171
+ model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
1172
+ encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
1173
+
1174
+ model = HuggingFaceEmbeddings(
1175
+     model_name=model_name,
1176
+     model_kwargs=model_kwargs,
1177
+     encode_kwargs=encode_kwargs,
1178
+ )
1179
+
1180
+ texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
1181
+ texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
1182
+
1183
+ embs1 = model.embed_documents(texts1)
1184
+ embs2 = model.embed_documents(texts2)
1185
+ embs1, embs2 = np.array(embs1), np.array(embs2)
1186
+
1187
+ # Compute pairwise similarity
1188
+ similarity = embs1 @ embs2.T
1189
+ print(similarity)
1190
+
1191
+ # For each texts1[i], rank texts2[j] from most to least similar
1192
+ for i in range(len(texts1)):
1193
+     scores = []
1194
+     for j in range(len(texts2)):
1195
+         scores.append([texts2[j], similarity[i][j]])
1196
+     scores = sorted(scores, key=lambda x:x[1], reverse=True)
1197
+
1198
+     print(f"Query: {texts1[i]}")
1199
+     for text2, score in scores:
1200
+         print(f"Match: {text2}, score: {score}")
1201
+     print()
1202
+ ```
1203
+
1204
+ ### HuggingFace Transformers
1205
+
1206
+ The Dmeta-embedding model can be loaded for inference with the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) framework:
1207
+
1208
+ ```
1209
+ pip install -U transformers
1210
+ ```
1211
+
1212
+ ```python
1213
+ import torch
1214
+ from transformers import AutoTokenizer, AutoModel
1215
+
1216
+
1217
+ def mean_pooling(model_output, attention_mask):
1218
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
1219
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
1220
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
1221
+
1222
+ def cls_pooling(model_output):
1223
+     return model_output[0][:, 0]  # embedding of the first ([CLS]) token of each sequence
1224
+
1225
+
1226
+ texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
1227
+ texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
1228
+
1229
+ tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding')
1230
+ model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding')
1231
+ model.eval()
1232
+
1233
+ with torch.no_grad():
1234
+     inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
1235
+     inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')
1236
+
1237
+     model_output1 = model(**inputs1)
1238
+     model_output2 = model(**inputs2)
1239
+     embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
1240
+     embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
1241
+     embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()
1242
+
1243
+ # Compute pairwise similarity
1244
+ similarity = embs1 @ embs2.T
1245
+ print(similarity)
1246
+
1247
+ # For each texts1[i], rank texts2[j] from most to least similar
1248
+ for i in range(len(texts1)):
1249
+     scores = []
1250
+     for j in range(len(texts2)):
1251
+         scores.append([texts2[j], similarity[i][j]])
1252
+     scores = sorted(scores, key=lambda x:x[1], reverse=True)
1253
+
1254
+     print(f"Query: {texts1[i]}")
1255
+     for text2, score in scores:
1256
+         print(f"Match: {text2}, score: {score}")
1257
+     print()
1258
+ ```
1259
+
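The script above defines both `mean_pooling` and `cls_pooling` but only calls `cls_pooling`. To make the difference concrete, here is a sketch in plain Python on a toy batch; the numbers are made up, and the helper names `cls_pool` and `mean_pool` are hypothetical stand-ins that mirror the tensor arithmetic rather than replace it:

```python
def cls_pool(token_embeddings):
    """Take the first token's vector of each sequence (the [CLS] position)."""
    return [seq[0] for seq in token_embeddings]

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions (mask == 0)."""
    pooled = []
    for seq, mask in zip(token_embeddings, attention_mask):
        dim = len(seq[0])
        count = max(sum(mask), 1e-9)  # same guard as torch.clamp(..., min=1e-9)
        sums = [sum(tok[d] * m for tok, m in zip(seq, mask)) for d in range(dim)]
        pooled.append([s / count for s in sums])
    return pooled

# One sequence with 3 token slots and 2-dimensional vectors; the last slot is padding.
toy_tokens = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]

cls_vec = cls_pool(toy_tokens)              # [[1.0, 2.0]]
mean_vec = mean_pool(toy_tokens, toy_mask)  # [[2.0, 3.0]]
print(cls_vec, mean_vec)
```

The padded slot is excluded from the mean, which is why the attention mask is needed. Which pooling a model expects depends on how it was trained; since the script above uses CLS pooling, that is the safer choice for this model.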
1260
+ ## Evaluation
1261
+
1262
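+ Dmeta-embedding ranks first among open-source models on the [MTEB Chinese leaderboard](https://huggingface.co/spaces/mteb/leaderboard) (as of 2024.01.25; Baichuan ranks first overall but is not open source). For the evaluation data and code, see the official MTEB [repository](https://github.com/embeddings-benchmark/mteb).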
1263
+
1264
+ **MTEB Chinese**:
1265
+
1266
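+ The [benchmark datasets](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) were collected and curated by the Beijing Academy of Artificial Intelligence (BAAI). They comprise 35 Chinese datasets across 6 classic task types, covering classification, retrieval, reranking, pair classification, STS, and more, and form a globally recognized leaderboard for all-round evaluation of embedding models.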
1267
+
1268
+ | Model | Vendor | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
1269
+ |:-------------------------------------------------------------------------------------------------------- | ------ |:-------------------:|:-----:|:---------:|:-----:|:------------------:|:--------------:|:---------:|:----------:|
1270
+ | [Dmeta-embedding](https://huggingface.co/DMetaSoul/Dmeta-embedding) | DMetaSoul | 1024 | 67.51 | 70.41 | 64.09 | 88.92 | 70 | 67.17 | 50.96 |
1271
+ | [gte-large-zh](https://huggingface.co/thenlper/gte-large-zh) | Alibaba DAMO Academy | 1024 | 66.72 | 72.49 | 57.82 | 84.41 | 71.34 | 67.4 | 53.07 |
1272
+ | [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | BAAI | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
1273
+ | [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | BAAI | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
1274
+ | [text-embedding-ada-002(OpenAI)](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) | OpenAI | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
1275
+ | [text2vec-base](https://huggingface.co/shibing624/text2vec-base-chinese) | Individual | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
1276
+ | [text2vec-large](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | Individual | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
1277
+
1278
+ ## FAQ
1279
+
1280
+ <details>
1281
+ <summary>1. Why does the model generalize so well across tasks and scenarios, working out of the box for many applications?</summary>
1282
+
1283
+ <!-- ### Why does the model generalize so well across tasks and scenarios, working out of the box for many applications? -->
1284
+
1285
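+ In short, the model's strong generalization comes from broad and diverse pre-training data, together with distinct optimization objectives designed for multi-task scenarios.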
1286
+
1287
+ Specifically, the key technical points are:
1288
+
1289
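+ 1) Large-scale weakly labeled contrastive learning. Industry experience shows that off-the-shelf language models perform poorly on embedding tasks, yet supervised data is expensive to label and acquire, which makes large-scale, high-quality weak-label learning a viable route. We extract weak labels from semi-structured web data such as forums, news, Q&A communities, and encyclopedias, and use a large model to filter out low-quality samples, yielding weakly supervised text pairs on the order of one billion.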
1290
+
1291
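+ 2) High-quality supervised learning. We collected and curated large-scale open-source labeled sentence-pair datasets spanning encyclopedias, education, finance, healthcare, law, news, academia, and other domains, totaling 30 million sentence pairs. We also mine hard negative pairs so that contrastive learning optimizes the model more effectively.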
1292
+
1293
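+ 3) Targeted optimization for retrieval. Search, question answering, and RAG are major deployment scenarios for embedding models, so to strengthen cross-domain and cross-scenario performance we optimized the model specifically for retrieval. The core is mining hard negatives from Q&A and retrieval data: using sparse retrieval, dense retrieval, and other methods, we built a dataset of hard negative pairs at the million scale, which significantly improves cross-domain retrieval performance.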
1294
+
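The points above mention contrastive learning and hard negatives without stating the training loss. As a rough illustration only, assuming an InfoNCE-style objective (a common choice for embedding models, not confirmed by this README), the sketch below scores a query against one positive and a set of negatives. The temperature value, the toy vectors, and the function name `info_nce` are all assumptions made for the example:

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among [positive] + negatives.

    All vectors are assumed to be L2-normalized, so dot products are cosines.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

# Hypothetical unit-length 2-dimensional embeddings.
query = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [0.0, 1.0]  # unrelated text: cosine 0.0 with the query
hard_negative = [0.6, 0.8]  # similar but wrong text: cosine 0.6 with the query

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
print(loss_easy, loss_hard)
```

Under this objective the hard negative produces a much larger loss than the easy one, so it contributes a stronger training signal. That is the usual motivation for mining hard negatives, as described in points 2 and 3.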
1295
+ </details>
1296
+
1297
+ <details>
1298
+ <summary>2. Can the model be used commercially?</summary>
1299
+
1300
+ <!-- ### Can the model be used commercially? -->
1301
+
1302
+ Our open-source model is released under the Apache-2.0 license and fully supports free commercial use.
1303
+
1304
+ </details>
1305
+
1306
+ <details>
1307
+ <summary>3. How can I reproduce the MTEB evaluation results?</summary>
1308
+
1309
+ <!-- ### How can I reproduce the MTEB evaluation results? -->
1310
+
1311
+ We provide the script `mteb_eval.py` in the model repository; you can run it directly to reproduce our evaluation results.
1312
+
1313
+ </details>
1314
+
1315
+ <details>
1316
+ <summary>4. What are the plans going forward?</summary>
1317
+
1318
+ <!-- ### What are the plans going forward? -->
1319
+
1320
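+ We will keep working to provide the community with embedding models that perform well, are lightweight at inference, and work out of the box across scenarios. We will also gradually integrate our embeddings into the existing technology ecosystem and grow together with the community!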
1321
+
1322
+ </details>
1323
+
1324
+ ## Contact
1325
+
1326
+ If you run into any problems while using the model, you are welcome to share feedback and suggestions in the [discussions](https://huggingface.co/DMetaSoul/Dmeta-embedding/discussions).
1327
+
1328
+ You can also contact us: 赵中昊 <[email protected]>, 肖文斌 <[email protected]>, 孙凯 <[email protected]>
1329
+
1330
+ ## License
1331
+
1332
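+ The Dmeta-embedding model is released under the Apache-2.0 License; the open-source model can be used free of charge for commercial purposes and private deployment.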