--- pipeline_tag: sentence-similarity tags: - sentence-transformers - feature-extraction - sentence-similarity - transformers - mteb model-index: - name: mmlw-e5-small results: - task: type: Clustering dataset: type: PL-MTEB/8tags-clustering name: MTEB 8TagsClustering config: default split: test revision: None metrics: - type: v_measure value: 31.772224277808153 - task: type: Classification dataset: type: PL-MTEB/allegro-reviews name: MTEB AllegroReviews config: default split: test revision: None metrics: - type: accuracy value: 33.03180914512922 - type: f1 value: 29.800304217426167 - task: type: Retrieval dataset: type: arguana-pl name: MTEB ArguAna-PL config: default split: test revision: None metrics: - type: map_at_1 value: 28.804999999999996 - type: map_at_10 value: 45.327 - type: map_at_100 value: 46.17 - type: map_at_1000 value: 46.177 - type: map_at_3 value: 40.528999999999996 - type: map_at_5 value: 43.335 - type: mrr_at_1 value: 30.299 - type: mrr_at_10 value: 45.763 - type: mrr_at_100 value: 46.641 - type: mrr_at_1000 value: 46.648 - type: mrr_at_3 value: 41.074 - type: mrr_at_5 value: 43.836999999999996 - type: ndcg_at_1 value: 28.804999999999996 - type: ndcg_at_10 value: 54.308 - type: ndcg_at_100 value: 57.879000000000005 - type: ndcg_at_1000 value: 58.048 - type: ndcg_at_3 value: 44.502 - type: ndcg_at_5 value: 49.519000000000005 - type: precision_at_1 value: 28.804999999999996 - type: precision_at_10 value: 8.286 - type: precision_at_100 value: 0.984 - type: precision_at_1000 value: 0.1 - type: precision_at_3 value: 18.682000000000002 - type: precision_at_5 value: 13.627 - type: recall_at_1 value: 28.804999999999996 - type: recall_at_10 value: 82.85900000000001 - type: recall_at_100 value: 98.36399999999999 - type: recall_at_1000 value: 99.644 - type: recall_at_3 value: 56.04599999999999 - type: recall_at_5 value: 68.137 - task: type: Classification dataset: type: PL-MTEB/cbd name: MTEB CBD config: default split: test revision: None metrics: - type: 
accuracy value: 64.24 - type: ap value: 17.967103105024705 - type: f1 value: 52.97375416129459 - task: type: PairClassification dataset: type: PL-MTEB/cdsce-pairclassification name: MTEB CDSC-E config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 88.8 - type: cos_sim_ap value: 76.68028778789487 - type: cos_sim_f1 value: 66.82352941176471 - type: cos_sim_precision value: 60.42553191489362 - type: cos_sim_recall value: 74.73684210526315 - type: dot_accuracy value: 88.1 - type: dot_ap value: 72.04910086070551 - type: dot_f1 value: 66.66666666666667 - type: dot_precision value: 69.31818181818183 - type: dot_recall value: 64.21052631578948 - type: euclidean_accuracy value: 88.8 - type: euclidean_ap value: 76.63591858340688 - type: euclidean_f1 value: 67.13286713286713 - type: euclidean_precision value: 60.25104602510461 - type: euclidean_recall value: 75.78947368421053 - type: manhattan_accuracy value: 88.9 - type: manhattan_ap value: 76.54552849815124 - type: manhattan_f1 value: 66.66666666666667 - type: manhattan_precision value: 60.51502145922747 - type: manhattan_recall value: 74.21052631578947 - type: max_accuracy value: 88.9 - type: max_ap value: 76.68028778789487 - type: max_f1 value: 67.13286713286713 - task: type: STS dataset: type: PL-MTEB/cdscr-sts name: MTEB CDSC-R config: default split: test revision: None metrics: - type: cos_sim_pearson value: 91.64169404461497 - type: cos_sim_spearman value: 91.9755161377078 - type: euclidean_pearson value: 90.87481478491249 - type: euclidean_spearman value: 91.92362666383987 - type: manhattan_pearson value: 90.8415510499638 - type: manhattan_spearman value: 91.85927127194698 - task: type: Retrieval dataset: type: dbpedia-pl name: MTEB DBPedia-PL config: default split: test revision: None metrics: - type: map_at_1 value: 6.148 - type: map_at_10 value: 12.870999999999999 - type: map_at_100 value: 18.04 - type: map_at_1000 value: 19.286 - type: map_at_3 value: 9.156 - type: map_at_5 value: 
10.857999999999999 - type: mrr_at_1 value: 53.25 - type: mrr_at_10 value: 61.016999999999996 - type: mrr_at_100 value: 61.48400000000001 - type: mrr_at_1000 value: 61.507999999999996 - type: mrr_at_3 value: 58.75 - type: mrr_at_5 value: 60.375 - type: ndcg_at_1 value: 41.0 - type: ndcg_at_10 value: 30.281000000000002 - type: ndcg_at_100 value: 33.955999999999996 - type: ndcg_at_1000 value: 40.77 - type: ndcg_at_3 value: 34.127 - type: ndcg_at_5 value: 32.274 - type: precision_at_1 value: 52.5 - type: precision_at_10 value: 24.525 - type: precision_at_100 value: 8.125 - type: precision_at_1000 value: 1.728 - type: precision_at_3 value: 37.083 - type: precision_at_5 value: 32.15 - type: recall_at_1 value: 6.148 - type: recall_at_10 value: 17.866 - type: recall_at_100 value: 39.213 - type: recall_at_1000 value: 61.604000000000006 - type: recall_at_3 value: 10.084 - type: recall_at_5 value: 13.333999999999998 - task: type: Retrieval dataset: type: fiqa-pl name: MTEB FiQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 14.643 - type: map_at_10 value: 23.166 - type: map_at_100 value: 24.725 - type: map_at_1000 value: 24.92 - type: map_at_3 value: 20.166 - type: map_at_5 value: 22.003 - type: mrr_at_1 value: 29.630000000000003 - type: mrr_at_10 value: 37.632 - type: mrr_at_100 value: 38.512 - type: mrr_at_1000 value: 38.578 - type: mrr_at_3 value: 35.391 - type: mrr_at_5 value: 36.857 - type: ndcg_at_1 value: 29.166999999999998 - type: ndcg_at_10 value: 29.749 - type: ndcg_at_100 value: 35.983 - type: ndcg_at_1000 value: 39.817 - type: ndcg_at_3 value: 26.739 - type: ndcg_at_5 value: 27.993000000000002 - type: precision_at_1 value: 29.166999999999998 - type: precision_at_10 value: 8.333 - type: precision_at_100 value: 1.448 - type: precision_at_1000 value: 0.213 - type: precision_at_3 value: 17.747 - type: precision_at_5 value: 13.58 - type: recall_at_1 value: 14.643 - type: recall_at_10 value: 35.247 - type: recall_at_100 value: 
59.150999999999996 - type: recall_at_1000 value: 82.565 - type: recall_at_3 value: 24.006 - type: recall_at_5 value: 29.383 - task: type: Retrieval dataset: type: hotpotqa-pl name: MTEB HotpotQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 32.627 - type: map_at_10 value: 48.041 - type: map_at_100 value: 49.008 - type: map_at_1000 value: 49.092999999999996 - type: map_at_3 value: 44.774 - type: map_at_5 value: 46.791 - type: mrr_at_1 value: 65.28 - type: mrr_at_10 value: 72.53500000000001 - type: mrr_at_100 value: 72.892 - type: mrr_at_1000 value: 72.909 - type: mrr_at_3 value: 71.083 - type: mrr_at_5 value: 71.985 - type: ndcg_at_1 value: 65.253 - type: ndcg_at_10 value: 57.13700000000001 - type: ndcg_at_100 value: 60.783 - type: ndcg_at_1000 value: 62.507000000000005 - type: ndcg_at_3 value: 52.17 - type: ndcg_at_5 value: 54.896 - type: precision_at_1 value: 65.253 - type: precision_at_10 value: 12.088000000000001 - type: precision_at_100 value: 1.496 - type: precision_at_1000 value: 0.172 - type: precision_at_3 value: 32.96 - type: precision_at_5 value: 21.931 - type: recall_at_1 value: 32.627 - type: recall_at_10 value: 60.439 - type: recall_at_100 value: 74.80799999999999 - type: recall_at_1000 value: 86.219 - type: recall_at_3 value: 49.44 - type: recall_at_5 value: 54.827999999999996 - task: type: Retrieval dataset: type: msmarco-pl name: MTEB MSMARCO-PL config: default split: validation revision: None metrics: - type: map_at_1 value: 13.150999999999998 - type: map_at_10 value: 21.179000000000002 - type: map_at_100 value: 22.227 - type: map_at_1000 value: 22.308 - type: map_at_3 value: 18.473 - type: map_at_5 value: 19.942999999999998 - type: mrr_at_1 value: 13.467 - type: mrr_at_10 value: 21.471 - type: mrr_at_100 value: 22.509 - type: mrr_at_1000 value: 22.585 - type: mrr_at_3 value: 18.789 - type: mrr_at_5 value: 20.262 - type: ndcg_at_1 value: 13.539000000000001 - type: ndcg_at_10 value: 25.942999999999998 - type: 
ndcg_at_100 value: 31.386999999999997 - type: ndcg_at_1000 value: 33.641 - type: ndcg_at_3 value: 20.368 - type: ndcg_at_5 value: 23.003999999999998 - type: precision_at_1 value: 13.539000000000001 - type: precision_at_10 value: 4.249 - type: precision_at_100 value: 0.7040000000000001 - type: precision_at_1000 value: 0.09 - type: precision_at_3 value: 8.782 - type: precision_at_5 value: 6.6049999999999995 - type: recall_at_1 value: 13.150999999999998 - type: recall_at_10 value: 40.698 - type: recall_at_100 value: 66.71000000000001 - type: recall_at_1000 value: 84.491 - type: recall_at_3 value: 25.452 - type: recall_at_5 value: 31.791000000000004 - task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (pl) config: pl split: test revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 metrics: - type: accuracy value: 67.3537323470074 - type: f1 value: 64.67852047603644 - task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (pl) config: pl split: test revision: 7d571f92784cd94a019292a1f45445077d0ef634 metrics: - type: accuracy value: 72.12508406186953 - type: f1 value: 71.55887309568853 - task: type: Retrieval dataset: type: nfcorpus-pl name: MTEB NFCorpus-PL config: default split: test revision: None metrics: - type: map_at_1 value: 4.18 - type: map_at_10 value: 9.524000000000001 - type: map_at_100 value: 12.272 - type: map_at_1000 value: 13.616 - type: map_at_3 value: 6.717 - type: map_at_5 value: 8.172 - type: mrr_at_1 value: 37.152 - type: mrr_at_10 value: 45.068000000000005 - type: mrr_at_100 value: 46.026 - type: mrr_at_1000 value: 46.085 - type: mrr_at_3 value: 43.344 - type: mrr_at_5 value: 44.412 - type: ndcg_at_1 value: 34.52 - type: ndcg_at_10 value: 27.604 - type: ndcg_at_100 value: 26.012999999999998 - type: ndcg_at_1000 value: 35.272 - type: ndcg_at_3 value: 31.538 - type: ndcg_at_5 value: 30.165999999999997 - type: precision_at_1 value: 36.223 - 
type: precision_at_10 value: 21.053 - type: precision_at_100 value: 7.08 - type: precision_at_1000 value: 1.9929999999999999 - type: precision_at_3 value: 30.031000000000002 - type: precision_at_5 value: 26.997 - type: recall_at_1 value: 4.18 - type: recall_at_10 value: 12.901000000000002 - type: recall_at_100 value: 27.438000000000002 - type: recall_at_1000 value: 60.768 - type: recall_at_3 value: 7.492 - type: recall_at_5 value: 10.05 - task: type: Retrieval dataset: type: nq-pl name: MTEB NQ-PL config: default split: test revision: None metrics: - type: map_at_1 value: 17.965 - type: map_at_10 value: 28.04 - type: map_at_100 value: 29.217 - type: map_at_1000 value: 29.285 - type: map_at_3 value: 24.818 - type: map_at_5 value: 26.617 - type: mrr_at_1 value: 20.22 - type: mrr_at_10 value: 30.148000000000003 - type: mrr_at_100 value: 31.137999999999998 - type: mrr_at_1000 value: 31.19 - type: mrr_at_3 value: 27.201999999999998 - type: mrr_at_5 value: 28.884999999999998 - type: ndcg_at_1 value: 20.365 - type: ndcg_at_10 value: 33.832 - type: ndcg_at_100 value: 39.33 - type: ndcg_at_1000 value: 41.099999999999994 - type: ndcg_at_3 value: 27.46 - type: ndcg_at_5 value: 30.584 - type: precision_at_1 value: 20.365 - type: precision_at_10 value: 5.849 - type: precision_at_100 value: 0.8959999999999999 - type: precision_at_1000 value: 0.107 - type: precision_at_3 value: 12.64 - type: precision_at_5 value: 9.334000000000001 - type: recall_at_1 value: 17.965 - type: recall_at_10 value: 49.503 - type: recall_at_100 value: 74.351 - type: recall_at_1000 value: 87.766 - type: recall_at_3 value: 32.665 - type: recall_at_5 value: 39.974 - task: type: Classification dataset: type: laugustyniak/abusive-clauses-pl name: MTEB PAC config: default split: test revision: None metrics: - type: accuracy value: 63.11323486823051 - type: ap value: 74.53486257377787 - type: f1 value: 60.631005373417736 - task: type: PairClassification dataset: type: PL-MTEB/ppc-pairclassification name: MTEB 
PPC config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 80.10000000000001 - type: cos_sim_ap value: 89.69526236458292 - type: cos_sim_f1 value: 83.37468982630274 - type: cos_sim_precision value: 83.30578512396694 - type: cos_sim_recall value: 83.44370860927152 - type: dot_accuracy value: 77.8 - type: dot_ap value: 87.72366051496104 - type: dot_f1 value: 82.83752860411899 - type: dot_precision value: 76.80339462517681 - type: dot_recall value: 89.90066225165563 - type: euclidean_accuracy value: 80.10000000000001 - type: euclidean_ap value: 89.61317191870039 - type: euclidean_f1 value: 83.40214698596202 - type: euclidean_precision value: 83.19604612850083 - type: euclidean_recall value: 83.6092715231788 - type: manhattan_accuracy value: 79.60000000000001 - type: manhattan_ap value: 89.48363786968471 - type: manhattan_f1 value: 82.96296296296296 - type: manhattan_precision value: 82.48772504091653 - type: manhattan_recall value: 83.44370860927152 - type: max_accuracy value: 80.10000000000001 - type: max_ap value: 89.69526236458292 - type: max_f1 value: 83.40214698596202 - task: type: PairClassification dataset: type: PL-MTEB/psc-pairclassification name: MTEB PSC config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 96.93877551020408 - type: cos_sim_ap value: 98.86489482248999 - type: cos_sim_f1 value: 95.11111111111113 - type: cos_sim_precision value: 92.507204610951 - type: cos_sim_recall value: 97.86585365853658 - type: dot_accuracy value: 95.73283858998145 - type: dot_ap value: 97.8261652492545 - type: dot_f1 value: 93.21533923303835 - type: dot_precision value: 90.28571428571428 - type: dot_recall value: 96.34146341463415 - type: euclidean_accuracy value: 96.93877551020408 - type: euclidean_ap value: 98.84837797066623 - type: euclidean_f1 value: 95.11111111111113 - type: euclidean_precision value: 92.507204610951 - type: euclidean_recall value: 97.86585365853658 - type: manhattan_accuracy value: 
96.84601113172542 - type: manhattan_ap value: 98.78659090944161 - type: manhattan_f1 value: 94.9404761904762 - type: manhattan_precision value: 92.73255813953489 - type: manhattan_recall value: 97.2560975609756 - type: max_accuracy value: 96.93877551020408 - type: max_ap value: 98.86489482248999 - type: max_f1 value: 95.11111111111113 - task: type: Classification dataset: type: PL-MTEB/polemo2_in name: MTEB PolEmo2.0-IN config: default split: test revision: None metrics: - type: accuracy value: 63.961218836565095 - type: f1 value: 64.3979989243291 - task: type: Classification dataset: type: PL-MTEB/polemo2_out name: MTEB PolEmo2.0-OUT config: default split: test revision: None metrics: - type: accuracy value: 40.32388663967612 - type: f1 value: 32.339117999015755 - task: type: Retrieval dataset: type: quora-pl name: MTEB Quora-PL config: default split: test revision: None metrics: - type: map_at_1 value: 62.757 - type: map_at_10 value: 76.55999999999999 - type: map_at_100 value: 77.328 - type: map_at_1000 value: 77.35499999999999 - type: map_at_3 value: 73.288 - type: map_at_5 value: 75.25500000000001 - type: mrr_at_1 value: 72.28 - type: mrr_at_10 value: 79.879 - type: mrr_at_100 value: 80.121 - type: mrr_at_1000 value: 80.12700000000001 - type: mrr_at_3 value: 78.40700000000001 - type: mrr_at_5 value: 79.357 - type: ndcg_at_1 value: 72.33000000000001 - type: ndcg_at_10 value: 81.151 - type: ndcg_at_100 value: 83.107 - type: ndcg_at_1000 value: 83.397 - type: ndcg_at_3 value: 77.3 - type: ndcg_at_5 value: 79.307 - type: precision_at_1 value: 72.33000000000001 - type: precision_at_10 value: 12.587000000000002 - type: precision_at_100 value: 1.488 - type: precision_at_1000 value: 0.155 - type: precision_at_3 value: 33.943 - type: precision_at_5 value: 22.61 - type: recall_at_1 value: 62.757 - type: recall_at_10 value: 90.616 - type: recall_at_100 value: 97.905 - type: recall_at_1000 value: 99.618 - type: recall_at_3 value: 79.928 - type: recall_at_5 value: 
85.30499999999999 - task: type: Retrieval dataset: type: scidocs-pl name: MTEB SCIDOCS-PL config: default split: test revision: None metrics: - type: map_at_1 value: 3.313 - type: map_at_10 value: 8.559999999999999 - type: map_at_100 value: 10.177999999999999 - type: map_at_1000 value: 10.459999999999999 - type: map_at_3 value: 6.094 - type: map_at_5 value: 7.323 - type: mrr_at_1 value: 16.3 - type: mrr_at_10 value: 25.579 - type: mrr_at_100 value: 26.717000000000002 - type: mrr_at_1000 value: 26.799 - type: mrr_at_3 value: 22.583000000000002 - type: mrr_at_5 value: 24.298000000000002 - type: ndcg_at_1 value: 16.3 - type: ndcg_at_10 value: 14.789 - type: ndcg_at_100 value: 21.731 - type: ndcg_at_1000 value: 27.261999999999997 - type: ndcg_at_3 value: 13.74 - type: ndcg_at_5 value: 12.199 - type: precision_at_1 value: 16.3 - type: precision_at_10 value: 7.779999999999999 - type: precision_at_100 value: 1.79 - type: precision_at_1000 value: 0.313 - type: precision_at_3 value: 12.933 - type: precision_at_5 value: 10.86 - type: recall_at_1 value: 3.313 - type: recall_at_10 value: 15.772 - type: recall_at_100 value: 36.392 - type: recall_at_1000 value: 63.525 - type: recall_at_3 value: 7.863 - type: recall_at_5 value: 11.003 - task: type: PairClassification dataset: type: PL-MTEB/sicke-pl-pairclassification name: MTEB SICK-E-PL config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 81.7977986139421 - type: cos_sim_ap value: 73.21294750778902 - type: cos_sim_f1 value: 66.57391304347826 - type: cos_sim_precision value: 65.05778382053025 - type: cos_sim_recall value: 68.16239316239316 - type: dot_accuracy value: 78.67916836526702 - type: dot_ap value: 63.61943815978181 - type: dot_f1 value: 62.45014245014245 - type: dot_precision value: 52.04178537511871 - type: dot_recall value: 78.06267806267806 - type: euclidean_accuracy value: 81.7774154097024 - type: euclidean_ap value: 73.25053778387148 - type: euclidean_f1 value: 66.55064392620953 - 
type: euclidean_precision value: 65.0782845473111 - type: euclidean_recall value: 68.09116809116809 - type: manhattan_accuracy value: 81.63473298002447 - type: manhattan_ap value: 72.99781945530033 - type: manhattan_f1 value: 66.3623595505618 - type: manhattan_precision value: 65.4432132963989 - type: manhattan_recall value: 67.3076923076923 - type: max_accuracy value: 81.7977986139421 - type: max_ap value: 73.25053778387148 - type: max_f1 value: 66.57391304347826 - task: type: STS dataset: type: PL-MTEB/sickr-pl-sts name: MTEB SICK-R-PL config: default split: test revision: None metrics: - type: cos_sim_pearson value: 79.62332929388755 - type: cos_sim_spearman value: 73.70598290849304 - type: euclidean_pearson value: 77.3603286710006 - type: euclidean_spearman value: 73.74420279933932 - type: manhattan_pearson value: 77.12735032552482 - type: manhattan_spearman value: 73.53014836690127 - task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (pl) config: pl split: test revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 metrics: - type: cos_sim_pearson value: 37.696942928686724 - type: cos_sim_spearman value: 40.6271445245692 - type: euclidean_pearson value: 30.212734461370832 - type: euclidean_spearman value: 40.66643376699638 - type: manhattan_pearson value: 29.90223716230108 - type: manhattan_spearman value: 40.35576319091178 - task: type: Retrieval dataset: type: scifact-pl name: MTEB SciFact-PL config: default split: test revision: None metrics: - type: map_at_1 value: 43.528 - type: map_at_10 value: 53.290000000000006 - type: map_at_100 value: 54.342 - type: map_at_1000 value: 54.376999999999995 - type: map_at_3 value: 50.651999999999994 - type: map_at_5 value: 52.248000000000005 - type: mrr_at_1 value: 46.666999999999994 - type: mrr_at_10 value: 55.286 - type: mrr_at_100 value: 56.094 - type: mrr_at_1000 value: 56.125 - type: mrr_at_3 value: 53.222 - type: mrr_at_5 value: 54.339000000000006 - type: ndcg_at_1 value: 46.0 - type: 
ndcg_at_10 value: 58.142 - type: ndcg_at_100 value: 62.426 - type: ndcg_at_1000 value: 63.395999999999994 - type: ndcg_at_3 value: 53.53 - type: ndcg_at_5 value: 55.842000000000006 - type: precision_at_1 value: 46.0 - type: precision_at_10 value: 7.9670000000000005 - type: precision_at_100 value: 1.023 - type: precision_at_1000 value: 0.11100000000000002 - type: precision_at_3 value: 21.444 - type: precision_at_5 value: 14.333000000000002 - type: recall_at_1 value: 43.528 - type: recall_at_10 value: 71.511 - type: recall_at_100 value: 89.93299999999999 - type: recall_at_1000 value: 97.667 - type: recall_at_3 value: 59.067 - type: recall_at_5 value: 64.789 - task: type: Retrieval dataset: type: trec-covid-pl name: MTEB TRECCOVID-PL config: default split: test revision: None metrics: - type: map_at_1 value: 0.22699999999999998 - type: map_at_10 value: 1.3379999999999999 - type: map_at_100 value: 6.965000000000001 - type: map_at_1000 value: 17.135 - type: map_at_3 value: 0.53 - type: map_at_5 value: 0.799 - type: mrr_at_1 value: 84.0 - type: mrr_at_10 value: 88.083 - type: mrr_at_100 value: 88.432 - type: mrr_at_1000 value: 88.432 - type: mrr_at_3 value: 87.333 - type: mrr_at_5 value: 87.833 - type: ndcg_at_1 value: 76.0 - type: ndcg_at_10 value: 58.199 - type: ndcg_at_100 value: 43.230000000000004 - type: ndcg_at_1000 value: 39.751 - type: ndcg_at_3 value: 63.743 - type: ndcg_at_5 value: 60.42999999999999 - type: precision_at_1 value: 84.0 - type: precision_at_10 value: 62.0 - type: precision_at_100 value: 44.519999999999996 - type: precision_at_1000 value: 17.746000000000002 - type: precision_at_3 value: 67.333 - type: precision_at_5 value: 63.2 - type: recall_at_1 value: 0.22699999999999998 - type: recall_at_10 value: 1.627 - type: recall_at_100 value: 10.600999999999999 - type: recall_at_1000 value: 37.532 - type: recall_at_3 value: 0.547 - type: recall_at_5 value: 0.864 language: pl license: apache-2.0 widget: - source_sentence: "query: Jak dożyć 100 lat?" 
sentences: - "passage: Trzeba zdrowo się odżywiać i uprawiać sport." - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami." - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu." ---

# MMLW-e5-small

MMLW (muszę mieć lepszą wiadomość, "I must have a better message") is a family of neural text encoders for Polish. This is a distilled model that can be used to generate embeddings applicable to many tasks, such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning. It transforms texts into 384-dimensional vectors.

The model was initialized from a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "**. ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Prefixes required by this model for queries and passages.
query_prefix = "query: "
answer_prefix = "passage: "

queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = SentenceTransformer("sdadas/mmlw-e5-small")

# Encode the query and the candidate passages into embedding tensors.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```

## Evaluation Results

- The model achieves an **Average Score** of **55.84** on the Polish Massive Text Embedding Benchmark (MTEB).
  See [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for detailed results.
- The model achieves **NDCG@10** of **47.64** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Acknowledgements

This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
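
## Appendix: what the ranking step computes

The usage example picks the best passage by taking the arg-max of the cosine similarity between the query embedding and each passage embedding. The sketch below reproduces that step in plain Python so the logic can be inspected without downloading the model. It is only an illustration: the helper names `with_prefix`, `cosine`, and `best_match` are hypothetical and not part of sentence-transformers, and the 3-dimensional vectors are made-up stand-ins for the real 384-dimensional embeddings.

```python
import math

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(prefix, texts):
    """Prepend the required prefix to every text (hypothetical helper)."""
    return [prefix + text for text in texts]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query_vec, passage_vecs):
    """Index of the passage vector most similar to the query vector."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors (assumed values), standing in for real model embeddings.
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.9, 0.1, 0.8],   # points roughly the same way as the query
    [-1.0, 0.5, 0.0],  # points away from the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
]

queries = with_prefix(QUERY_PREFIX, ["Jak dożyć 100 lat?"])
best = best_match(query_vec, passage_vecs)
print(queries[0])  # query: Jak dożyć 100 lat?
print(best)        # 0
```

With real embeddings, `query_vec` and `passage_vecs` would come from `model.encode(...)` applied to the prefixed texts, as in the usage example.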