bge-m3-custom-fr / README.md
metadata
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - mteb
model-index:
  - name: bge-m3-custom-fr
    results:
      - task:
          type: Clustering
        dataset:
          type: lyon-nlp/alloprof
          name: MTEB AlloProfClusteringP2P
          config: default
          split: test
          revision: 392ba3f5bcc8c51f578786c1fc3dae648662cb9b
        metrics:
          - type: v_measure
            value: 56.727459716713
      - task:
          type: Clustering
        dataset:
          type: lyon-nlp/alloprof
          name: MTEB AlloProfClusteringS2S
          config: default
          split: test
          revision: 392ba3f5bcc8c51f578786c1fc3dae648662cb9b
        metrics:
          - type: v_measure
            value: 38.19920006179227
      - task:
          type: Reranking
        dataset:
          type: lyon-nlp/mteb-fr-reranking-alloprof-s2p
          name: MTEB AlloprofReranking
          config: default
          split: test
          revision: e40c8a63ce02da43200eccb5b0846fcaa888f562
        metrics:
          - type: map
            value: 65.17465797499942
          - type: mrr
            value: 66.51400197384653
      - task:
          type: Retrieval
        dataset:
          type: lyon-nlp/alloprof
          name: MTEB AlloprofRetrieval
          config: default
          split: test
          revision: 2df7bee4080bedf2e97de3da6bd5c7bc9fc9c4d2
        metrics:
          - type: map_at_1
            value: 29.836000000000002
          - type: map_at_10
            value: 39.916000000000004
          - type: map_at_100
            value: 40.816
          - type: map_at_1000
            value: 40.877
          - type: map_at_3
            value: 37.294
          - type: map_at_5
            value: 38.838
          - type: mrr_at_1
            value: 29.836000000000002
          - type: mrr_at_10
            value: 39.916000000000004
          - type: mrr_at_100
            value: 40.816
          - type: mrr_at_1000
            value: 40.877
          - type: mrr_at_3
            value: 37.294
          - type: mrr_at_5
            value: 38.838
          - type: ndcg_at_1
            value: 29.836000000000002
          - type: ndcg_at_10
            value: 45.097
          - type: ndcg_at_100
            value: 49.683
          - type: ndcg_at_1000
            value: 51.429
          - type: ndcg_at_3
            value: 39.717
          - type: ndcg_at_5
            value: 42.501
          - type: precision_at_1
            value: 29.836000000000002
          - type: precision_at_10
            value: 6.149
          - type: precision_at_100
            value: 0.8340000000000001
          - type: precision_at_1000
            value: 0.097
          - type: precision_at_3
            value: 15.576
          - type: precision_at_5
            value: 10.698
          - type: recall_at_1
            value: 29.836000000000002
          - type: recall_at_10
            value: 61.485
          - type: recall_at_100
            value: 83.428
          - type: recall_at_1000
            value: 97.461
          - type: recall_at_3
            value: 46.727000000000004
          - type: recall_at_5
            value: 53.489
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_reviews_multi
          name: MTEB AmazonReviewsClassification (fr)
          config: fr
          split: test
          revision: 1399c76144fd37290681b995c656ef9b2e06e26d
        metrics:
          - type: accuracy
            value: 42.332
          - type: f1
            value: 40.801800929404344
      - task:
          type: Retrieval
        dataset:
          type: maastrichtlawtech/bsard
          name: MTEB BSARDRetrieval
          config: default
          split: test
          revision: 5effa1b9b5fa3b0f9e12523e6e43e5f86a6e6d59
        metrics:
          - type: map_at_1
            value: 0
          - type: map_at_10
            value: 0
          - type: map_at_100
            value: 0.011000000000000001
          - type: map_at_1000
            value: 0.018000000000000002
          - type: map_at_3
            value: 0
          - type: map_at_5
            value: 0
          - type: mrr_at_1
            value: 0
          - type: mrr_at_10
            value: 0
          - type: mrr_at_100
            value: 0.011000000000000001
          - type: mrr_at_1000
            value: 0.018000000000000002
          - type: mrr_at_3
            value: 0
          - type: mrr_at_5
            value: 0
          - type: ndcg_at_1
            value: 0
          - type: ndcg_at_10
            value: 0
          - type: ndcg_at_100
            value: 0.13999999999999999
          - type: ndcg_at_1000
            value: 0.457
          - type: ndcg_at_3
            value: 0
          - type: ndcg_at_5
            value: 0
          - type: precision_at_1
            value: 0
          - type: precision_at_10
            value: 0
          - type: precision_at_100
            value: 0.009000000000000001
          - type: precision_at_1000
            value: 0.004
          - type: precision_at_3
            value: 0
          - type: precision_at_5
            value: 0
          - type: recall_at_1
            value: 0
          - type: recall_at_10
            value: 0
          - type: recall_at_100
            value: 0.901
          - type: recall_at_1000
            value: 3.604
          - type: recall_at_3
            value: 0
          - type: recall_at_5
            value: 0
      - task:
          type: Clustering
        dataset:
          type: lyon-nlp/clustering-hal-s2s
          name: MTEB HALClusteringS2S
          config: default
          split: test
          revision: e06ebbbb123f8144bef1a5d18796f3dec9ae2915
        metrics:
          - type: v_measure
            value: 24.1294565929144
      - task:
          type: Clustering
        dataset:
          type: mlsum
          name: MTEB MLSUMClusteringP2P
          config: default
          split: test
          revision: b5d54f8f3b61ae17845046286940f03c6bc79bc7
        metrics:
          - type: v_measure
            value: 42.12040762356958
      - task:
          type: Clustering
        dataset:
          type: mlsum
          name: MTEB MLSUMClusteringS2S
          config: default
          split: test
          revision: b5d54f8f3b61ae17845046286940f03c6bc79bc7
        metrics:
          - type: v_measure
            value: 36.69102548662494
      - task:
          type: Classification
        dataset:
          type: mteb/mtop_domain
          name: MTEB MTOPDomainClassification (fr)
          config: fr
          split: test
          revision: d80d48c1eb48d3562165c59d59d0034df9fff0bf
        metrics:
          - type: accuracy
            value: 90.3946132164109
          - type: f1
            value: 90.15608090764273
      - task:
          type: Classification
        dataset:
          type: mteb/mtop_intent
          name: MTEB MTOPIntentClassification (fr)
          config: fr
          split: test
          revision: ae001d0e6b1228650b7bd1c2c65fb50ad11a8aba
        metrics:
          - type: accuracy
            value: 60.87691825869088
          - type: f1
            value: 43.56160799721332
      - task:
          type: Classification
        dataset:
          type: masakhane/masakhanews
          name: MTEB MasakhaNEWSClassification (fra)
          config: fra
          split: test
          revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60
        metrics:
          - type: accuracy
            value: 70.52132701421802
          - type: f1
            value: 66.7911493789742
      - task:
          type: Clustering
        dataset:
          type: masakhane/masakhanews
          name: MTEB MasakhaNEWSClusteringP2P (fra)
          config: fra
          split: test
          revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60
        metrics:
          - type: v_measure
            value: 34.60975901092521
      - task:
          type: Clustering
        dataset:
          type: masakhane/masakhanews
          name: MTEB MasakhaNEWSClusteringS2S (fra)
          config: fra
          split: test
          revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60
        metrics:
          - type: v_measure
            value: 32.8092912406207
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_massive_intent
          name: MTEB MassiveIntentClassification (fr)
          config: fr
          split: test
          revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
        metrics:
          - type: accuracy
            value: 66.70477471418964
          - type: f1
            value: 64.4848306188641
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_massive_scenario
          name: MTEB MassiveScenarioClassification (fr)
          config: fr
          split: test
          revision: 7d571f92784cd94a019292a1f45445077d0ef634
        metrics:
          - type: accuracy
            value: 74.57969065232011
          - type: f1
            value: 73.58251655418402
      - task:
          type: Retrieval
        dataset:
          type: jinaai/mintakaqa
          name: MTEB MintakaRetrieval (fr)
          config: fr
          split: test
          revision: efa78cc2f74bbcd21eff2261f9e13aebe40b814e
        metrics:
          - type: map_at_1
            value: 14.005
          - type: map_at_10
            value: 21.279999999999998
          - type: map_at_100
            value: 22.288
          - type: map_at_1000
            value: 22.404
          - type: map_at_3
            value: 19.151
          - type: map_at_5
            value: 20.322000000000003
          - type: mrr_at_1
            value: 14.005
          - type: mrr_at_10
            value: 21.279999999999998
          - type: mrr_at_100
            value: 22.288
          - type: mrr_at_1000
            value: 22.404
          - type: mrr_at_3
            value: 19.151
          - type: mrr_at_5
            value: 20.322000000000003
          - type: ndcg_at_1
            value: 14.005
          - type: ndcg_at_10
            value: 25.173000000000002
          - type: ndcg_at_100
            value: 30.452
          - type: ndcg_at_1000
            value: 34.241
          - type: ndcg_at_3
            value: 20.768
          - type: ndcg_at_5
            value: 22.869
          - type: precision_at_1
            value: 14.005
          - type: precision_at_10
            value: 3.759
          - type: precision_at_100
            value: 0.631
          - type: precision_at_1000
            value: 0.095
          - type: precision_at_3
            value: 8.477
          - type: precision_at_5
            value: 6.101999999999999
          - type: recall_at_1
            value: 14.005
          - type: recall_at_10
            value: 37.592
          - type: recall_at_100
            value: 63.144999999999996
          - type: recall_at_1000
            value: 94.513
          - type: recall_at_3
            value: 25.430000000000003
          - type: recall_at_5
            value: 30.508000000000003
      - task:
          type: PairClassification
        dataset:
          type: GEM/opusparcus
          name: MTEB OpusparcusPC (fr)
          config: fr
          split: test
          revision: 9e9b1f8ef51616073f47f306f7f47dd91663f86a
        metrics:
          - type: cos_sim_accuracy
            value: 81.60762942779292
          - type: cos_sim_ap
            value: 93.33850264444463
          - type: cos_sim_f1
            value: 87.24705882352941
          - type: cos_sim_precision
            value: 82.91592128801432
          - type: cos_sim_recall
            value: 92.05561072492551
          - type: dot_accuracy
            value: 81.60762942779292
          - type: dot_ap
            value: 93.33850264444463
          - type: dot_f1
            value: 87.24705882352941
          - type: dot_precision
            value: 82.91592128801432
          - type: dot_recall
            value: 92.05561072492551
          - type: euclidean_accuracy
            value: 81.60762942779292
          - type: euclidean_ap
            value: 93.3384939260791
          - type: euclidean_f1
            value: 87.24705882352941
          - type: euclidean_precision
            value: 82.91592128801432
          - type: euclidean_recall
            value: 92.05561072492551
          - type: manhattan_accuracy
            value: 81.60762942779292
          - type: manhattan_ap
            value: 93.27064794794664
          - type: manhattan_f1
            value: 87.27440999537251
          - type: manhattan_precision
            value: 81.7157712305026
          - type: manhattan_recall
            value: 93.64448857994041
          - type: max_accuracy
            value: 81.60762942779292
          - type: max_ap
            value: 93.33850264444463
          - type: max_f1
            value: 87.27440999537251
      - task:
          type: PairClassification
        dataset:
          type: paws-x
          name: MTEB PawsX (fr)
          config: fr
          split: test
          revision: 8a04d940a42cd40658986fdd8e3da561533a3646
        metrics:
          - type: cos_sim_accuracy
            value: 61.95
          - type: cos_sim_ap
            value: 60.8497942066519
          - type: cos_sim_f1
            value: 62.53032928942807
          - type: cos_sim_precision
            value: 45.50958627648839
          - type: cos_sim_recall
            value: 99.88925802879291
          - type: dot_accuracy
            value: 61.95
          - type: dot_ap
            value: 60.83772617132806
          - type: dot_f1
            value: 62.53032928942807
          - type: dot_precision
            value: 45.50958627648839
          - type: dot_recall
            value: 99.88925802879291
          - type: euclidean_accuracy
            value: 61.95
          - type: euclidean_ap
            value: 60.8497942066519
          - type: euclidean_f1
            value: 62.53032928942807
          - type: euclidean_precision
            value: 45.50958627648839
          - type: euclidean_recall
            value: 99.88925802879291
          - type: manhattan_accuracy
            value: 61.9
          - type: manhattan_ap
            value: 60.87914286416435
          - type: manhattan_f1
            value: 62.491349480968864
          - type: manhattan_precision
            value: 45.44539506794162
          - type: manhattan_recall
            value: 100
          - type: max_accuracy
            value: 61.95
          - type: max_ap
            value: 60.87914286416435
          - type: max_f1
            value: 62.53032928942807
      - task:
          type: STS
        dataset:
          type: Lajavaness/SICK-fr
          name: MTEB SICKFr
          config: default
          split: test
          revision: e077ab4cf4774a1e36d86d593b150422fafd8e8a
        metrics:
          - type: cos_sim_pearson
            value: 81.24400370393097
          - type: cos_sim_spearman
            value: 75.50548831172674
          - type: euclidean_pearson
            value: 77.81039134726188
          - type: euclidean_spearman
            value: 75.50504199480463
          - type: manhattan_pearson
            value: 77.79383923445839
          - type: manhattan_spearman
            value: 75.472882776806
      - task:
          type: STS
        dataset:
          type: mteb/sts22-crosslingual-sts
          name: MTEB STS22 (fr)
          config: fr
          split: test
          revision: eea2b4fe26a775864c896887d910b76a8098ad3f
        metrics:
          - type: cos_sim_pearson
            value: 80.48474973785514
          - type: cos_sim_spearman
            value: 81.69566405041475
          - type: euclidean_pearson
            value: 78.32784472269549
          - type: euclidean_spearman
            value: 81.69566405041475
          - type: manhattan_pearson
            value: 78.2856100079857
          - type: manhattan_spearman
            value: 81.84463256785325
      - task:
          type: STS
        dataset:
          type: PhilipMay/stsb_multi_mt
          name: MTEB STSBenchmarkMultilingualSTS (fr)
          config: fr
          split: test
          revision: 93d57ef91790589e3ce9c365164337a8a78b7632
        metrics:
          - type: cos_sim_pearson
            value: 80.68785966129913
          - type: cos_sim_spearman
            value: 81.29936344904975
          - type: euclidean_pearson
            value: 80.25462090186443
          - type: euclidean_spearman
            value: 81.29928746010391
          - type: manhattan_pearson
            value: 80.17083094559602
          - type: manhattan_spearman
            value: 81.18921827402406
      - task:
          type: Summarization
        dataset:
          type: lyon-nlp/summarization-summeval-fr-p2p
          name: MTEB SummEvalFr
          config: default
          split: test
          revision: b385812de6a9577b6f4d0f88c6a6e35395a94054
        metrics:
          - type: cos_sim_pearson
            value: 31.66113105701837
          - type: cos_sim_spearman
            value: 30.13316633681715
          - type: dot_pearson
            value: 31.66113064418324
          - type: dot_spearman
            value: 30.13316633681715
      - task:
          type: Reranking
        dataset:
          type: lyon-nlp/mteb-fr-reranking-syntec-s2p
          name: MTEB SyntecReranking
          config: default
          split: test
          revision: b205c5084a0934ce8af14338bf03feb19499c84d
        metrics:
          - type: map
            value: 85.43333333333334
          - type: mrr
            value: 85.43333333333334
      - task:
          type: Retrieval
        dataset:
          type: lyon-nlp/mteb-fr-retrieval-syntec-s2p
          name: MTEB SyntecRetrieval
          config: default
          split: test
          revision: aa460cd4d177e6a3c04fcd2affd95e8243289033
        metrics:
          - type: map_at_1
            value: 65
          - type: map_at_10
            value: 75.19200000000001
          - type: map_at_100
            value: 75.77000000000001
          - type: map_at_1000
            value: 75.77000000000001
          - type: map_at_3
            value: 73.667
          - type: map_at_5
            value: 75.067
          - type: mrr_at_1
            value: 65
          - type: mrr_at_10
            value: 75.19200000000001
          - type: mrr_at_100
            value: 75.77000000000001
          - type: mrr_at_1000
            value: 75.77000000000001
          - type: mrr_at_3
            value: 73.667
          - type: mrr_at_5
            value: 75.067
          - type: ndcg_at_1
            value: 65
          - type: ndcg_at_10
            value: 79.145
          - type: ndcg_at_100
            value: 81.34400000000001
          - type: ndcg_at_1000
            value: 81.34400000000001
          - type: ndcg_at_3
            value: 76.333
          - type: ndcg_at_5
            value: 78.82900000000001
          - type: precision_at_1
            value: 65
          - type: precision_at_10
            value: 9.1
          - type: precision_at_100
            value: 1
          - type: precision_at_1000
            value: 0.1
          - type: precision_at_3
            value: 28.000000000000004
          - type: precision_at_5
            value: 18
          - type: recall_at_1
            value: 65
          - type: recall_at_10
            value: 91
          - type: recall_at_100
            value: 100
          - type: recall_at_1000
            value: 100
          - type: recall_at_3
            value: 84
          - type: recall_at_5
            value: 90
      - task:
          type: Retrieval
        dataset:
          type: jinaai/xpqa
          name: MTEB XPQARetrieval (fr)
          config: fr
          split: test
          revision: c99d599f0a6ab9b85b065da6f9d94f9cf731679f
        metrics:
          - type: map_at_1
            value: 40.225
          - type: map_at_10
            value: 61.833000000000006
          - type: map_at_100
            value: 63.20400000000001
          - type: map_at_1000
            value: 63.27
          - type: map_at_3
            value: 55.593
          - type: map_at_5
            value: 59.65200000000001
          - type: mrr_at_1
            value: 63.284
          - type: mrr_at_10
            value: 71.351
          - type: mrr_at_100
            value: 71.772
          - type: mrr_at_1000
            value: 71.786
          - type: mrr_at_3
            value: 69.381
          - type: mrr_at_5
            value: 70.703
          - type: ndcg_at_1
            value: 63.284
          - type: ndcg_at_10
            value: 68.49199999999999
          - type: ndcg_at_100
            value: 72.79299999999999
          - type: ndcg_at_1000
            value: 73.735
          - type: ndcg_at_3
            value: 63.278
          - type: ndcg_at_5
            value: 65.19200000000001
          - type: precision_at_1
            value: 63.284
          - type: precision_at_10
            value: 15.661
          - type: precision_at_100
            value: 1.9349999999999998
          - type: precision_at_1000
            value: 0.207
          - type: precision_at_3
            value: 38.273
          - type: precision_at_5
            value: 27.397
          - type: recall_at_1
            value: 40.225
          - type: recall_at_10
            value: 77.66999999999999
          - type: recall_at_100
            value: 93.887
          - type: recall_at_1000
            value: 99.70599999999999
          - type: recall_at_3
            value: 61.133
          - type: recall_at_5
            value: 69.789

bge-m3-custom-fr

This is a sentence-transformers model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

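from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace {MODEL_NAME} with this model's repository ID on the Hugging Face Hub.
model = SentenceTransformer('{MODEL_NAME}')

# encode() returns one 1024-dimensional vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings)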
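
For semantic search, the embeddings can be compared directly. The architecture listed in this card ends with a Normalize() module, so the output vectors should have unit length, and a plain dot product then equals cosine similarity. The following is a minimal sketch of that ranking step, not part of the library: the helper names `dot` and `rank_by_similarity` are hypothetical, and the 3-dimensional unit vectors are toy stand-ins for real 1024-dimensional outputs of `model.encode`.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (corpus_index, score) pairs, most similar first.

    Assumes every vector is already unit length, so the dot
    product is the cosine similarity.
    """
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for model.encode(...) outputs (assumption).
query = [1.0, 0.0, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # identical to the query
    [0.6, 0.8, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(index, round(score, 3))
```

With real embeddings, `query` would be one row of `model.encode([...])` and `corpus` the rows for the candidate documents.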

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
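
The printout above describes a three-stage pipeline: the transformer produces one vector per token, the Pooling module keeps only the CLS (first) token because `pooling_mode_cls_token` is the only mode set to True, and Normalize() rescales the result to unit length. The sketch below illustrates the last two stages in plain Python on toy data. It is an illustration under those assumptions, not the library's implementation; the function names `cls_pool` and `l2_normalize` and the 2-dimensional token vectors are hypothetical.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: keep the vector of the first token only."""
    return token_embeddings[0]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


# Toy transformer output: 3 tokens, 2 dimensions each.
# The real model emits 1024 dimensions per token (assumed from the card).
token_embeddings = [
    [3.0, 4.0],   # first (CLS) token
    [10.0, 0.0],
    [0.0, -7.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Only the first token's vector influences the result here; the other tokens contribute to it inside the transformer through attention, not at the pooling stage.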

Citing & Authors