language:
  - multilingual
  - zh
  - ja
  - ar
  - ko
  - de
  - fr
  - es
  - pt
  - hi
  - id
  - it
  - tr
  - ru
  - bn
  - ur
  - mr
  - ta
  - vi
  - fa
  - pl
  - uk
  - nl
  - sv
  - he
  - sw
  - ps
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dataset_size:10K<n<100K
  - loss:CoSENTLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
widget:
  - source_sentence: Is that wrong?
    sentences:
      - Is that such a terrible thing?
      - Kennedy korkunç bir savcıydı.
      - Tom bir davada tanıklık ediyordu.
  - source_sentence: Orada mıydılar?
    sentences:
      - Were they in there?
      - İlki ikincisini anlamlı kılar.
      - Alerji tedavisi gelişiyor.
  - source_sentence: He is not alone
    sentences:
      - It is not confusing
      - The Hawks were humanitarians.
      - Tom bir davada tanıklık ediyordu.
  - source_sentence: Yaptığın şey bu.
    sentences:
      - Onurlu işler yapıyorsunuz.
      - Weisberg azınlık adına konuştu.
      - Robert Ferrigno Kaliforniya'da doğdu.
  - source_sentence: Ben vatansızım.
    sentences:
      - I am stateless.
      - Kendi tekniğini tercih ediyor.
      - Mermiler camdan fırladı.
pipeline_tag: sentence-similarity
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: tr ling
          type: tr_ling
        metrics:
          - type: pearson_cosine
            value: 0.037604255015168134
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.04804112988506346
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.034740275152181296
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.03769766156967754
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.03698411306484619
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.03903062430281842
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.0673696846368413
            name: Pearson Dot
          - type: spearman_dot
            value: 0.06818119362900125
            name: Spearman Dot
          - type: pearson_max
            value: 0.0673696846368413
            name: Pearson Max
          - type: spearman_max
            value: 0.06818119362900125
            name: Spearman Max

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the MoritzLaurer/multilingual-nli-26lang-2mil7 dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7

Model Sources

  • Documentation: https://sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
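
The pooling module averages the token embeddings produced by the transformer. As a rough sketch of what the two modules above do, the same embeddings can be reproduced with the plain transformers library and manual mean pooling; "sentence_transformers_model_id" is the same placeholder used in the usage example below and should be replaced with this repository's id.

import torch
from transformers import AutoTokenizer, AutoModel

# Module (0): the BERT encoder; module (1): mean pooling over token embeddings.
tokenizer = AutoTokenizer.from_pretrained("sentence_transformers_model_id")  # placeholder id
model = AutoModel.from_pretrained("sentence_transformers_model_id")

sentences = ["Ben vatansızım.", "I am stateless."]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # torch.Size([2, 384])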

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Ben vatansızım.',
    'I am stateless.',
    'Kendi tekniğini tercih ediyor.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
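
The same embeddings can also drive semantic search, one of the use cases listed above. Below is a minimal sketch, assuming the same placeholder model id and a toy three-sentence corpus:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id, as above

# Toy corpus and query; the supported languages can be mixed freely.
corpus = ["I am stateless.", "Kendi tekniğini tercih ediyor.", "Were they in there?"]
query = "Ben vatansızım."

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Return the top-2 corpus entries ranked by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))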

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine 0.0376
spearman_cosine 0.048
pearson_manhattan 0.0347
spearman_manhattan 0.0377
pearson_euclidean 0.037
spearman_euclidean 0.039
pearson_dot 0.0674
spearman_dot 0.0682
pearson_max 0.0674
spearman_max 0.0682
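
These numbers come from sentence-transformers' semantic similarity evaluation on the tr_ling split. As a minimal sketch of how such metrics can be recomputed with EmbeddingSimilarityEvaluator (the sentence pairs and gold scores below are hypothetical stand-ins for the real evaluation data):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Hypothetical evaluation pairs with gold similarity scores in [0, 1].
sentences1 = ["Ben vatansızım.", "Orada mıydılar?", "He is not alone"]
sentences2 = ["I am stateless.", "Were they in there?", "Alerji tedavisi gelişiyor."]
gold_scores = [1.0, 0.9, 0.1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="tr_ling")
results = evaluator(model)
print(results)  # Pearson/Spearman correlations for cosine, Manhattan, Euclidean and dot similarities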

Training Details

Training Dataset

MoritzLaurer/multilingual-nli-26lang-2mil7

  • Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7 at 510a233
  • Size: 25,000 training samples
  • Columns: premise_original, hypothesis_original, score, sentence1, and sentence2
  • Approximate statistics based on the first 1000 samples:
    • premise_original: string; min 4 tokens, mean 29.3 tokens, max 107 tokens
    • hypothesis_original: string; min 4 tokens, mean 15.62 tokens, max 40 tokens
    • score: int; 0: ~34.50%, 1: ~33.30%, 2: ~32.20%
    • sentence1: string; min 4 tokens, mean 28.28 tokens, max 101 tokens
    • sentence2: string; min 4 tokens, mean 15.39 tokens, max 38 tokens
  • Samples:
    • premise_original: N, the total number of LC50 values used in calculating the CV(%) varied with organism and toxicant because some data were rejected due to water hardness, lack of concentration measurements, and/or because some of the LC50s were not calculable.
      hypothesis_original: Most discarded data was rejected due to water hardness.
      score: 1
      sentence1: N, CV'nin hesaplanmasında kullanılan LC50 değerlerinin toplam sayısı (%) organizma ve toksik madde ile çeşitlidir, çünkü bazı veriler su sertliği, konsantrasyon ölçümlerinin eksikliği ve / veya LC50'lerin bazıları hesaplanamaz olduğu için reddedilmiştir.
      sentence2: Atılan verilerin çoğu su sertliği nedeniyle reddedildi.
    • premise_original: As the home of the Venus de Milo and Mona Lisa, the Louvre drew almost unmanageable crowds until President Mitterrand ordered its re-organization in the 1980s.
      hypothesis_original: The Louvre is home of the Venus de Milo and Mona Lisa.
      score: 0
      sentence1: Venus de Milo ve Mona Lisa'nın evi olarak Louvre, Başkan Mitterrand'ın 1980'lerde yeniden düzenlenmesini emredene kadar neredeyse yönetilemez kalabalıklar çekti.
      sentence2: Louvre, Venus de Milo ve Mona Lisa'nın evidir.
    • premise_original: A year ago, the wife of the Oxford don noticed that the pattern on Kleenex quilted tissue uncannily resembled the Penrose Arrowed Rhombi tilings pattern, which Sir Roger had invented--and copyrighted--in 1974.
      hypothesis_original: It has been recently found out a similarity between the pattern on the recent Kleenex quilted tissue and the one of the Penrose Arrowed Rhombi tilings.
      score: 0
      sentence1: Bir yıl önce Oxford'un karısı, Kleenex kapitone dokudaki desenin 1974'te Sir Roger'ın icat ettiği -ve telif hakkı olan - Penrose Arrowed Rhombi tilings desenine benzediğini fark etti.
      sentence2: Yakın zamanda, son Kleenex kapitone dokudaki desen ile Penrose Arrowed Rhombi döşemelerinden biri arasında bir benzerlik bulunmuştur.
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
    

Evaluation Dataset

MoritzLaurer/multilingual-nli-26lang-2mil7

  • Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7 at 510a233
  • Size: 5,000 evaluation samples
  • Columns: premise_original, hypothesis_original, score, sentence1, and sentence2
  • Approximate statistics based on the first 1000 samples:
    • premise_original: string; min 5 tokens, mean 30.3 tokens, max 99 tokens
    • hypothesis_original: string; min 6 tokens, mean 15.11 tokens, max 56 tokens
    • score: int; 0: ~34.50%, 1: ~29.90%, 2: ~35.60%
    • sentence1: string; min 6 tokens, mean 29.94 tokens, max 106 tokens
    • sentence2: string; min 5 tokens, mean 15.29 tokens, max 52 tokens
  • Samples:
    • premise_original: But the racism charge isn't quirky or wacky--it's demagogy.
      hypothesis_original: The accusation of prejudice based on a pedestrian kind of hatred.
      score: 0
      sentence1: Ama ırkçılık suçlaması tuhaf ya da tuhaf değil, bu bir demagoji.
      sentence2: Yaya nefretine dayanan önyargı suçlaması.
    • premise_original: Why would Gates allow the publication of such a book with his byline and photo on the dust jacket?
      hypothesis_original: Gates' byline and photo are on the dust jacket
      score: 0
      sentence1: Gates neden böyle bir kitabın basılmasına izin versin ki?
      sentence2: Gates'in çizgisi ve fotoğrafı toz ceketin üzerinde.
    • premise_original: I am a nonsmoker and allergic to cigarette smoke.
      hypothesis_original: I do not smoke.
      score: 0
      sentence1: Sigara içmeyen biriyim ve sigara dumanına alerjim var.
      sentence2: Sigara içmiyorum.
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 64
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • ddp_find_unused_parameters: False
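
Under Sentence Transformers 3.0, these settings map directly onto SentenceTransformerTrainingArguments. The following is a minimal training sketch with the hyperparameters above and the CoSENTLoss configuration documented earlier; the two-row dataset is a toy stand-in for the 25,000 processed (sentence1, sentence2, score) training pairs, and save_strategy="epoch" is an assumption added so that load_best_model_at_end can compare per-epoch checkpoints:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Toy stand-in for the processed training/evaluation pairs described above.
train_dataset = Dataset.from_dict({
    "sentence1": ["Ben vatansızım.", "Orada mıydılar?"],
    "sentence2": ["I am stateless.", "Alerji tedavisi gelişiyor."],
    "score": [1.0, 0.0],
})
eval_dataset = train_dataset

# CoSENTLoss with scale=20.0; pairwise cosine similarity is its default similarity_fct.
loss = losses.CoSENTLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # as in the run above; requires a CUDA GPU
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed, to pair with load_best_model_at_end
    load_best_model_at_end=True,
    ddp_find_unused_parameters=False,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()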

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: False
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch  Step  Training Loss  Validation Loss  tr_ling_spearman_max
0.0320 25 17.17 - -
0.0639 50 16.4932 - -
0.0959 75 16.5976 - -
0.1279 100 15.6991 - -
0.1598 125 14.876 - -
0.1918 150 14.4828 - -
0.2238 175 12.7061 - -
0.2558 200 10.8687 - -
0.2877 225 8.3797 - -
0.3197 250 6.2029 - -
0.3517 275 5.8228 - -
0.3836 300 5.811 - -
0.4156 325 5.8079 - -
0.4476 350 5.8077 - -
0.4795 375 5.8035 - -
0.5115 400 5.8072 - -
0.5435 425 5.8033 - -
0.5754 450 5.8086 - -
0.6074 475 5.81 - -
0.6394 500 5.7949 - -
0.6714 525 5.8079 - -
0.7033 550 5.8057 - -
0.7353 575 5.8097 - -
0.7673 600 5.7986 - -
0.7992 625 5.8051 - -
0.8312 650 5.8041 - -
0.8632 675 5.7907 - -
0.8951 700 5.7991 - -
0.9271 725 5.8035 - -
0.9591 750 5.7945 - -
0.9910 775 5.8077 - -
1.0 782 - 5.8024 0.0330
1.0230 800 5.6703 - -
1.0550 825 5.8052 - -
1.0870 850 5.7936 - -
1.1189 875 5.7924 - -
1.1509 900 5.7806 - -
1.1829 925 5.7835 - -
1.2148 950 5.7619 - -
1.2468 975 5.8038 - -
1.2788 1000 5.779 - -
1.3107 1025 5.7904 - -
1.3427 1050 5.7696 - -
1.3747 1075 5.7919 - -
1.4066 1100 5.7785 - -
1.4386 1125 5.7862 - -
1.4706 1150 5.7703 - -
1.5026 1175 5.773 - -
1.5345 1200 5.7627 - -
1.5665 1225 5.7596 - -
1.5985 1250 5.7882 - -
1.6304 1275 5.7828 - -
1.6624 1300 5.771 - -
1.6944 1325 5.788 - -
1.7263 1350 5.7719 - -
1.7583 1375 5.7846 - -
1.7903 1400 5.7838 - -
1.8223 1425 5.7912 - -
1.8542 1450 5.7686 - -
1.8862 1475 5.7938 - -
1.9182 1500 5.7847 - -
1.9501 1525 5.7952 - -
1.9821 1550 5.7528 - -
2.0 1564 - 5.7933 0.0682

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.0
  • Transformers: 4.41.0
  • PyTorch: 2.3.0+cu121
  • Accelerate: 0.30.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}