---
language:
- multilingual
- zh
- ja
- ar
- ko
- de
- fr
- es
- pt
- hi
- id
- it
- tr
- ru
- bn
- ur
- mr
- ta
- vi
- fa
- pl
- uk
- nl
- sv
- he
- sw
- ps
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dataset_size:10K<n<100K
- loss:CoSENTLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
widget:
- source_sentence: Is that wrong?
  sentences:
  - Is that such a terrible thing?
  - Kennedy korkunç bir savcıydı.
  - Tom bir davada tanıklık ediyordu.
- source_sentence: Orada mıydılar?
  sentences:
  - Were they in there?
  - İlki ikincisini anlamlı kılar.
  - Alerji tedavisi gelişiyor.
- source_sentence: He is not alone
  sentences:
  - It is not confusing
  - The Hawks were humanitarians.
  - Tom bir davada tanıklık ediyordu.
- source_sentence: Yaptığın şey bu.
  sentences:
  - Onurlu işler yapıyorsunuz.
  - Weisberg azınlık adına konuştu.
  - Robert Ferrigno Kaliforniya'da doğdu.
- source_sentence: Ben vatansızım.
  sentences:
  - I am stateless.
  - Kendi tekniğini tercih ediyor.
  - Mermiler camdan fırladı.
pipeline_tag: sentence-similarity
model-index:
- name: SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: tr ling
      type: tr_ling
    metrics:
    - type: pearson_cosine
      value: 0.037604255015168134
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.04804112988506346
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.034740275152181296
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.03769766156967754
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.03698411306484619
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.03903062430281842
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.0673696846368413
      name: Pearson Dot
    - type: spearman_dot
      value: 0.06818119362900125
      name: Spearman Dot
    - type: pearson_max
      value: 0.0673696846368413
      name: Pearson Max
    - type: spearman_max
      value: 0.06818119362900125
      name: Spearman Max
---
# SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the MoritzLaurer/multilingual-nli-26lang-2mil7 dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details

### Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Maximum Sequence Length: 128 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7
- Languages: multilingual, zh, ja, ar, ko, de, fr, es, pt, hi, id, it, tr, ru, bn, ur, mr, ta, vi, fa, pl, uk, nl, sv, he, sw, ps
### Model Sources
- Documentation: [Sentence Transformers Documentation](https://sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
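For readers who want to see how this stack maps onto the library's module API, here is a minimal sketch (not taken from this repository's training code) that assembles the same Transformer-plus-mean-pooling architecture by hand:

```python
from sentence_transformers import SentenceTransformer, models

# Sketch: rebuild the two-module stack shown above by hand.
# The BERT encoder produces 384-dimensional word embeddings...
word_embedding_model = models.Transformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    max_seq_length=128,
)
# ...and mean pooling averages them into a single sentence embedding.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 384
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```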
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Ben vatansızım.',
    'I am stateless.',
    'Kendi tekniğini tercih ediyor.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
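The same embeddings also support retrieval-style use. As a hedged illustration (the model id is the same placeholder as above), `sentence_transformers.util.semantic_search` ranks a corpus against a query:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id, as above

# Embed a small corpus once, then search it with a query embedding.
corpus = [
    'I am stateless.',
    'Kendi tekniğini tercih ediyor.',
    'Mermiler camdan fırladı.',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode('Ben vatansızım.', convert_to_tensor=True)

# Returns one ranked hit list per query; each hit has 'corpus_id' and 'score'.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])
```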
## Evaluation

### Metrics

#### Semantic Similarity
- Dataset: `tr_ling`
- Evaluated with `EmbeddingSimilarityEvaluator`

| Metric | Value |
|---|---|
| pearson_cosine | 0.0376 |
| spearman_cosine | 0.048 |
| pearson_manhattan | 0.0347 |
| spearman_manhattan | 0.0377 |
| pearson_euclidean | 0.037 |
| spearman_euclidean | 0.039 |
| pearson_dot | 0.0674 |
| spearman_dot | 0.0682 |
| pearson_max | 0.0674 |
| spearman_max | 0.0682 |
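As a hedged sketch of how such numbers are produced (the sentence pairs and scores below are hypothetical stand-ins for the real `tr_ling` evaluation data), the evaluator takes parallel lists of sentence pairs with gold similarity scores:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Hypothetical example pairs; the real evaluation uses 5,000 tr_ling samples.
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=['Ben vatansızım.', 'Orada mıydılar?', 'He is not alone'],
    sentences2=['I am stateless.', 'Alerji tedavisi gelişiyor.', 'It is not confusing'],
    scores=[1.0, 0.0, 0.5],
    name="tr_ling",
)
results = evaluator(model)
print(results)  # Pearson/Spearman correlations per similarity function
```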
## Training Details

### Training Dataset

#### MoritzLaurer/multilingual-nli-26lang-2mil7
- Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7 at 510a233
- Size: 25,000 training samples
- Columns: `premise_original`, `hypothesis_original`, `score`, `sentence1`, and `sentence2`
- Approximate statistics based on the first 1000 samples:

| | premise_original | hypothesis_original | score | sentence1 | sentence2 |
|---|---|---|---|---|---|
| type | string | string | int | string | string |
| details | min: 4 tokens, mean: 29.3 tokens, max: 107 tokens | min: 4 tokens, mean: 15.62 tokens, max: 40 tokens | 0: ~34.50%, 1: ~33.30%, 2: ~32.20% | min: 4 tokens, mean: 28.28 tokens, max: 101 tokens | min: 4 tokens, mean: 15.39 tokens, max: 38 tokens |
- Samples:

| premise_original | hypothesis_original | score | sentence1 | sentence2 |
|---|---|---|---|---|
| N, the total number of LC50 values used in calculating the CV(%) varied with organism and toxicant because some data were rejected due to water hardness, lack of concentration measurements, and/or because some of the LC50s were not calculable. | Most discarded data was rejected due to water hardness. | 1 | N, CV'nin hesaplanmasında kullanılan LC50 değerlerinin toplam sayısı (%) organizma ve toksik madde ile çeşitlidir, çünkü bazı veriler su sertliği, konsantrasyon ölçümlerinin eksikliği ve / veya LC50'lerin bazıları hesaplanamaz olduğu için reddedilmiştir. | Atılan verilerin çoğu su sertliği nedeniyle reddedildi. |
| As the home of the Venus de Milo and Mona Lisa, the Louvre drew almost unmanageable crowds until President Mitterrand ordered its re-organization in the 1980s. | The Louvre is home of the Venus de Milo and Mona Lisa. | 0 | Venus de Milo ve Mona Lisa'nın evi olarak Louvre, Başkan Mitterrand'ın 1980'lerde yeniden düzenlenmesini emredene kadar neredeyse yönetilemez kalabalıklar çekti. | Louvre, Venus de Milo ve Mona Lisa'nın evidir. |
| A year ago, the wife of the Oxford don noticed that the pattern on Kleenex quilted tissue uncannily resembled the Penrose Arrowed Rhombi tilings pattern, which Sir Roger had invented--and copyrighted--in 1974. | It has been recently found out a similarity between the pattern on the recent Kleenex quilted tissue and the one of the Penrose Arrowed Rhombi tilings. | 0 | Bir yıl önce Oxford'un karısı, Kleenex kapitone dokudaki desenin 1974'te Sir Roger'ın icat ettiği -ve telif hakkı olan - Penrose Arrowed Rhombi tilings desenine benzediğini fark etti. | Yakın zamanda, son Kleenex kapitone dokudaki desen ile Penrose Arrowed Rhombi döşemelerinden biri arasında bir benzerlik bulunmuştur. |
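To inspect these pairs yourself, the dataset can be loaded from the Hub. A minimal sketch; which split carries the Turkish subset used here is an assumption to verify against the dataset card:

```python
from datasets import load_dataset

# Load the training dataset at the revision referenced above, then list
# its splits and columns; confirm the Turkish split name on the dataset card.
dataset = load_dataset(
    "MoritzLaurer/multilingual-nli-26lang-2mil7",
    revision="510a233",
)
print(dataset)
```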
- Loss: `CoSENTLoss` with these parameters:

  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```
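In code, instantiating the loss with these parameters looks roughly like the following sketch (`pairwise_cos_sim` is the library default, so only the scale needs passing):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# CoSENTLoss ranks pairs by cosine similarity scaled by 20.0; it expects
# (sentence1, sentence2) inputs with a float similarity score as the label.
loss = CoSENTLoss(model, scale=20.0)
```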
### Evaluation Dataset

#### MoritzLaurer/multilingual-nli-26lang-2mil7
- Dataset: MoritzLaurer/multilingual-nli-26lang-2mil7 at 510a233
- Size: 5,000 evaluation samples
- Columns: `premise_original`, `hypothesis_original`, `score`, `sentence1`, and `sentence2`
- Approximate statistics based on the first 1000 samples:

| | premise_original | hypothesis_original | score | sentence1 | sentence2 |
|---|---|---|---|---|---|
| type | string | string | int | string | string |
| details | min: 5 tokens, mean: 30.3 tokens, max: 99 tokens | min: 6 tokens, mean: 15.11 tokens, max: 56 tokens | 0: ~34.50%, 1: ~29.90%, 2: ~35.60% | min: 6 tokens, mean: 29.94 tokens, max: 106 tokens | min: 5 tokens, mean: 15.29 tokens, max: 52 tokens |
- Samples:

| premise_original | hypothesis_original | score | sentence1 | sentence2 |
|---|---|---|---|---|
| But the racism charge isn't quirky or wacky--it's demagogy. | The accusation of prejudice based on a pedestrian kind of hatred. | 0 | Ama ırkçılık suçlaması tuhaf ya da tuhaf değil, bu bir demagoji. | Yaya nefretine dayanan önyargı suçlaması. |
| Why would Gates allow the publication of such a book with his byline and photo on the dust jacket? | Gates' byline and photo are on the dust jacket | 0 | Gates neden böyle bir kitabın basılmasına izin versin ki? | Gates'in çizgisi ve fotoğrafı toz ceketin üzerinde. |
| I am a nonsmoker and allergic to cigarette smoke. | I do not smoke. | 0 | Sigara içmeyen biriyim ve sigara dumanına alerjim var. | Sigara içmiyorum. |
- Loss: `CoSENTLoss` with these parameters:

  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```
### Training Hyperparameters

#### Non-Default Hyperparameters
- `eval_strategy`: epoch
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 64
- `learning_rate`: 2e-05
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `fp16`: True
- `load_best_model_at_end`: True
- `ddp_find_unused_parameters`: False
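Wired into the v3 trainer API, these settings look roughly like the sketch below. The output path and the toy dataset are hypothetical, and `save_strategy="epoch"` is an added assumption (`load_best_model_at_end` requires the save and eval strategies to match):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loss = CoSENTLoss(model, scale=20.0)

# Toy stand-in for the 25,000-sample training split described above.
train_dataset = Dataset.from_dict({
    "sentence1": ["Ben vatansızım.", "Orada mıydılar?"],
    "sentence2": ["I am stateless.", "Were they in there?"],
    "score": [1.0, 1.0],
})

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # hypothetical path
    eval_strategy="epoch",
    save_strategy="epoch",  # assumption: needed by load_best_model_at_end
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,
    fp16=True,  # mirrors the card; requires a GPU
    load_best_model_at_end=True,
    ddp_find_unused_parameters=False,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # stand-in; the card uses a 5,000-sample split
    loss=loss,
)
trainer.train()
```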
#### All Hyperparameters

<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 64
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 5
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: False
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

</details>
### Training Logs

Epoch | Step | Training Loss | loss | tr_ling_spearman_max |
---|---|---|---|---|
0.0320 | 25 | 17.17 | - | - |
0.0639 | 50 | 16.4932 | - | - |
0.0959 | 75 | 16.5976 | - | - |
0.1279 | 100 | 15.6991 | - | - |
0.1598 | 125 | 14.876 | - | - |
0.1918 | 150 | 14.4828 | - | - |
0.2238 | 175 | 12.7061 | - | - |
0.2558 | 200 | 10.8687 | - | - |
0.2877 | 225 | 8.3797 | - | - |
0.3197 | 250 | 6.2029 | - | - |
0.3517 | 275 | 5.8228 | - | - |
0.3836 | 300 | 5.811 | - | - |
0.4156 | 325 | 5.8079 | - | - |
0.4476 | 350 | 5.8077 | - | - |
0.4795 | 375 | 5.8035 | - | - |
0.5115 | 400 | 5.8072 | - | - |
0.5435 | 425 | 5.8033 | - | - |
0.5754 | 450 | 5.8086 | - | - |
0.6074 | 475 | 5.81 | - | - |
0.6394 | 500 | 5.7949 | - | - |
0.6714 | 525 | 5.8079 | - | - |
0.7033 | 550 | 5.8057 | - | - |
0.7353 | 575 | 5.8097 | - | - |
0.7673 | 600 | 5.7986 | - | - |
0.7992 | 625 | 5.8051 | - | - |
0.8312 | 650 | 5.8041 | - | - |
0.8632 | 675 | 5.7907 | - | - |
0.8951 | 700 | 5.7991 | - | - |
0.9271 | 725 | 5.8035 | - | - |
0.9591 | 750 | 5.7945 | - | - |
0.9910 | 775 | 5.8077 | - | - |
1.0 | 782 | - | 5.8024 | 0.0330 |
1.0230 | 800 | 5.6703 | - | - |
1.0550 | 825 | 5.8052 | - | - |
1.0870 | 850 | 5.7936 | - | - |
1.1189 | 875 | 5.7924 | - | - |
1.1509 | 900 | 5.7806 | - | - |
1.1829 | 925 | 5.7835 | - | - |
1.2148 | 950 | 5.7619 | - | - |
1.2468 | 975 | 5.8038 | - | - |
1.2788 | 1000 | 5.779 | - | - |
1.3107 | 1025 | 5.7904 | - | - |
1.3427 | 1050 | 5.7696 | - | - |
1.3747 | 1075 | 5.7919 | - | - |
1.4066 | 1100 | 5.7785 | - | - |
1.4386 | 1125 | 5.7862 | - | - |
1.4706 | 1150 | 5.7703 | - | - |
1.5026 | 1175 | 5.773 | - | - |
1.5345 | 1200 | 5.7627 | - | - |
1.5665 | 1225 | 5.7596 | - | - |
1.5985 | 1250 | 5.7882 | - | - |
1.6304 | 1275 | 5.7828 | - | - |
1.6624 | 1300 | 5.771 | - | - |
1.6944 | 1325 | 5.788 | - | - |
1.7263 | 1350 | 5.7719 | - | - |
1.7583 | 1375 | 5.7846 | - | - |
1.7903 | 1400 | 5.7838 | - | - |
1.8223 | 1425 | 5.7912 | - | - |
1.8542 | 1450 | 5.7686 | - | - |
1.8862 | 1475 | 5.7938 | - | - |
1.9182 | 1500 | 5.7847 | - | - |
1.9501 | 1525 | 5.7952 | - | - |
1.9821 | 1550 | 5.7528 | - | - |
2.0 | 1564 | - | 5.7933 | 0.0682 |
### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.0
- Transformers: 4.41.0
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1
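To approximate this environment, a pinned install along these lines should work (the PyTorch CUDA 12.1 wheel index is an assumption about the original setup):

```bash
pip install sentence-transformers==3.0.0 transformers==4.41.0 accelerate==0.30.1 datasets==2.19.1 tokenizers==0.19.1
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
```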
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### CoSENTLoss

```bibtex
@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}
```