metadata
language: []
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dataset_size:10K<n<100K
- loss:CosineSimilarityLoss
base_model: distilbert/distilbert-base-uncased
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
widget:
- source_sentence: The long jump pit had to be raked after every few attempts.
sentences:
- The high jumper cleared the bar on his first attempt.
- >-
Chemists use quantum mechanics to predict electron behavior and
molecular bonding.
- >-
Eczema frequently appears as inflamed, tender spots on several parts of
the body.
- source_sentence: Street art transforms empty rural barns into lively murals.
sentences:
- >-
Traditional folk music plays a significant role in preserving a
community's history.
- >-
[SYNTAX] The saxophone offers the high-pitched, thrilling elements in a
jazz trio.
- Atmospheric pressure decreases as you move higher above sea level.
- source_sentence: Proteins are synthesized through the process of translation.
sentences:
- >-
Molecular genetics studies the structure and function of genes at a
molecular level.
- >-
The mathematics lecture is a compelling method for introducing integral
equations.
- >-
The correlation between air pollution and increased mortality rates is
well-documented.
- source_sentence: '[SYNTAX] A barometer is used to measure atmospheric pressure.'
sentences:
- >-
[SYNTAX] Colonialism is a primary subject in several political science
research papers.
- >-
[SYNTAX] Ordinary urban walls are turned into vibrant masterpieces by
street art.
- >-
Email remains a significant device for academic and fictional
correspondence.
- source_sentence: Salinity gradients in oceans affect local wildlife habitats.
sentences:
- >-
The distribution of wildlife in different habitats has fascinated
ecologists for decades.
- >-
[SYNTAX] Bioenergy plants can convert agricultural waste into valuable
electricity.
- Proper management of irrigation schedules is crucial for crop health.
pipeline_tag: sentence-similarity
model-index:
- name: SentenceTransformer based on distilbert/distilbert-base-uncased
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: custom dev
type: custom-dev
metrics:
- type: pearson_cosine
value: 0.9117000984572255
name: Pearson Cosine
- type: spearman_cosine
value: 0.8442193394453843
name: Spearman Cosine
- type: pearson_manhattan
value: 0.9156511082976959
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.8440889792296263
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.9159884478218315
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.8445673615230997
name: Spearman Euclidean
- type: pearson_dot
value: 0.9046139794819923
name: Pearson Dot
- type: spearman_dot
value: 0.8327655787489855
name: Spearman Dot
- type: pearson_max
value: 0.9159884478218315
name: Pearson Max
- type: spearman_max
value: 0.8445673615230997
name: Spearman Max
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: custom test
type: custom-test
metrics:
- type: pearson_cosine
value: 0.919801732989496
name: Pearson Cosine
- type: spearman_cosine
value: 0.8500534773438543
name: Spearman Cosine
- type: pearson_manhattan
value: 0.9282084953416339
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.8493690342081703
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.9284184436823353
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.849759760833697
name: Spearman Euclidean
- type: pearson_dot
value: 0.9141474471982576
name: Pearson Dot
- type: spearman_dot
value: 0.8410969822964006
name: Spearman Dot
- type: pearson_max
value: 0.9284184436823353
name: Pearson Max
- type: spearman_max
value: 0.8500534773438543
name: Spearman Max
SentenceTransformer based on distilbert/distilbert-base-uncased
This is a sentence-transformers model fine-tuned from distilbert/distilbert-base-uncased. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: distilbert/distilbert-base-uncased
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
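The Pooling module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of the token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask; `mean_pool` and the toy 3-dimensional vectors are illustrative names and values, not part of the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # skip padding positions
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in summed]


# Toy input: three real tokens and one padding token,
# using 3 dimensions instead of the model's 768.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

The padding vector `[9.0, 9.0, 9.0]` does not affect the result because its mask entry is 0.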
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'Salinity gradients in oceans affect local wildlife habitats.',
'The distribution of wildlife in different habitats has fascinated ecologists for decades.',
'[SYNTAX] Bioenergy plants can convert agricultural waste into valuable electricity.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
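Because the model's similarity function is cosine similarity, each score is the cosine of the angle between two embeddings. The sketch below shows that computation in pure Python on toy 2-dimensional vectors; `cosine_similarity` and `similarity_matrix` are hypothetical helper names used for illustration and are not the library's implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity of every vector with every other vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy embeddings standing in for the 768-dimensional model output.
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
scores = similarity_matrix(toy_embeddings)
print(len(scores), len(scores[0]))  # 3 3
```

The matrix is symmetric, and its diagonal is 1 up to floating-point rounding because every vector is maximally similar to itself.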
Evaluation
Metrics
Semantic Similarity
- Dataset: custom-dev
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9117 |
spearman_cosine | 0.8442 |
pearson_manhattan | 0.9157 |
spearman_manhattan | 0.8441 |
pearson_euclidean | 0.916 |
spearman_euclidean | 0.8446 |
pearson_dot | 0.9046 |
spearman_dot | 0.8328 |
pearson_max | 0.916 |
spearman_max | 0.8446 |
Semantic Similarity
- Dataset: custom-test
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9198 |
spearman_cosine | 0.8501 |
pearson_manhattan | 0.9282 |
spearman_manhattan | 0.8494 |
pearson_euclidean | 0.9284 |
spearman_euclidean | 0.8498 |
pearson_dot | 0.9141 |
spearman_dot | 0.8411 |
pearson_max | 0.9284 |
spearman_max | 0.8501 |
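The Pearson and Spearman metrics above compare predicted similarity scores with gold labels. The following is a pure-Python sketch of both correlations, assuming Spearman is computed as the Pearson correlation of tie-averaged ranks; the function names and the four predicted scores are hypothetical and are not taken from the evaluation data.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine scores and binary gold labels for four sentence pairs.
predicted = [0.91, 0.15, 0.78, 0.30]
gold = [1, 0, 1, 0]
print(round(spearman(predicted, gold), 4))  # 0.8944
```

With binary gold labels, many gold ranks are tied, so Spearman cannot reach 1.0 even when every positive pair scores above every negative pair. For balanced labels the ceiling approaches roughly 0.87 as the sample grows, which may help put the reported Spearman values near 0.85 in context if the evaluation labels are binary and balanced like the training statistics suggest.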
Training Details
Training Dataset
Unnamed Dataset
- Size: 19,352 training samples
- Columns: s1, s2, and label
- Approximate statistics based on the first 1000 samples:
Statistic | s1 | s2 | label |
---|---|---|---|
type | string | string | int |
details | min: 10 tokens, mean: 19.92 tokens, max: 42 tokens | min: 10 tokens, mean: 20.53 tokens, max: 42 tokens | 0: ~50.50%, 1: ~49.50% |
- Samples:
s1 | s2 | label |
---|---|---|
According to labeling theory, individuals are considered deviant once society has tagged them with that label. | Labeling theory posits that corporations become powerful when labeled as such by stakeholders. | 0 |
Employers must classify workers correctly as either employees or independent contractors to comply with tax and labor laws. | Employers must classify workers correctly as either employees or independent contractors to comply with tax and labor laws. | 1 |
Higher education institutions play a critical role in advancing research and innovation. | Advancement in research and innovation is significantly driven by the contributions of higher education institutions. | 1 |
- Loss: CosineSimilarityLoss with these parameters: { "loss_fct": "torch.nn.modules.loss.MSELoss" }
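CosineSimilarityLoss with an MSE loss function penalizes the squared difference between the cosine similarity of the two sentence embeddings and the gold label. The following is a minimal pure-Python sketch of that objective on toy 2-dimensional embeddings; `cosine_similarity_loss` is a hypothetical helper written for illustration, not the library class.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(pairs: List[Pair], labels: Sequence[float]) -> float:
    """Mean squared error between cosine(u, v) and the label of each pair."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Toy batch: one identical pair (label 1) and one orthogonal pair (label 0).
batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]
print(cosine_similarity_loss(batch, labels))  # 0.0
```

Swapping the two labels makes both predictions maximally wrong for this batch and gives a loss of 1.0.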
Evaluation Dataset
Unnamed Dataset
- Size: 2,419 evaluation samples
- Columns: s1, s2, and label
- Approximate statistics based on the first 1000 samples:
Statistic | s1 | s2 | label |
---|---|---|---|
type | string | string | int |
details | min: 11 tokens, mean: 19.91 tokens, max: 37 tokens | min: 11 tokens, mean: 20.46 tokens, max: 42 tokens | 0: ~49.70%, 1: ~50.30% |
- Samples:
s1 | s2 | label |
---|---|---|
Acoustic tomography is an innovative geophysical technique used to image the Earth's interior. | Acoustic tomography is an innovative geophysical technique used to image the Earth's interior. | 1 |
Urban areas frequently exhibit a different age distribution pattern compared to rural areas. | Urban areas frequently exhibit a different age distribution pattern compared to rural areas. | 1 |
Radiocarbon dating is a critical tool for assessing the duration of battery life in modern electronic devices. | Radiocarbon dating is a critical tool for assessing the duration of battery life in modern electronic devices. | 1 |
- Loss: CosineSimilarityLoss with these parameters: { "loss_fct": "torch.nn.modules.loss.MSELoss" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 10
- warmup_ratio: 0.1
- fp16: True
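To make the effect of `warmup_ratio: 0.1` concrete, here is a small sketch of a linear warmup followed by linear decay. It assumes the schedule behaves like the standard linear scheduler, uses the learning rate of 5e-05 listed under All Hyperparameters, and takes the total of 3030 steps from the last row of the training logs; `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
import math


def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 5e-05,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 3030  # final step reported in the training logs
for step in (0, 100, 303, 1500, 3030):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

Under these assumptions the warmup covers the first 303 steps, which is about one epoch of this run, and the learning rate reaches zero at step 3030.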
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | custom-dev_spearman_cosine | custom-test_spearman_cosine |
---|---|---|---|---|---|
0.3300 | 100 | 0.2961 | 0.1185 | 0.8063 | - |
0.6601 | 200 | 0.0772 | 0.0504 | 0.8461 | - |
0.9901 | 300 | 0.0502 | 0.0454 | 0.8486 | - |
1.3201 | 400 | 0.0376 | 0.0402 | 0.8481 | - |
1.6502 | 500 | 0.0344 | 0.0400 | 0.8501 | - |
1.9802 | 600 | 0.0329 | 0.0390 | 0.8518 | - |
2.3102 | 700 | 0.0185 | 0.0387 | 0.8496 | - |
2.6403 | 800 | 0.0164 | 0.0371 | 0.8492 | - |
2.9703 | 900 | 0.0179 | 0.0393 | 0.8428 | - |
3.3003 | 1000 | 0.0099 | 0.0389 | 0.8466 | - |
3.6304 | 1100 | 0.0092 | 0.0395 | 0.8480 | - |
3.9604 | 1200 | 0.0101 | 0.0368 | 0.8492 | - |
4.2904 | 1300 | 0.0067 | 0.0385 | 0.8474 | - |
4.6205 | 1400 | 0.0056 | 0.0393 | 0.8456 | - |
4.9505 | 1500 | 0.0068 | 0.0401 | 0.8466 | - |
5.2805 | 1600 | 0.0041 | 0.0410 | 0.8462 | - |
5.6106 | 1700 | 0.0043 | 0.0399 | 0.8469 | - |
5.9406 | 1800 | 0.0039 | 0.0406 | 0.8463 | - |
6.2706 | 1900 | 0.003 | 0.0400 | 0.8456 | - |
6.6007 | 2000 | 0.0026 | 0.0416 | 0.8438 | - |
6.9307 | 2100 | 0.0027 | 0.0420 | 0.8437 | - |
7.2607 | 2200 | 0.0028 | 0.0424 | 0.8449 | - |
7.5908 | 2300 | 0.0021 | 0.0422 | 0.8458 | - |
7.9208 | 2400 | 0.002 | 0.0414 | 0.8451 | - |
8.2508 | 2500 | 0.0015 | 0.0421 | 0.8451 | - |
8.5809 | 2600 | 0.0015 | 0.0427 | 0.8451 | - |
8.9109 | 2700 | 0.0016 | 0.0429 | 0.8444 | - |
9.2409 | 2800 | 0.0011 | 0.0432 | 0.8442 | - |
9.5710 | 2900 | 0.0014 | 0.0432 | 0.8444 | - |
9.9010 | 3000 | 0.0011 | 0.0432 | 0.8442 | - |
10.0 | 3030 | - | - | - | 0.8501 |
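The logs above can be scanned for the checkpoint with the lowest evaluation loss or the highest dev Spearman score. The sketch below does this for a handful of rows copied from the table, reading the fourth column as the evaluation loss; `LogRow` is a hypothetical container introduced only for this example.

```python
from typing import NamedTuple


class LogRow(NamedTuple):
    step: int
    train_loss: float
    eval_loss: float
    dev_spearman: float


# A subset of rows copied from the training logs table.
rows = [
    LogRow(300, 0.0502, 0.0454, 0.8486),
    LogRow(600, 0.0329, 0.0390, 0.8518),
    LogRow(1200, 0.0101, 0.0368, 0.8492),
    LogRow(2100, 0.0027, 0.0420, 0.8437),
    LogRow(3000, 0.0011, 0.0432, 0.8442),
]

best_by_loss = min(rows, key=lambda r: r.eval_loss)
best_by_spearman = max(rows, key=lambda r: r.dev_spearman)
print(best_by_loss.step, best_by_spearman.step)  # 1200 600
```

In these rows the training loss keeps falling while the evaluation loss rises after step 1200, which may indicate overfitting in the later epochs. Since `load_best_model_at_end` is False in this run, the final weights presumably correspond to the last training step rather than to either of these checkpoints.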
Framework Versions
- Python: 3.11.9
- Sentence Transformers: 3.0.0
- Transformers: 4.41.2
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}