metadata

language: []
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dataset_size:10K<n<100K
  - loss:CosineSimilarityLoss
base_model: distilbert/distilbert-base-uncased
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
widget:
  - source_sentence: The long jump pit had to be raked after every few attempts.
    sentences:
      - The high jumper cleared the bar on his first attempt.
      - >-
        Chemists use quantum mechanics to predict electron behavior and
        molecular bonding.
      - >-
        Eczema frequently appears as inflamed, tender spots on several parts of
        the body.
  - source_sentence: Street art transforms empty rural barns into lively murals.
    sentences:
      - >-
        Traditional folk music plays a significant role in preserving a
        community's history.
      - >-
        [SYNTAX] The saxophone offers the high-pitched, thrilling elements in a
        jazz trio.
      - Atmospheric pressure decreases as you move higher above sea level.
  - source_sentence: Proteins are synthesized through the process of translation.
    sentences:
      - >-
        Molecular genetics studies the structure and function of genes at a
        molecular level.
      - >-
        The mathematics lecture is a compelling method for introducing integral
        equations.
      - >-
        The correlation between air pollution and increased mortality rates is
        well-documented.  
  - source_sentence: '[SYNTAX] A barometer is used to measure atmospheric pressure.'
    sentences:
      - >-
        [SYNTAX] Colonialism is a primary subject in several political science
        research papers.
      - >-
        [SYNTAX] Ordinary urban walls are turned into vibrant masterpieces by
        street art.
      - >-
        Email remains a significant device for academic and fictional
        correspondence.
  - source_sentence: Salinity gradients in oceans affect local wildlife habitats.
    sentences:
      - >-
        The distribution of wildlife in different habitats has fascinated
        ecologists for decades.
      - >-
        [SYNTAX] Bioenergy plants can convert agricultural waste into valuable
        electricity.
      - Proper management of irrigation schedules is crucial for crop health.
pipeline_tag: sentence-similarity
model-index:
  - name: SentenceTransformer based on distilbert/distilbert-base-uncased
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: custom dev
          type: custom-dev
        metrics:
          - type: pearson_cosine
            value: 0.9117000984572255
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8442193394453843
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.9156511082976959
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8440889792296263
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.9159884478218315
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8445673615230997
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.9046139794819923
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8327655787489855
            name: Spearman Dot
          - type: pearson_max
            value: 0.9159884478218315
            name: Pearson Max
          - type: spearman_max
            value: 0.8445673615230997
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: custom test
          type: custom-test
        metrics:
          - type: pearson_cosine
            value: 0.919801732989496
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8500534773438543
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.9282084953416339
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8493690342081703
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.9284184436823353
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.849759760833697
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.9141474471982576
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8410969822964006
            name: Spearman Dot
          - type: pearson_max
            value: 0.9284184436823353
            name: Pearson Max
          - type: spearman_max
            value: 0.8500534773438543
            name: Spearman Max

SentenceTransformer based on distilbert/distilbert-base-uncased

This is a sentence-transformers model fine-tuned from distilbert/distilbert-base-uncased. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: distilbert/distilbert-base-uncased
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
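
The Pooling module above has `pooling_mode_mean_tokens` set to True, so each sentence embedding is the average of its token embeddings, counting only non-padding positions. The following is a minimal plain-Python sketch of that idea, not the library's implementation; the function name `mean_pool` and the toy 3-dimensional vectors are illustrative assumptions.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(sum(attention_mask), 1)
    return [s / count for s in summed]


# Toy input: two real tokens and one padding token, 3 dimensions instead of 768.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0, 4.0]
```

The padding vector is excluded from both the sum and the count, which is why it has no effect on the result.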

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder below with this model's Hub ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Salinity gradients in oceans affect local wildlife habitats.',
    'The distribution of wildlife in different habitats has fascinated ecologists for decades.',
    '[SYNTAX] Bioenergy plants can convert agricultural waste into valuable electricity.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
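
Because this card lists cosine similarity as the similarity function, the pairwise scores above should correspond to cosine values between embedding vectors. The sketch below shows that computation in plain Python under this assumption; `cosine_similarity` and `similarity_matrix` are hypothetical helper names, and the 2-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """All-pairs cosine similarity, giving an n x n matrix for n embeddings."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy embeddings; vector length does not affect the score, only direction does.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(value, 4) for value in row])
```

The diagonal is 1 because each vector is compared with itself, and the matrix is symmetric.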

Evaluation

Metrics

Semantic Similarity

Dataset: custom-dev
Evaluated with EmbeddingSimilarityEvaluator

Metric	Value
pearson_cosine	0.9117
spearman_cosine	0.8442
pearson_manhattan	0.9157
spearman_manhattan	0.8441
pearson_euclidean	0.9160
spearman_euclidean	0.8446
pearson_dot	0.9046
spearman_dot	0.8328
pearson_max	0.9160
spearman_max	0.8446

Semantic Similarity

Dataset: custom-test
Evaluated with EmbeddingSimilarityEvaluator

Metric	Value
pearson_cosine	0.9198
spearman_cosine	0.8501
pearson_manhattan	0.9282
spearman_manhattan	0.8494
pearson_euclidean	0.9284
spearman_euclidean	0.8498
pearson_dot	0.9141
spearman_dot	0.8411
pearson_max	0.9284
spearman_max	0.8501
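
The tables above report Pearson and Spearman correlations between the model's similarity scores and the gold labels. As a rough illustration of how the two differ, the sketch below computes both in plain Python; Spearman is taken as Pearson applied to ranks, with tied values sharing their average rank. The scores and labels are made-up toy values, and the function names are assumptions for this example rather than the evaluator's API.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine scores for four sentence pairs and their 0/1 labels.
scores = [0.91, 0.15, 0.78, 0.30]
labels = [1, 0, 1, 0]
print(round(pearson(scores, labels), 4), round(spearman(scores, labels), 4))
```

In this toy case every positive pair outscores every negative pair, yet Spearman stays below 1 because the binary labels produce tied ranks. That effect may partly explain why Spearman sits below Pearson in the tables.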

Training Details

Training Dataset

Unnamed Dataset

Size: 19,352 training samples
Columns: s1, s2, and label
Approximate statistics based on the first 1000 samples:

	s1	s2	label
type	string	string	int
details	min: 10 tokens mean: 19.92 tokens max: 42 tokens	min: 10 tokens mean: 20.53 tokens max: 42 tokens	0: ~50.50% 1: ~49.50%

Samples:

s1	s2	label
`According to labeling theory, individuals are considered deviant once society has tagged them with that label.`	`Labeling theory posits that corporations become powerful when labeled as such by stakeholders.`	`0`
`Employers must classify workers correctly as either employees or independent contractors to comply with tax and labor laws.`	`Employers must classify workers correctly as either employees or independent contractors to comply with tax and labor laws.`	`1`
`Higher education institutions play a critical role in advancing research and innovation.`	`Advancement in research and innovation is significantly driven by the contributions of higher education institutions.`	`1`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}
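
With `loss_fct` set to MSELoss, CosineSimilarityLoss compares the cosine similarity of each embedding pair against its label and averages the squared differences. The sketch below works through that arithmetic in plain Python on toy 2-dimensional vectors; it is an illustration of the formula, not the library's code, and the helper names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between predicted cosine scores and gold labels."""
    predictions = [cosine(u, v) for u, v in pairs]
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Toy batch: an identical pair (label 1), an orthogonal pair (label 0),
# and a 45-degree pair wrongly close for a label of 0.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 1.0]),
]
labels = [1.0, 0.0, 0.0]
loss = cosine_similarity_loss(pairs, labels)
print(round(loss, 4))
```

Only the third pair contributes to the loss here: its cosine is about 0.707 against a label of 0, so its squared error is about 0.5 and the batch mean is about 0.1667.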

Evaluation Dataset

Unnamed Dataset

Size: 2,419 evaluation samples
Columns: s1, s2, and label
Approximate statistics based on the first 1000 samples:

	s1	s2	label
type	string	string	int
details	min: 11 tokens mean: 19.91 tokens max: 37 tokens	min: 11 tokens mean: 20.46 tokens max: 42 tokens	0: ~49.70% 1: ~50.30%

Samples:

s1	s2	label
`Acoustic tomography is an innovative geophysical technique used to image the Earth's interior.`	`Acoustic tomography is an innovative geophysical technique used to image the Earth's interior.`	`1`
`Urban areas frequently exhibit a different age distribution pattern compared to rural areas.`	`Urban areas frequently exhibit a different age distribution pattern compared to rural areas.`	`1`
`Radiocarbon dating is a critical tool for assessing the duration of battery life in modern electronic devices.`	`Radiocarbon dating is a critical tool for assessing the duration of battery life in modern electronic devices.`	`1`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 10
warmup_ratio: 0.1
fp16: True
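
To make the effect of `warmup_ratio: 0.1` more concrete, the sketch below computes a linear warmup followed by linear decay in plain Python. It assumes the linear scheduler and the 5e-05 base learning rate listed under All Hyperparameters, and the 3030 total steps shown in the Training Logs; the function name `linear_schedule_lr` is hypothetical, and the rounding of the warmup length is an assumption of this sketch.

```python
import math


def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    # Assumption: the warmup length is the ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 3030  # total optimizer steps reported in the Training Logs
for step in (0, 100, 303, 1500, 3030):
    print(step, f"{linear_schedule_lr(step, total):.2e}")
```

Under these assumptions the rate climbs for roughly the first 303 steps, peaks near 5e-05, and then falls linearly to zero at step 3030.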

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 10
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	custom-dev_spearman_cosine	custom-test_spearman_cosine
0.3300	100	0.2961	0.1185	0.8063	-
0.6601	200	0.0772	0.0504	0.8461	-
0.9901	300	0.0502	0.0454	0.8486	-
1.3201	400	0.0376	0.0402	0.8481	-
1.6502	500	0.0344	0.0400	0.8501	-
1.9802	600	0.0329	0.0390	0.8518	-
2.3102	700	0.0185	0.0387	0.8496	-
2.6403	800	0.0164	0.0371	0.8492	-
2.9703	900	0.0179	0.0393	0.8428	-
3.3003	1000	0.0099	0.0389	0.8466	-
3.6304	1100	0.0092	0.0395	0.8480	-
3.9604	1200	0.0101	0.0368	0.8492	-
4.2904	1300	0.0067	0.0385	0.8474	-
4.6205	1400	0.0056	0.0393	0.8456	-
4.9505	1500	0.0068	0.0401	0.8466	-
5.2805	1600	0.0041	0.0410	0.8462	-
5.6106	1700	0.0043	0.0399	0.8469	-
5.9406	1800	0.0039	0.0406	0.8463	-
6.2706	1900	0.0030	0.0400	0.8456	-
6.6007	2000	0.0026	0.0416	0.8438	-
6.9307	2100	0.0027	0.0420	0.8437	-
7.2607	2200	0.0028	0.0424	0.8449	-
7.5908	2300	0.0021	0.0422	0.8458	-
7.9208	2400	0.0020	0.0414	0.8451	-
8.2508	2500	0.0015	0.0421	0.8451	-
8.5809	2600	0.0015	0.0427	0.8451	-
8.9109	2700	0.0016	0.0429	0.8444	-
9.2409	2800	0.0011	0.0432	0.8442	-
9.5710	2900	0.0014	0.0432	0.8444	-
9.9010	3000	0.0011	0.0432	0.8442	-
10.0	3030	-	-	-	0.8501

Framework Versions

Python: 3.11.9
Sentence Transformers: 3.0.0
Transformers: 4.41.2
PyTorch: 2.3.0+cu121
Accelerate: 0.30.1
Datasets: 2.19.1
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}