SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-small-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
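
Read top to bottom, the three modules run BERT, keep the hidden state of the first ([CLS]) token, and scale that vector to unit length. The dependency-free sketch below illustrates the last two steps on toy numbers. The names `cls_pool` and `l2_normalize` are illustrative and not part of the Sentence Transformers API, and the 3-dimensional vectors stand in for the real 384-dimensional ones.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged to avoid dividing by zero
    return [x / norm for x in vector]

# Toy token embeddings for one sentence: 3 tokens, 3 dimensions each
token_embeddings = [
    [3.0, 4.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0],
    [0.5, 0.2, 0.1],
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output has unit length, cosine similarity between two embeddings equals their dot product.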

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("marroyo777/bge-99GPT-v1")
# Run inference
sentences = [
    'In what context is traffic flow theory typically discussed?',
    'As a result, I was familiar with many terms discussed conceptually but I discovered some of the more official terminology used when discussing traffic flow theory and network control.',
    'There are different types of projects within C.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
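
For semantic search, the same similarity scores can be used to rank candidate passages against a query. The following is a minimal sketch of that ranking step on toy 3-dimensional vectors that stand in for the output of `model.encode`; `rank_by_cosine` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, documents):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

query_embedding = [1.0, 0.0, 0.0]
document_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.8, 0.6, 0.0],  # partially aligned
    [1.0, 0.1, 0.0],  # almost parallel to the query
]
ranking = rank_by_cosine(query_embedding, document_embeddings)
print([index for index, _ in ranking])
# [2, 1, 0]
```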

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.9987
dot_accuracy 0.0012
manhattan_accuracy 0.9987
euclidean_accuracy 0.9987
max_accuracy 0.9987
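
As triplet accuracy is commonly defined, cosine_accuracy is the fraction of evaluation triplets in which the anchor is more similar to its positive than to its negative. The sketch below computes that quantity on toy 2-dimensional triplets. It illustrates the definition under that assumption, it is not the evaluator's own code, and `triplet_cosine_accuracy` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)

triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),    # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),    # positive is closer: correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),   # negative is closer: incorrect
    ([0.5, 0.5], [0.4, 0.6], [-1.0, 0.0]),   # positive is closer: correct
]
accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)
# 0.75
```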

Training Details

Training Dataset

Unnamed Dataset

  • Size: 60,341 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    column    type    min       mean          max
    anchor    string  7 tokens  13.77 tokens  24 tokens
    positive  string  7 tokens  40.26 tokens  123 tokens
    negative  string  6 tokens  39.24 tokens  139 tokens
  • Samples:
    Sample 1
      anchor: Who is being invited to join the initiative?
      positive: Our belief is that the research community will be able to gain access to diverse and real-time data with minimal friction, build exciting innovations and make an impact to Data and AI technologies as well. This is just the first release and we are inviting the research community to join us to build exciting data-driven mobility & energy solutions together.
      negative: Burning it destroys the oil. Once you burn the oil, that particular oil ceases to exist.
    Sample 2
      anchor: What is the main focus of the research conducted for Orbit?
      positive: Orbit holds the culmination of almost a year of research with participants from a wide variety of backgrounds, needs, and jobs to be done.
      negative: So how do you win a hackathon mobility challenge? The SmartRoute team showed two of them.
    Sample 3
      anchor: What role do LLMs play in HRI's strategy?
      positive: We are excited about the potential of JournAI to transform mobility. By harnessing the power of LLMs and other AI technologies, HRI is driving towards a more connected, efficient, and sustainable future.
      negative: This simplified the process for users, who only had to pull and run the docker image to spawn a Jupyterlab app on their machine, open it in their browser, and create a new Pyspark notebook that automatically connected to our spark cluster. Our new workflow allows data science teams to configure their spark jobs and compute resources with options to request memory and CPU from the cluster and customize spark settings.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
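
MultipleNegativesRankingLoss treats, for each anchor, its own positive as the correct choice and every other positive and negative in the batch as a distractor, then applies cross-entropy to the similarities multiplied by `scale`. The pure-Python sketch below follows that description with the parameters listed above (scale 20.0, cosine similarity). It is an illustration on toy 2-dimensional vectors, not the library's implementation, and the function name is only borrowed for readability.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positives[i] among all candidates."""
    candidates = positives + negatives  # in-batch positives first, then the negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[i]  # negative log-probability of the true positive
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_matched = multiple_negatives_ranking_loss(anchors, positives, negatives)
loss_swapped = multiple_negatives_ranking_loss(anchors, positives[::-1], negatives)
print(loss_matched < loss_swapped)
# True
```

Larger batches supply more distractors per anchor, which is why the batch size matters for this loss.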
    

Evaluation Dataset

Unnamed Dataset

  • Size: 15,086 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    column    type    min       mean          max
    anchor    string  6 tokens  13.73 tokens  24 tokens
    positive  string  6 tokens  39.51 tokens  131 tokens
    negative  string  6 tokens  36.9 tokens   153 tokens
  • Samples:
    Sample 1
      anchor: What does the text suggest about the balance between creating tools and their practical application?
      positive: From technology to healthcare, these examples underline the importance of the interplay between theory and practice, between creating advanced tools and applying them effectively.
      negative: We found success when leaving the later panels empty as opposed to earlier ones. If we established a clear context and pain point for participants, they were often able to fill in a solution and resolution themselves.
    Sample 2
      anchor: Who are the personas mentioned in the text?
      positive: Our derived data sets are created based on personas that we have identified and their data access needs.
      negative: However there still exists a need to connect the map matched nodes that are outputted from the libraries to specific data points from the V2X data, in order to get the rest of the V2X features in a specific time frame.
    Sample 3
      anchor: Is this the first or second hackathon mentioned?
      positive: Up next is the first of two hackathons we participated in at Ohio State University.
      negative: The team did a great job by targeting a pervasive issue in such an intuitive way.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
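
These settings, together with the dataset size and the learning_rate (5e-05), num_train_epochs (3), and lr_scheduler_type (linear) values from the full hyperparameter list, determine the length of training and the warmup period. The arithmetic is sketched below, assuming a single device, no gradient accumulation, a kept final partial batch, and the usual linear warmup followed by linear decay to zero; `linear_lr` is a hypothetical helper.

```python
import math

train_samples = 60_341   # size of the training dataset
batch_size = 16          # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
learning_rate = 5e-05    # learning_rate
warmup_ratio = 0.1       # warmup_ratio

# Assumption: one device, gradient_accumulation_steps = 1, last partial batch kept
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def linear_lr(step):
    """Learning rate at an optimizer step under linear warmup, then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return learning_rate * max(0.0, remaining)

print(steps_per_epoch, total_steps, warmup_steps)
# 3772 11316 1132
```

The total of 11316 steps agrees with the final step reported in the training logs.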

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss 99GPT-Finetuning-Embedding-test-01_max_accuracy
0.0265 100 0.7653 0.4309 -
0.0530 200 0.4795 0.2525 -
0.0795 300 0.3416 0.1996 -
0.1060 400 0.2713 0.1699 -
0.1326 500 0.2271 0.1558 -
0.1591 600 0.2427 0.1510 -
0.1856 700 0.2188 0.1414 -
0.2121 800 0.1936 0.1350 -
0.2386 900 0.2174 0.1370 -
0.2651 1000 0.2104 0.1265 -
0.2916 1100 0.2142 0.1324 -
0.3181 1200 0.2088 0.1297 -
0.3446 1300 0.1865 0.1240 -
0.3712 1400 0.177 0.1221 -
0.3977 1500 0.1735 0.1296 -
0.4242 1600 0.1746 0.1188 -
0.4507 1700 0.1639 0.1178 -
0.4772 1800 0.1958 0.1105 -
0.5037 1900 0.1874 0.1152 -
0.5302 2000 0.1676 0.1143 -
0.5567 2100 0.1671 0.1067 -
0.5832 2200 0.142 0.1154 -
0.6098 2300 0.1668 0.1150 -
0.6363 2400 0.1605 0.1091 -
0.6628 2500 0.1475 0.1096 -
0.6893 2600 0.1668 0.1066 -
0.7158 2700 0.166 0.1067 -
0.7423 2800 0.1611 0.0999 -
0.7688 2900 0.1747 0.1001 -
0.7953 3000 0.1436 0.1065 -
0.8218 3100 0.1579 0.0992 -
0.8484 3200 0.1718 0.1006 -
0.8749 3300 0.1567 0.0995 -
0.9014 3400 0.1634 0.0954 -
0.9279 3500 0.1441 0.0956 -
0.9544 3600 0.1433 0.0991 -
0.9809 3700 0.1562 0.0931 -
1.0074 3800 0.1421 0.0931 -
1.0339 3900 0.1424 0.0956 -
1.0604 4000 0.128 0.0900 -
1.0870 4100 0.1265 0.0921 -
1.1135 4200 0.1062 0.0944 -
1.1400 4300 0.1221 0.0900 -
1.1665 4400 0.1091 0.0944 -
1.1930 4500 0.091 0.0913 -
1.2195 4600 0.0823 0.0935 -
1.2460 4700 0.0946 0.0949 -
1.2725 4800 0.0803 0.0890 -
1.2990 4900 0.0796 0.0885 -
1.3256 5000 0.0699 0.0921 -
1.3521 5100 0.073 0.0909 -
1.3786 5200 0.0608 0.0934 -
1.4051 5300 0.07 0.0941 -
1.4316 5400 0.0732 0.0896 -
1.4581 5500 0.0639 0.0910 -
1.4846 5600 0.0722 0.0874 -
1.5111 5700 0.0635 0.0925 -
1.5376 5800 0.0631 0.0887 -
1.5642 5900 0.0589 0.0896 -
1.5907 6000 0.0636 0.0925 -
1.6172 6100 0.0702 0.0938 -
1.6437 6200 0.0572 0.0921 -
1.6702 6300 0.0516 0.0946 -
1.6967 6400 0.0695 0.0902 -
1.7232 6500 0.0632 0.0917 -
1.7497 6600 0.0697 0.0832 -
1.7762 6700 0.0747 0.0853 -
1.8028 6800 0.0615 0.0892 -
1.8293 6900 0.0747 0.0855 -
1.8558 7000 0.0668 0.0848 -
1.8823 7100 0.0747 0.0853 -
1.9088 7200 0.0774 0.0847 -
1.9353 7300 0.0546 0.0874 -
1.9618 7400 0.0708 0.0879 -
1.9883 7500 0.0632 0.0863 -
2.0148 7600 0.0601 0.0873 -
2.0414 7700 0.063 0.0870 -
2.0679 7800 0.0646 0.0819 -
2.0944 7900 0.0557 0.0825 -
2.1209 8000 0.0444 0.0841 -
2.1474 8100 0.049 0.0825 -
2.1739 8200 0.0441 0.0845 -
2.2004 8300 0.0451 0.0844 -
2.2269 8400 0.0346 0.0851 -
2.2534 8500 0.0398 0.0847 -
2.2800 8600 0.033 0.0855 -
2.3065 8700 0.0355 0.0851 -
2.3330 8800 0.0313 0.0867 -
2.3595 8900 0.0358 0.0870 -
2.3860 9000 0.0251 0.0867 -
2.4125 9100 0.0395 0.0854 -
2.4390 9200 0.0322 0.0838 -
2.4655 9300 0.0355 0.0847 -
2.4920 9400 0.034 0.0834 -
2.5186 9500 0.0345 0.0862 -
2.5451 9600 0.0272 0.0830 -
2.5716 9700 0.0275 0.0831 -
2.5981 9800 0.0345 0.0849 -
2.6246 9900 0.0289 0.0849 -
2.6511 10000 0.0282 0.0860 -
2.6776 10100 0.0279 0.0885 -
2.7041 10200 0.0344 0.0865 -
2.7306 10300 0.0326 0.0863 -
2.7572 10400 0.0383 0.0840 -
2.7837 10500 0.0338 0.0833 -
2.8102 10600 0.0298 0.0836 -
2.8367 10700 0.0402 0.0825 -
2.8632 10800 0.0361 0.0822 -
2.8897 10900 0.0388 0.0818 -
2.9162 11000 0.0347 0.0821 -
2.9427 11100 0.0341 0.0826 -
2.9692 11200 0.0373 0.0825 -
2.9958 11300 0.0354 0.0824 -
3.0 11316 - - 0.9987
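
The log rows are whitespace-separated, so they can be parsed directly, for example to find the step with the lowest loss on the evaluation dataset. A small sketch on an excerpt of the rows above follows; `parse_log_rows` is a hypothetical helper, and it assumes the fourth column is the evaluation loss.

```python
def parse_log_rows(text):
    """Parse 'epoch step training_loss eval_loss metric' rows, skipping rows without losses."""
    rows = []
    for line in text.strip().splitlines():
        epoch, step, train_loss, eval_loss, _metric = line.split()
        if eval_loss == "-":
            continue  # the final row reports only the accuracy metric
        rows.append((float(epoch), int(step), float(train_loss), float(eval_loss)))
    return rows

# Excerpt copied from the training logs
excerpt = """
0.0265 100 0.7653 0.4309 -
1.0074 3800 0.1421 0.0931 -
2.0679 7800 0.0646 0.0819 -
2.8897 10900 0.0388 0.0818 -
2.9958 11300 0.0354 0.0824 -
3.0 11316 - - 0.9987
"""
rows = parse_log_rows(excerpt)
best = min(rows, key=lambda row: row[3])
print(best[1], best[3])
# 10900 0.0818
```

Over the full table the minimum is likewise 0.0818 at step 10900. Since load_best_model_at_end is False, the published weights presumably come from the final step rather than from that checkpoint.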

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 33.4M params (Safetensors, tensor type F32)

Evaluation results

  • Cosine Accuracy on 99GPT Finetuning Embedding test 01: 0.999 (self-reported)
  • Dot Accuracy on 99GPT Finetuning Embedding test 01: 0.001 (self-reported)
  • Manhattan Accuracy on 99GPT Finetuning Embedding test 01: 0.999 (self-reported)
  • Euclidean Accuracy on 99GPT Finetuning Embedding test 01: 0.999 (self-reported)
  • Max Accuracy on 99GPT Finetuning Embedding test 01: 0.999 (self-reported)