---
base_model: BAAI/bge-m3
datasets: []
language: []
library_name: sentence-transformers
metrics:
- cosine_accuracy
- dot_accuracy
- manhattan_accuracy
- euclidean_accuracy
- max_accuracy
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:45000
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: Seorang pria sedang tidur.
sentences:
- Seorang pria berambut panjang memegang semacam pita.
- Seorang pria tidur di sofa di pinggir jalan.
- Seekor hewan yang mencoba mengeringkan dirinya.
- source_sentence: Ada beberapa orang yang hadir.
sentences:
- Orang tua tidur sendirian di pesawat dengan tas di pangkuannya.
- >-
Seorang wanita dengan rambut pirang disanggul dan mengenakan kacamata
hitam berdiri di dekat tenda hitam dan putih.
- Tiga peselancar angin di lautan, satu di antaranya sedang mengudara.
- source_sentence: Ada dua anjing di luar.
sentences:
- >-
Seorang pria mengenakan kemeja berkancing biru dan celana panjang sedang
tidur di etalase toko.
- >-
Seekor anjing putih berjalan melintasi rerumputan berdaun lebat
sementara seekor anjing coklat hendak menggigitnya.
- Dua anjing krem sedang bermain di salju.
- source_sentence: >-
Seorang wanita sedang memainkan gitar di atas panggung dengan latar
belakang hijau.
sentences:
- Warna hijau tidak ada dalam bingkai sama sekali.
- Seorang wanita dan seorang pria memainkan alat musik di trotoar kota.
- Wanita itu sedang memainkan musik.
- source_sentence: Seorang anak laki-laki sedang membaca.
sentences:
- >-
Seorang pria sedang tidur di kursi dan dikelilingi oleh banyak ayam di
dalam kandang.
- Seorang anak baru saja memukul bola saat bermain T-ball.
- >-
Anak laki-laki kecil duduk di kursi modern yang besar, membaca buku
anak-anak.
model-index:
- name: SentenceTransformer based on BAAI/bge-m3
results:
- task:
type: triplet
name: Triplet
dataset:
name: model evaluation
type: model-evaluation
metrics:
- type: cosine_accuracy
value: 0.9596
name: Cosine Accuracy
- type: dot_accuracy
value: 0.0404
name: Dot Accuracy
- type: manhattan_accuracy
value: 0.9592
name: Manhattan Accuracy
- type: euclidean_accuracy
value: 0.9596
name: Euclidean Accuracy
- type: max_accuracy
value: 0.9596
name: Max Accuracy
---
SentenceTransformer based on BAAI/bge-m3
This is a sentence-transformers model finetuned from BAAI/bge-m3. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-m3
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
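The pooling and normalization stages above can be sketched in isolation: the Pooling module with `pooling_mode_cls_token=True` keeps only the first ([CLS]) token embedding, and `Normalize()` scales it to unit L2 norm. A minimal numpy sketch on toy data (the real model produces 1024-dimensional embeddings for sequences of up to 8192 tokens):

```python
import numpy as np

# Toy stand-in for the transformer output: 4 token embeddings of dimension 6.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 6))

# (1) Pooling with pooling_mode_cls_token=True: keep only the first ([CLS]) token.
sentence_embedding = token_embeddings[0]

# (2) Normalize(): scale to unit L2 norm, so dot product equals cosine similarity.
sentence_embedding = sentence_embedding / np.linalg.norm(sentence_embedding)

print(sentence_embedding.shape)            # (6,)
print(np.linalg.norm(sentence_embedding))  # ~1.0
```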
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MarcoAland/Indonesian-bge-m3")

# Run inference
sentences = [
    'Seorang anak laki-laki sedang membaca.',
    'Anak laki-laki kecil duduk di kursi modern yang besar, membaca buku anak-anak.',
    'Seorang anak baru saja memukul bola saat bermain T-ball.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
Evaluation
Metrics
Triplet
- Dataset: model-evaluation
- Evaluated with TripletEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy | 0.9596 |
| dot_accuracy | 0.0404 |
| manhattan_accuracy | 0.9592 |
| euclidean_accuracy | 0.9596 |
| max_accuracy | 0.9596 |
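Each accuracy above is the fraction of (anchor, positive, negative) triplets for which the anchor is closer to the positive than to the negative under that distance or similarity function. Note that dot_accuracy is 0.0404 = 1 − cosine_accuracy, which appears consistent with the evaluator treating the raw dot product as a distance; since the embeddings are L2-normalized, dot product and cosine similarity rank triplets identically. A minimal numpy sketch of the computation; the embedding arrays here are hypothetical stand-ins, not model outputs:

```python
import numpy as np

def triplet_accuracies(anchors, positives, negatives):
    """Fraction of triplets where the anchor is closer to the positive
    than to the negative, per distance/similarity function."""
    def cos(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return {
        "cosine": float(np.mean(cos(anchors, positives) > cos(anchors, negatives))),
        "euclidean": float(np.mean(
            np.linalg.norm(anchors - positives, axis=1)
            < np.linalg.norm(anchors - negatives, axis=1))),
        "manhattan": float(np.mean(
            np.abs(anchors - positives).sum(axis=1)
            < np.abs(anchors - negatives).sum(axis=1))),
    }

# Synthetic triplets: positives sit near their anchors, negatives are random.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(100, 8))
positives = anchors + 0.1 * rng.normal(size=(100, 8))
negatives = rng.normal(size=(100, 8))
print(triplet_accuracies(anchors, positives, negatives))
```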
Training Details
Training Dataset
Unnamed Dataset
- Size: 45,000 training samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 1000 samples:

| | anchor | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 6 tokens, mean: 10.02 tokens, max: 61 tokens | min: 5 tokens, mean: 16.08 tokens, max: 79 tokens | min: 5 tokens, mean: 16.47 tokens, max: 52 tokens |
- Samples:

| anchor | positive | negative |
|---|---|---|
| Dua pengendara sepeda motor berlomba di lintasan miring. | Lintasan pada gambar tidak sepenuhnya datar. | Pengendara sepeda motor memakai sarung tangannya sebelum balapan |
| Pria itu ada di luar. | Seorang pria berpakaian hitam sedang memegang kantong sampah hitam dan memungut barang-barang dari tempat pembuangan tanah. | Seorang pria mengenakan jas hitam dikelilingi oleh banyak orang di dalam sebuah gedung dengan patung dada orang di dinding. |
| Orang-orang ada di luar ruangan. | Ada orang-orang yang menonton band bermain di luar ruangan dan seorang anak berada di latar depan. | Dua orang bertopi baseball sedang duduk di dalam ruang kantor besar dan menatap layar komputer. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
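For each anchor in a batch, MultipleNegativesRankingLoss treats the anchor's own positive as the correct candidate and every other in-batch positive as a negative, then applies cross-entropy over cosine similarities multiplied by the scale factor (20.0 here). A minimal numpy sketch of that computation on synthetic embeddings:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities: for anchor i,
    positives[i] is the target and positives[j != i] act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                          # (batch, batch) scaled cos_sim
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # correct class is the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
positives = anchors + 0.05 * rng.normal(size=(4, 16))  # well-matched pairs
print(mnr_loss(anchors, positives))  # small for matched pairs
```

During training on (anchor, positive, negative) triplets, the negative column supplies one extra hard negative per anchor on top of the in-batch negatives; the sketch above shows only the in-batch part.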
Evaluation Dataset
Unnamed Dataset
- Size: 5,000 evaluation samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 1000 samples:

| | anchor | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 4 tokens, mean: 9.88 tokens, max: 65 tokens | min: 6 tokens, mean: 16.1 tokens, max: 50 tokens | min: 5 tokens, mean: 16.69 tokens, max: 46 tokens |
- Samples:

| anchor | positive | negative |
|---|---|---|
| Anjing itu sedang berlari. | Seekor anjing coklat mengejar bola di rumput | Anjing itu berbaring telentang di dekat bola hijau. |
| Seorang pria sedang tidur. | Seorang pria sedang tidur siang di kereta. | Pria muda bekerja di laboratorium sains. |
| Seorang pria sedang tidur. | Seorang pria sedang tidur di dalam bus. | seorang pria mendayung ganilla menyusuri jalan setapak yang berair |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 4
- num_train_epochs: 1
- warmup_ratio: 0.1
- batch_sampler: no_duplicates
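The non-default values above map onto `SentenceTransformerTrainingArguments` (available since Sentence Transformers v3.0). A sketch of how they would be passed; `output_dir` is a hypothetical placeholder, and this is illustrative rather than the exact training script:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # hypothetical path, not from the original run
    eval_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    warmup_ratio=0.1,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)
```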
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 4
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
| Epoch | Step | Training Loss | Validation Loss | model-evaluation_max_accuracy |
|---|---|---|---|---|
| 0.0089 | 100 | 0.81 | 0.5528 | - |
| 0.0178 | 200 | 0.5397 | 0.4948 | - |
| 0.0267 | 300 | 0.5349 | 0.5147 | - |
| 0.0356 | 400 | 0.5342 | 0.5475 | - |
| 0.0444 | 500 | 0.4433 | 0.5679 | 0.9596 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Accelerate: 0.32.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```