Indonesian-bge-m3 / README.md
metadata
base_model: BAAI/bge-m3
datasets: []
language: []
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - dot_accuracy
  - manhattan_accuracy
  - euclidean_accuracy
  - max_accuracy
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:45000
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: Seorang pria sedang tidur.
    sentences:
      - Seorang pria berambut panjang memegang semacam pita.
      - Seorang pria tidur di sofa di pinggir jalan.
      - Seekor hewan yang mencoba mengeringkan dirinya.
  - source_sentence: Ada beberapa orang yang hadir.
    sentences:
      - Orang tua tidur sendirian di pesawat dengan tas di pangkuannya.
      - >-
        Seorang wanita dengan rambut pirang disanggul dan mengenakan kacamata
        hitam berdiri di dekat tenda hitam dan putih.
      - Tiga peselancar angin di lautan, satu di antaranya sedang mengudara.
  - source_sentence: Ada dua anjing di luar.
    sentences:
      - >-
        Seorang pria mengenakan kemeja berkancing biru dan celana panjang sedang
        tidur di etalase toko.
      - >-
        Seekor anjing putih berjalan melintasi rerumputan berdaun lebat
        sementara seekor anjing coklat hendak menggigitnya.
      - Dua anjing krem sedang bermain di salju.
  - source_sentence: >-
      Seorang wanita sedang memainkan gitar di atas panggung dengan latar
      belakang hijau.
    sentences:
      - Warna hijau tidak ada dalam bingkai sama sekali.
      - Seorang wanita dan seorang pria memainkan alat musik di trotoar kota.
      - Wanita itu sedang memainkan musik.
  - source_sentence: Seorang anak laki-laki sedang membaca.
    sentences:
      - >-
        Seorang pria sedang tidur di kursi dan dikelilingi oleh banyak ayam di
        dalam kandang.
      - Seorang anak baru saja memukul bola saat bermain T-ball.
      - >-
        Anak laki-laki kecil duduk di kursi modern yang besar, membaca buku
        anak-anak.
model-index:
  - name: SentenceTransformer based on BAAI/bge-m3
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: model evaluation
          type: model-evaluation
        metrics:
          - type: cosine_accuracy
            value: 0.9596
            name: Cosine Accuracy
          - type: dot_accuracy
            value: 0.0404
            name: Dot Accuracy
          - type: manhattan_accuracy
            value: 0.9592
            name: Manhattan Accuracy
          - type: euclidean_accuracy
            value: 0.9596
            name: Euclidean Accuracy
          - type: max_accuracy
            value: 0.9596
            name: Max Accuracy

SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model fine-tuned from BAAI/bge-m3. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-m3
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
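
The three modules listed above amount to: run the transformer, take the first (CLS) token's vector as the sentence embedding, then scale it to unit length. The following is a minimal pure-Python sketch of the last two steps, not the library's implementation. The helper names (`cls_pool`, `l2_normalize`) and the toy 3-dimensional vectors are assumptions made for illustration; the real embeddings have 1024 dimensions.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length, as a normalization step would."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy per-token embeddings (hypothetical values); row 0 plays the role of CLS.
tokens_a = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
tokens_b = [[0.0, 5.0, 0.0], [2.0, 0.0, 2.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# For unit-length vectors, the dot product equals the cosine similarity
# of the unnormalized vectors.
print(dot(emb_a, emb_b), cosine(cls_pool(tokens_a), cls_pool(tokens_b)))
```

Because the output is normalized, a plain dot product between two embeddings gives the same ranking as cosine similarity.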

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MarcoAland/Indonesian-bge-m3")
# Run inference
sentences = [
    'Seorang anak laki-laki sedang membaca.',
    'Anak laki-laki kecil duduk di kursi modern yang besar, membaca buku anak-anak.',
    'Seorang anak baru saja memukul bola saat bermain T-ball.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
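
For semantic search, the same embeddings can be used to rank candidate sentences against a query. The sketch below is one possible helper, not part of the library: `rank_candidates` and `cosine_similarity` are hypothetical names, and the only thing assumed about `model` is that it has an `encode` method that takes a list of strings and returns one vector per string, as `SentenceTransformer.encode` does.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(model, query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    # Encode the query and the candidates in a single call.
    embeddings = model.encode([query] + list(candidates))
    query_embedding, candidate_embeddings = embeddings[0], embeddings[1:]
    scored = [
        (text, cosine_similarity(query_embedding, emb))
        for text, emb in zip(candidates, candidate_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the model loaded as above, `rank_candidates(model, sentences[0], sentences[1:])` would return the two candidate sentences ordered by their similarity to the first one.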

Evaluation

Metrics

Triplet

| Metric             | Value  |
|--------------------|--------|
| cosine_accuracy    | 0.9596 |
| dot_accuracy       | 0.0404 |
| manhattan_accuracy | 0.9592 |
| euclidean_accuracy | 0.9596 |
| max_accuracy       | 0.9596 |
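
As a rough guide to reading these numbers, triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative. The sketch below computes the cosine variant on toy vectors. It is an illustration under that assumption, not the evaluator's actual code, and the function names and 2-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)

# Toy embeddings (hypothetical values) for four triplets.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),    # positive is closer: counted as correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),    # positive is closer: counted as correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),   # negative is closer: counted as wrong
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),   # positive is closer: counted as correct
]

accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)  # 0.75
```

The Manhattan and Euclidean variants would follow the same pattern with a distance function in place of the similarity and the comparison reversed.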

Training Details

Training Dataset

Unnamed Dataset

  • Size: 45,000 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    |         | anchor | positive | negative |
    |---------|--------|----------|----------|
    | type    | string | string   | string   |
    | details | min: 6 tokens, mean: 10.02 tokens, max: 61 tokens | min: 5 tokens, mean: 16.08 tokens, max: 79 tokens | min: 5 tokens, mean: 16.47 tokens, max: 52 tokens |
  • Samples:
    | anchor | positive | negative |
    |--------|----------|----------|
    | Dua pengendara sepeda motor berlomba di lintasan miring. | Lintasan pada gambar tidak sepenuhnya datar. | Pengendara sepeda motor memakai sarung tangannya sebelum balapan |
    | Pria itu ada di luar. | Seorang pria berpakaian hitam sedang memegang kantong sampah hitam dan memungut barang-barang dari tempat pembuangan tanah. | Seorang pria mengenakan jas hitam dikelilingi oleh banyak orang di dalam sebuah gedung dengan patung dada orang di dinding. |
    | Orang-orang ada di luar ruangan. | Ada orang-orang yang menonton band bermain di luar ruangan dan seorang anak berada di latar depan. | Dua orang bertopi baseball sedang duduk di dalam ruang kantor besar dan menatap layar komputer. |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
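
To make the two parameters concrete: with `similarity_fct` set to cosine similarity and `scale` set to 20.0, the loss can be read as a cross-entropy over scaled similarities, where each anchor must pick its own positive out of all positives and negatives in the batch. The sketch below is a pure-Python approximation of that idea on toy vectors, not the library's implementation; the function names and the 2-dimensional embeddings are assumptions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cosine_similarity(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Negative log-probability assigned to the correct candidate.
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch of two triplets (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

# Candidates are the positives followed by the negatives.
good_loss = multiple_negatives_ranking_loss(anchors, positives + negatives)
# Swapping the positives makes every anchor point at the wrong candidate.
bad_loss = multiple_negatives_ranking_loss(anchors, positives[::-1] + negatives)
print(good_loss, bad_loss)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity produce larger differences in loss.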
    

Evaluation Dataset

Unnamed Dataset

  • Size: 5,000 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    |         | anchor | positive | negative |
    |---------|--------|----------|----------|
    | type    | string | string   | string   |
    | details | min: 4 tokens, mean: 9.88 tokens, max: 65 tokens | min: 6 tokens, mean: 16.1 tokens, max: 50 tokens | min: 5 tokens, mean: 16.69 tokens, max: 46 tokens |
  • Samples:
    | anchor | positive | negative |
    |--------|----------|----------|
    | Anjing itu sedang berlari. | Seekor anjing coklat mengejar bola di rumput | Anjing itu berbaring telentang di dekat bola hijau. |
    | Seorang pria sedang tidur. | Seorang pria sedang tidur siang di kereta. | Pria muda bekerja di laboratorium sains. |
    | Seorang pria sedang tidur. | Seorang pria sedang tidur di dalam bus. | seorang pria mendayung ganilla menyusuri jalan setapak yang berair |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

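| Epoch  | Step | Training Loss | loss   | model-evaluation_max_accuracy |
|--------|------|---------------|--------|-------------------------------|
| 0.0089 | 100  | 0.81          | 0.5528 | -                             |
| 0.0178 | 200  | 0.5397        | 0.4948 | -                             |
| 0.0267 | 300  | 0.5349        | 0.5147 | -                             |
| 0.0356 | 400  | 0.5342        | 0.5475 | -                             |
| 0.0444 | 500  | 0.4433        | 0.5679 | 0.9596                        |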

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.4
  • PyTorch: 2.3.1+cu121
  • Accelerate: 0.32.1
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}