base_model: nomic-ai/nomic-embed-text-v1
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
  - dot_accuracy@1
  - dot_accuracy@3
  - dot_accuracy@5
  - dot_accuracy@10
  - dot_precision@1
  - dot_precision@3
  - dot_precision@5
  - dot_precision@10
  - dot_recall@1
  - dot_recall@3
  - dot_recall@5
  - dot_recall@10
  - dot_ndcg@10
  - dot_mrr@10
  - dot_map@100
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:2459
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      What types of applications may require confidentiality during their
      launch?
    sentences:
      - >-
        Taken together, the technical protections and practices laid out in the
        Blueprint for an AI Bill of Rights can help 

        guard the American public against many of the potential and actual harms
        identified by researchers, technolo­

        gists, advocates, journalists, policymakers, and communities in the
        United States and around the world. This 

        technical companion is intended to be used as a reference by people
        across many circumstances  anyone
      - >-
        deactivate AI systems that demonstrate performance or outcomes
        inconsistent with intended use. 

        Action ID 

        Suggested Action 

        GAI Risks 

        MG-2.4-001 

        Establish and maintain communication plans to inform AI stakeholders as
        part of 

        the deactivation or disengagement process of a specific GAI system
        (including for 

        open-source models) or context of use, including reasons, workarounds,
        user 

        access removal, alternative processes, contact information, etc. 

        Human-AI Configuration
      - >-
        launch may need to be confidential. Government applications,
        particularly law enforcement applications or 

        applications that raise national security considerations, may require
        confidential or limited engagement based 

        on system sensitivities and preexisting oversight laws and structures.
        Concerns raised in this consultation 

        should be documented, and the automated system developers were proposing
        to create, use, or deploy should 

        be reconsidered based on this feedback.
  - source_sentence: >-
      What is the main focus of the paper by Chandra et al. (2023) regarding
      Chinese influence operations?
    sentences:
      - >-
        https://arxiv.org/abs/2403.06634 

        Chandra, B. et al. (2023) Dismantling the Disinformation Business of
        Chinese Influence Operations. 

        RAND.
        https://www.rand.org/pubs/commentary/2023/10/dismantling-the-disinformation-business-of-

        chinese.html 

        Ciriello, R. et al. (2024) Ethical Tensions in Human-AI Companionship: A
        Dialectical Inquiry into Replika. 

        ResearchGate.
        https://www.researchgate.net/publication/374505266_Ethical_Tensions_in_Human-

        AI_Companionship_A_Dialectical_Inquiry_into_Replika
      - >-
        monocultures,3” resulting from repeated use of the same model, or
        impacts on access to 

        opportunity, labor markets, and the creative economies.4 

         

        Source of risk: Risks may emerge from factors related to the design,
        training, or operation of the 

        GAI model itself, stemming in some cases from GAI model or system
        inputs, and in other cases, 

        from GAI system outputs. Many GAI risks, however, originate from human
        behavior, including
      - >-
        limited to GAI model or system architecture, training mechanisms and
        libraries, data types used for 

        training or fine-tuning, levels of model access or availability of model
        weights, and application or use 

        case context. 

        Organizations may choose to tailor how they measure GAI risks based on
        these characteristics. They may 

        additionally wish to allocate risk management resources relative to the
        severity and likelihood of
  - source_sentence: >-
      What steps are being taken to enhance transparency and accountability in
      the GAI system?
    sentences:
      - >-
        security, health, foreign relations, the environment, and the
        technological recovery and use of resources, among 

        other topics. OSTP leads interagency science and technology policy
        coordination efforts, assists the Office of 

        Management and Budget (OMB) with an annual review and analysis of
        Federal research and development in 

        budgets, and serves as a source of scientific and technological analysis
        and judgment for the President with
      - >-
        steps taken to update the GAI system to enhance transparency and 

        accountability. 

        Human-AI Configuration; Harmful 

        Bias and Homogenization 

        MG-4.1-006 

        Track dataset modifications for provenance by monitoring data deletions, 

        rectification requests, and other changes that may impact the
        verifiability of 

        content origins. 

        Information Integrity
      - >-
        content. Some well-known techniques for provenance data tracking include
        digital watermarking, 

        metadata recording, digital fingerprinting, and human authentication,
        among others. 

        Provenance Data Tracking Approaches 

        Provenance data tracking techniques for GAI systems can be used to track
        the history and origin of data 

        inputs, metadata, and synthetic content. Provenance data tracking
        records the origin and history for
  - source_sentence: >-
      What are some examples of mechanisms for human consideration and fallback
      mentioned in the context?
    sentences:
      - >-
        consequences resulting from the utilization of content provenance
        approaches on users and 

        communities. Furthermore, organizations can track and document the
        provenance of datasets to identify 

        instances in which AI-generated data is a potential root cause of
        performance issues with the GAI 

        system. 

        A.1.8. Incident Disclosure 

        Overview 

        AI incidents can be defined as an “event, circumstance, or series of
        events where the development, use,
      - >-
        fully impact rights, opportunities, or access. Automated systems that
        have greater control over outcomes, 

        provide input to high-stakes decisions, relate to sensitive domains, or
        otherwise have a greater potential to 

        meaningfully impact rights, opportunities, or access should have greater
        availability (e.g., staffing) and over­

        sight of human consideration and fallback mechanisms. 

        Accessible. Mechanisms for human consideration and fallback, whether
        in-person, on paper, by phone, or
      - >-


        Frida Polli, CEO, Pymetrics

        

        Karen Levy, Assistant Professor, Department of Information Science,
        Cornell University

        

        Natasha Duarte, Project Director, Upturn

        

        Elana Zeide, Assistant Professor, University of Nebraska College of Law

        

        Fabian Rogers, Constituent Advocate, Office of NY State Senator Jabari
        Brisport and Community

        Advocate and Floor Captain, Atlantic Plaza Towers Tenants Association
  - source_sentence: >-
      What mental health issues are associated with the increased use of
      technologies in schools and workplaces?
    sentences:
      - >-
        but this approach may still produce harmful recommendations in response
        to other less-explicit, novel 

        prompts (also relevant to CBRN Information or Capabilities, Data
        Privacy, Information Security, and 

        Obscene, Degrading and/or Abusive Content). Crafting such prompts
        deliberately is known as 

        “jailbreaking,” or, manipulating prompts to circumvent output controls.
        Limitations of GAI systems can be
      - >-
        external use, narrow vs. broad application scope, fine-tuning, and
        varieties of 

        data sources (e.g., grounding, retrieval-augmented generation). 

        Data Privacy; Intellectual 

        Property
      - >-
        technologies has increased in schools and workplaces, and, when coupled
        with consequential management and 

        evaluation decisions, it is leading to mental health harms such as
        lowered self-confidence, anxiety, depression, and 

        a reduced ability to use analytical reasoning.61 Documented patterns
        show that personal data is being aggregated by 

        data brokers to profile communities in harmful ways.62 The impact of all
        this data harvesting is corrosive,
model-index:
  - name: SentenceTransformer based on nomic-ai/nomic-embed-text-v1
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.8584142394822006
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.9838187702265372
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9951456310679612
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9991909385113269
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.8584142394822006
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.32793959007551243
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1990291262135922
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09991909385113268
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.8584142394822006
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.9838187702265372
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9951456310679612
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9991909385113269
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.9417951214306157
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.9220443571171728
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.9221065926163013
            name: Cosine Map@100
          - type: dot_accuracy@1
            value: 0.8584142394822006
            name: Dot Accuracy@1
          - type: dot_accuracy@3
            value: 0.9838187702265372
            name: Dot Accuracy@3
          - type: dot_accuracy@5
            value: 0.9951456310679612
            name: Dot Accuracy@5
          - type: dot_accuracy@10
            value: 0.9991909385113269
            name: Dot Accuracy@10
          - type: dot_precision@1
            value: 0.8584142394822006
            name: Dot Precision@1
          - type: dot_precision@3
            value: 0.32793959007551243
            name: Dot Precision@3
          - type: dot_precision@5
            value: 0.1990291262135922
            name: Dot Precision@5
          - type: dot_precision@10
            value: 0.09991909385113268
            name: Dot Precision@10
          - type: dot_recall@1
            value: 0.8584142394822006
            name: Dot Recall@1
          - type: dot_recall@3
            value: 0.9838187702265372
            name: Dot Recall@3
          - type: dot_recall@5
            value: 0.9951456310679612
            name: Dot Recall@5
          - type: dot_recall@10
            value: 0.9991909385113269
            name: Dot Recall@10
          - type: dot_ndcg@10
            value: 0.9417951214306157
            name: Dot Ndcg@10
          - type: dot_mrr@10
            value: 0.9220443571171728
            name: Dot Mrr@10
          - type: dot_map@100
            value: 0.9221065926163013
            name: Dot Map@100

SentenceTransformer based on nomic-ai/nomic-embed-text-v1

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v1. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. In particular, this model was trained on documents that describe frameworks for building ethical AI systems. As a result, it performs well at matching questions to context passages in RAG applications.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/nomic-embed-text-v1
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
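The architecture above chains three stages: a transformer that produces one vector per token, mean pooling over tokens (`pooling_mode_mean_tokens: True`), and L2 normalization. The following is a minimal sketch of the last two stages in plain Python, assuming the token vectors and attention mask are already available; `mean_pool_and_normalize` is a hypothetical helper written for illustration, not part of the library.

```python
import math

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the token vectors of one sequence over non-padding positions,
    then scale the result to unit L2 norm."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    count = max(len(kept), 1)  # guard against an all-padding sequence
    pooled = [sum(vec[d] for vec in kept) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Toy input: 3 token positions, 4 dimensions (the real model uses 768).
tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, masked out below
]
mask = [1, 1, 0]
embedding = mean_pool_and_normalize(tokens, mask)
print(embedding)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.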

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

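# Download from the 🤗 Hub.
# The NomicBertModel architecture is loaded from custom code on the Hub,
# which requires trust_remote_code=True.
model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)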
# Run inference
sentences = [
    'What mental health issues are associated with the increased use of technologies in schools and workplaces?',
    'technologies has increased in schools and workplaces, and, when coupled with consequential management and \nevaluation decisions, it is leading to mental health harms such as lowered self-confidence, anxiety, depression, and \na reduced ability to use analytical reasoning.61 Documented patterns show that personal data is being aggregated by \ndata brokers to profile communities in harmful ways.62 The impact of all this data harvesting is corrosive,',
    'but this approach may still produce harmful recommendations in response to other less-explicit, novel \nprompts (also relevant to CBRN Information or Capabilities, Data Privacy, Information Security, and \nObscene, Degrading and/or Abusive Content). Crafting such prompts deliberately is known as \n“jailbreaking,” or, manipulating prompts to circumvent output controls. Limitations of GAI systems can be',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
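For the RAG use case described earlier, the similarity scores are typically used to rank candidate passages against a question. The sketch below shows that ranking step in plain Python with toy unit vectors standing in for `model.encode(...)` output; `rank_passages` is a hypothetical helper, and the vectors are made up for illustration.

```python
def rank_passages(query_embedding, passage_embeddings, top_k=3):
    """Return (passage_index, score) pairs sorted by descending dot product.

    Assumes all vectors are already L2-normalized, so the dot product equals
    cosine similarity.
    """
    scores = [
        sum(q * p for q, p in zip(query_embedding, passage))
        for passage in passage_embeddings
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Toy unit vectors standing in for model.encode(...) output.
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.6, 0.8, 0.0],  # partially aligned
    [1.0, 0.0, 0.0],  # identical direction
]
ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With real embeddings, the first element of the sentence list would be the question and the remaining elements the candidate passages.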

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.8584
cosine_accuracy@3 0.9838
cosine_accuracy@5 0.9951
cosine_accuracy@10 0.9992
cosine_precision@1 0.8584
cosine_precision@3 0.3279
cosine_precision@5 0.199
cosine_precision@10 0.0999
cosine_recall@1 0.8584
cosine_recall@3 0.9838
cosine_recall@5 0.9951
cosine_recall@10 0.9992
cosine_ndcg@10 0.9418
cosine_mrr@10 0.922
cosine_map@100 0.9221
dot_accuracy@1 0.8584
dot_accuracy@3 0.9838
dot_accuracy@5 0.9951
dot_accuracy@10 0.9992
dot_precision@1 0.8584
dot_precision@3 0.3279
dot_precision@5 0.199
dot_precision@10 0.0999
dot_recall@1 0.8584
dot_recall@3 0.9838
dot_recall@5 0.9951
dot_recall@10 0.9992
dot_ndcg@10 0.9418
dot_mrr@10 0.922
dot_map@100 0.9221
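Two patterns in the table are worth noting. The cosine_* and dot_* rows are identical, which is what one would expect from a model that ends in a Normalize() layer. Each precision@k value equals accuracy@k divided by k (for example, 0.9838 / 3 ≈ 0.3279), which is consistent with every query having exactly one relevant passage; that is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched as follows, where `retrieval_metrics` is a hypothetical helper and the example ranks are invented.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    `ranks` holds, for each query, the 1-based position of its single relevant
    passage in the ranked results (None if it was not retrieved).
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    accuracy = hits / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most 1 of the k results is relevant
        "recall": accuracy,         # the single relevant passage is found or not
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

# Hypothetical ranks for four queries (not taken from the evaluation above).
example_ranks = [1, 1, 2, 4]
metrics_at_3 = retrieval_metrics(example_ranks, k=3)
print(metrics_at_3)
```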

Training Details

Training Dataset

Unnamed Dataset

  • Size: 2,459 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (type: string): min: 2 tokens, mean: 18.7 tokens, max: 35 tokens
    • sentence_1 (type: string): min: 22 tokens, mean: 93.19 tokens, max: 337 tokens
  • Samples:
    • sentence_0: What should organizations include in contracts to evaluate third-party GAI processes and standards?
      sentence_1:
        services acquisition and value chain risk management; and legal compliance.
        Data Privacy; Information
        Integrity; Information Security;
        Intellectual Property; Value Chain
        and Component Integration
        GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party
        GAI processes and standards.
        Information Integrity
        GV-6.1-007 Inventory all third-party entities with access to organizational content and
        establish approved GAI technology and service provider lists.
    • sentence_0: What steps should be taken to manage third-party entities with access to organizational content?
      sentence_1:
        services acquisition and value chain risk management; and legal compliance.
        Data Privacy; Information
        Integrity; Information Security;
        Intellectual Property; Value Chain
        and Component Integration
        GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party
        GAI processes and standards.
        Information Integrity
        GV-6.1-007 Inventory all third-party entities with access to organizational content and
        establish approved GAI technology and service provider lists.
    • sentence_0: What should entities responsible for automated systems establish before deploying the system?
      sentence_1:
        Clear organizational oversight. Entities responsible for the development or use of automated systems
        should lay out clear governance structures and procedures. This includes clearly-stated governance proce­
        dures before deploying the system, as well as responsibility of specific individuals or entities to oversee ongoing
        assessment and mitigation. Organizational stakeholders including those with oversight of the business process
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
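The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss: the base loss is evaluated on the embeddings truncated to each listed dimension, and the per-dimension losses are combined using the listed weights. The following is a dependency-free sketch of that idea, not the library implementation. The `scale=20.0` value is assumed here (the card does not state it), and the function names and toy vectors are hypothetical.

```python
import math

def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss: cross-entropy over scaled cosine
    similarities, where positives[i] is the correct match for anchors[i]."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss evaluated on each truncated prefix."""
    return sum(
        w * mnr_loss([v[:d] for v in anchors], [v[:d] for v in positives])
        for d, w in zip(dims, weights)
    )

# Toy 4-dimensional batch; the card uses dims [768, 512, 256, 128, 64].
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.9, 0.0, 0.3]]
loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Training against every prefix is what allows the embeddings to be truncated to a smaller listed dimension (and re-normalized) at inference time, trading some accuracy for storage and speed.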
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 20
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 20
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
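With `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and no warmup, the learning rate decays in a straight line from its initial value to zero over the run. The sketch below illustrates that schedule; `linear_lr` is a hypothetical helper, and the total of 1540 steps is taken from the final row of the training logs rather than from the hyperparameter list.

```python
def linear_lr(step, base_lr=5e-05, total_steps=1540, warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup/decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start, midpoint, and end of training.
for step in (0, 770, 1540):
    print(step, linear_lr(step))
```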

Training Logs

Epoch Step Training Loss cosine_map@100
0.6494 50 - 0.8493
1.0 77 - 0.8737
1.2987 100 - 0.8677
1.9481 150 - 0.8859
2.0 154 - 0.8886
2.5974 200 - 0.8913
3.0 231 - 0.9058
3.2468 250 - 0.8993
3.8961 300 - 0.9077
4.0 308 - 0.9097
4.5455 350 - 0.9086
5.0 385 - 0.9165
5.1948 400 - 0.9141
5.8442 450 - 0.9132
6.0 462 - 0.9138
6.4935 500 0.3094 0.9137
7.0 539 - 0.9166
7.1429 550 - 0.9172
7.7922 600 - 0.9160
8.0 616 - 0.9169
8.4416 650 - 0.9177
9.0 693 - 0.9169
9.0909 700 - 0.9177
9.7403 750 - 0.9178
10.0 770 - 0.9178
10.3896 800 - 0.9189
11.0 847 - 0.9180
11.0390 850 - 0.9180
11.6883 900 - 0.9188
12.0 924 - 0.9192
12.3377 950 - 0.9204
12.9870 1000 0.0571 0.9202
13.0 1001 - 0.9201
13.6364 1050 - 0.9212
14.0 1078 - 0.9203
14.2857 1100 - 0.9219
14.9351 1150 - 0.9207
15.0 1155 - 0.9207
15.5844 1200 - 0.9210
16.0 1232 - 0.9208
16.2338 1250 - 0.9216
16.8831 1300 - 0.9209
17.0 1309 - 0.9209
17.5325 1350 - 0.9216
18.0 1386 - 0.9213
18.1818 1400 - 0.9221
18.8312 1450 - 0.9217
19.0 1463 - 0.9217
19.4805 1500 0.0574 0.9225
20.0 1540 - 0.9221
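The Epoch and Step columns can be reconciled with the dataset size and batch size reported earlier. The arithmetic below is a small sketch that assumes a single device; gradient accumulation is 1 and `dataloader_drop_last` is False per the hyperparameter list, so the last partial batch counts as a step.

```python
import math

train_samples = 2459
batch_size = 32  # per_device_train_batch_size
epochs = 20      # num_train_epochs

# The final, partial batch of each epoch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer updates."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at(50))
```

This gives 77 steps per epoch and 1540 steps in total, matching the rows for epochs 1.0 and 20.0, and step 50 corresponds to epoch 0.6494 as in the first row.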

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}