metadata
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:40
  - loss:CosineSimilarityLoss
widget:
  - source_sentence: What role does NIST play in establishing AI standards?
    sentences:
      - >-
        provides examples and concrete steps for communities, industry,
        governments, and others to take in order to 

        build these protections into policy, practice, or the technological
        design process. 

        Taken together, the technical protections and practices laid out in the
        Blueprint for an AI Bill of Rights can help 

        guard the American public against many of the potential and actual harms
        identified by researchers, technolo­
      - >-
        provides examples and concrete steps for communities, industry,
        governments, and others to take in order to 

        build these protections into policy, practice, or the technological
        design process. 

        Taken together, the technical protections and practices laid out in the
        Blueprint for an AI Bill of Rights can help 

        guard the American public against many of the potential and actual harms
        identified by researchers, technolo­
      - >-
        Acknowledgments: This report was accomplished with the many helpful
        comments and contributions 

        from the community, including the NIST Generative AI Public Working
        Group, and NIST staff and guest 

        researchers: Chloe Autio, Jesse Dunietz, Patrick Hall, Shomik Jain,
        Kamie Roberts, Reva Schwartz, Martin 

        Stanley, and Elham Tabassi. 

        NIST Technical Series Policies 

        Copyright, Use, and Licensing Statements 

        NIST Technical Series Publication Identifier Syntax 

        Publication History
  - source_sentence: What are the implications of AI in decision-making processes?
    sentences:
      - >-
        The measures taken to realize the vision set forward in this framework
        should be proportionate 

        with the extent and nature of the harm, or risk of harm, to people's
        rights, opportunities, and 

        access. 

        RELATIONSHIP TO EXISTING LAW AND POLICY

        The Blueprint for an AI Bill of Rights is an exercise in envisioning a
        future where the American public is 

        protected from the potential harms, and can fully enjoy the benefits, of
        automated systems. It describes princi­
      - >-
        state of the science of AI measurement and safety today. This document
        focuses on risks for which there 

        is an existing empirical evidence base at the time this profile was
        written; for example, speculative risks 

        that may potentially arise in more advanced, future GAI systems are not
        considered. Future updates may 

        incorporate additional risks or provide further details on the risks
        identified below.
      - >-
        development of automated systems that adhere to and advance their
        safety, security and 

        effectiveness. Multiple NSF programs support research that directly
        addresses many of these principles: 

        the National AI Research Institutes23 support research on all aspects of
        safe, trustworthy, fair, and explainable 

        AI algorithms and systems; the Cyber Physical Systems24 program supports
        research on developing safe
  - source_sentence: >-
      How are AI systems validated for safety and fairness according to NIST
      standards?
    sentences:
      - >-
        tion and advises on implementation of the DOE AI Strategy and addresses
        issues and/or escalations on the 

        ethical use and development of AI systems.20 The Department of Defense
        has adopted Artificial Intelligence 

        Ethical Principles, and tenets for Responsible Artificial Intelligence
        specifically tailored to its national 

        security and defense activities.21 Similarly, the U.S. Intelligence
        Community (IC) has developed the Principles
      - >-
        GOVERN 1.1: Legal and regulatory requirements involving AI are
        understood, managed, and documented.  

        Action ID 

        Suggested Action 

        GAI Risks 

        GV-1.1-001 Align GAI development and use with applicable laws and
        regulations, including 

        those related to data privacy, copyright and intellectual property law. 

        Data Privacy; Harmful Bias and 

        Homogenization; Intellectual 

        Property 

        AI Actor Tasks: Governance and Oversight
      - >-
        more than a decade, is also helping to fulfill the 2023 Executive Order
        on Safe, Secure, and Trustworthy 

        AI. NIST established the U.S. AI Safety Institute and the companion AI
        Safety Institute Consortium to 

        continue the efforts set in motion by the E.O. to build the science
        necessary for safe, secure, and 

        trustworthy development and use of AI. 

        Acknowledgments: This report was accomplished with the many helpful
        comments and contributions
  - source_sentence: How does the AI Bill of Rights protect individual privacy?
    sentences:
      - >-
        match the statistical properties of real-world data without disclosing
        personally 

        identifiable information or contributing to homogenization. 

        Data Privacy; Intellectual Property; 

        Information Integrity; 

        Confabulation; Harmful Bias and 

        Homogenization 

        AI Actor Tasks: AI Deployment, AI Impact Assessment, Governance and
        Oversight, Operation and Monitoring 
         
        MANAGE 2.3: Procedures are followed to respond to and recover from a
        previously unknown risk when it is identified. 

        Action ID
      - >-
        the principles described in the Blueprint for an AI Bill of Rights may
        be necessary to comply with existing law, 

        conform to the practicalities of a specific use case, or balance
        competing public interests. In particular, law 

        enforcement, and other regulatory contexts may require government actors
        to protect civil rights, civil liberties, 

        and privacy in a manner consistent with, but using alternate mechanisms
        to, the specific principles discussed in
      - >-
        civil rights, civil liberties, and privacy. The Blueprint for an AI Bill
        of Rights includes this Foreword, the five 

        principles, notes on Applying the The Blueprint for an AI Bill of
        Rights, and a Technical Companion that gives 

        concrete steps that can be taken by many kinds of organizations—from
        governments at all levels to companies of 

        all sizes—to uphold these values. Experts from across the private
        sector, governments, and international
  - source_sentence: How does the AI Bill of Rights protect individual privacy?
    sentences:
      - >-
        57 

        National Institute of Standards and Technology (2023) AI Risk Management
        Framework, Appendix B: 

        How AI Risks Differ from Traditional Software Risks. 

        https://airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF/Appendices/Appendix_B 

        National Institute of Standards and Technology (2023) AI RMF Playbook. 

        https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook 

        National Institue of Standards and Technology (2023) Framing Risk
      - >-
        principles for managing information about individuals have been
        incorporated into data privacy laws and 

        policies across the globe.5 The Blueprint for an AI Bill of Rights
        embraces elements of the FIPPs that are 

        particularly relevant to automated systems, without articulating a
        specific set of FIPPs or scoping 

        applicability or the interests served to a single particular domain,
        like privacy, civil rights and civil liberties,
      - >-
        harmful 

        uses. 

        The 

        NIST 

        framework 

        will 

        consider 

        and 

        encompass 

        principles 

        such 

        as 

        transparency, accountability, and fairness during pre-design, design and
        development, deployment, use, 

        and testing and evaluation of AI technologies and systems. It is
        expected to be released in the winter of 2022-23. 

        21
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: val
          type: val
        metrics:
          - type: pearson_cosine
            value: 0.6585006489314952
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.7
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.582665729755017
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.6
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.6722783219807118
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.7
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.6585002582595083
            name: Pearson Dot
          - type: spearman_dot
            value: 0.7
            name: Spearman Dot
          - type: pearson_max
            value: 0.6722783219807118
            name: Pearson Max
          - type: spearman_max
            value: 0.7
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: test
          type: test
        metrics:
          - type: pearson_cosine
            value: 0.7463407966146629
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.7999999999999999
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.7475379067038609
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.7999999999999999
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.7592380598802199
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.7999999999999999
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.7463412670178408
            name: Pearson Dot
          - type: spearman_dot
            value: 0.7999999999999999
            name: Spearman Dot
          - type: pearson_max
            value: 0.7592380598802199
            name: Pearson Max
          - type: spearman_max
            value: 0.7999999999999999
            name: Spearman Max

SentenceTransformer based on Snowflake/snowflake-arctic-embed-m

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
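The architecture above pools with the CLS token and then applies Normalize(), so every output embedding has unit length. One consequence is that the dot product of two embeddings equals their cosine similarity, which is consistent with the near-identical cosine and dot scores in the evaluation section. The following is a minimal sketch of those two steps in plain Python; the 3-dimensional token vectors and the helper names are hypothetical stand-ins, not the library's implementation.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length, as a Normalize step would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical token embeddings (the real model uses 768 dimensions per token).
tokens_a = [[3.0, 4.0, 0.0], [9.0, 9.0, 9.0]]
tokens_b = [[0.0, 5.0, 12.0], [1.0, 1.0, 1.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# After normalization the dot product matches the cosine of the raw CLS vectors.
print(dot(emb_a, emb_b), cosine(tokens_a[0], tokens_b[0]))
```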

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
# Run inference
sentences = [
    'How does the AI Bill of Rights protect individual privacy?',
    'principles for managing information about individuals have been incorporated into data privacy laws and \npolicies across the globe.5 The Blueprint for an AI Bill of Rights embraces elements of the FIPPs that are \nparticularly relevant to automated systems, without articulating a specific set of FIPPs or scoping \napplicability or the interests served to a single particular domain, like privacy, civil rights and civil liberties,',
    'harmful \nuses. \nThe \nNIST \nframework \nwill \nconsider \nand \nencompass \nprinciples \nsuch \nas \ntransparency, accountability, and fairness during pre-design, design and development, deployment, use, \nand testing and evaluation of AI technologies and systems. It is expected to be released in the winter of 2022-23. \n21',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
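
For semantic search, the first row of the similarity matrix can be used to rank the candidate passages against the query. The sketch below assumes the row has already been converted to a plain list (for example with `similarities[0].tolist()`); the scores and short labels are hypothetical placeholders, not output from this model, and `rank_candidates` is an illustrative helper.

```python
def rank_candidates(similarity_row, candidates, skip_index=0):
    """Return (candidate, score) pairs sorted by descending similarity.

    skip_index is the position of the query itself, which is excluded.
    """
    scored = [
        (candidates[i], score)
        for i, score in enumerate(similarity_row)
        if i != skip_index
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity row for the query (index 0) against all three sentences.
row = [1.0, 0.62, 0.41]
labels = ["query", "FIPPs passage", "NIST framework passage"]

ranking = rank_candidates(row, labels)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```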

Evaluation

Metrics

Semantic Similarity (dataset: val)

Metric Value
pearson_cosine 0.6585
spearman_cosine 0.7
pearson_manhattan 0.5827
spearman_manhattan 0.6
pearson_euclidean 0.6723
spearman_euclidean 0.7
pearson_dot 0.6585
spearman_dot 0.7
pearson_max 0.6723
spearman_max 0.7

Semantic Similarity (dataset: test)

Metric Value
pearson_cosine 0.7463
spearman_cosine 0.8
pearson_manhattan 0.7475
spearman_manhattan 0.8
pearson_euclidean 0.7592
spearman_euclidean 0.8
pearson_dot 0.7463
spearman_dot 0.8
pearson_max 0.7592
spearman_max 0.8
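
The metrics above are correlations between predicted similarity scores and gold labels: Pearson measures linear agreement, Spearman measures agreement of the rankings. The sketch below shows one way to compute both in plain Python. The gold labels and predictions are hypothetical, the rank helper assumes no ties, and the `*_max` computation reflects how the reported values appear to be derived (the maximum over the four similarity functions).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Assign ranks 1..n in ascending order (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical gold labels and predicted cosine similarities for five pairs.
gold = [0.53, 0.58, 0.61, 0.70, 0.76]
predicted = [0.40, 0.55, 0.50, 0.80, 0.72]
print(round(spearman(gold, predicted), 4))  # 0.8 for this toy data

# Pearson values from the val table; the reported pearson_max equals their maximum.
val_pearson = {"cosine": 0.6585, "manhattan": 0.5827, "euclidean": 0.6723, "dot": 0.6585}
pearson_max = max(val_pearson.values())
```

With only a handful of evaluation pairs, Spearman can take only a few distinct values, which may explain the round figures (0.6, 0.7, 0.8) reported here.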

Training Details

Training Dataset

Unnamed Dataset

  • Size: 40 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 40 samples:
    column      type    min        mean          max
    sentence_0  string  12 tokens  14.43 tokens  18 tokens
    sentence_1  string  41 tokens  80.55 tokens  117 tokens
    label       float   0.53       0.61          0.76
  • Samples:
    Sample 1
    sentence_0: What should business leaders understand about AI risk management?
    sentence_1: 57
    National Institute of Standards and Technology (2023) AI Risk Management Framework, Appendix B:
    How AI Risks Differ from Traditional Software Risks.
    https://airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF/Appendices/Appendix_B
    National Institute of Standards and Technology (2023) AI RMF Playbook.
    https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
    National Institue of Standards and Technology (2023) Framing Risk
    label: 0.5692041097520776

    Sample 2
    sentence_0: What kind of data protection measures are required under current AI regulations?
    sentence_1: GOVERN 1.1: Legal and regulatory requirements involving AI are understood, managed, and documented.
    Action ID
    Suggested Action
    GAI Risks
    GV-1.1-001 Align GAI development and use with applicable laws and regulations, including
    those related to data privacy, copyright and intellectual property law.
    Data Privacy; Harmful Bias and
    Homogenization; Intellectual
    Property
    AI Actor Tasks: Governance and Oversight
    label: 0.5830958798587019

    Sample 3
    sentence_0: What are the implications of AI in decision-making processes?
    sentence_1: state of the science of AI measurement and safety today. This document focuses on risks for which there
    is an existing empirical evidence base at the time this profile was written; for example, speculative risks
    that may potentially arise in more advanced, future GAI systems are not considered. Future updates may
    incorporate additional risks or provide further details on the risks identified below.
    label: 0.5317174553776045
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
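
With the MSELoss parameter shown above, the training objective is the mean squared error between the cosine similarity of the two sentence embeddings and the float label. The following is a minimal plain-Python sketch of that objective, not the library's code; the 2-dimensional embeddings and the target values are hypothetical, and `cosine_similarity_mse` is an illustrative name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_mse(pairs, labels):
    """Mean squared error between cos(u, v) and the target label over a batch."""
    errors = [
        (cosine_similarity(u, v) - target) ** 2
        for (u, v), target in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)

# Hypothetical embedding pairs standing in for (sentence_0, sentence_1).
batch = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical vectors, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors, cosine = 0.0
]
targets = [0.76, 0.53]  # hypothetical labels in the range seen in the dataset statistics

loss = cosine_similarity_mse(batch, targets)
print(round(loss, 5))  # ((1 - 0.76)^2 + (0 - 0.53)^2) / 2 = 0.16925
```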
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step test_spearman_max val_spearman_max
1.0 3 - 0.6
2.0 6 - 0.7
3.0 9 0.8000 0.7
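
The step counts in the log follow from the dataset size and hyperparameters listed earlier: 40 samples at a batch size of 16 give 3 optimizer steps per epoch, and 3 epochs give 9 steps in total. The sketch below reproduces that arithmetic and adds a simplified view of a linear learning-rate decay with no warmup. It assumes a single device; `linear_lr` is an illustrative helper, not the trainer's scheduler.

```python
import math

train_samples = 40   # Size: 40 training samples
batch_size = 16      # per_device_train_batch_size
epochs = 3           # num_train_epochs
base_lr = 5e-05      # learning_rate

# dataloader_drop_last is False, so the last partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step, total=total_steps, lr=base_lr):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return lr * max(0.0, (total - step) / total)

print(steps_per_epoch, total_steps)  # 3 9
for step in (0, 3, 6, 9):
    print(step, linear_lr(step))
```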

Framework Versions

  • Python: 3.11.9
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.2.2
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}