metadata
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:40
- loss:CosineSimilarityLoss
widget:
- source_sentence: What role does NIST play in establishing AI standards?
sentences:
- >-
provides examples and concrete steps for communities, industry,
governments, and others to take in order to
build these protections into policy, practice, or the technological
design process.
Taken together, the technical protections and practices laid out in the
Blueprint for an AI Bill of Rights can help
guard the American public against many of the potential and actual harms
identified by researchers, technolo
- >-
provides examples and concrete steps for communities, industry,
governments, and others to take in order to
build these protections into policy, practice, or the technological
design process.
Taken together, the technical protections and practices laid out in the
Blueprint for an AI Bill of Rights can help
guard the American public against many of the potential and actual harms
identified by researchers, technolo
- >-
Acknowledgments: This report was accomplished with the many helpful
comments and contributions
from the community, including the NIST Generative AI Public Working
Group, and NIST staff and guest
researchers: Chloe Autio, Jesse Dunietz, Patrick Hall, Shomik Jain,
Kamie Roberts, Reva Schwartz, Martin
Stanley, and Elham Tabassi.
NIST Technical Series Policies
Copyright, Use, and Licensing Statements
NIST Technical Series Publication Identifier Syntax
Publication History
- source_sentence: What are the implications of AI in decision-making processes?
sentences:
- >-
The measures taken to realize the vision set forward in this framework
should be proportionate
with the extent and nature of the harm, or risk of harm, to people's
rights, opportunities, and
access.
RELATIONSHIP TO EXISTING LAW AND POLICY
The Blueprint for an AI Bill of Rights is an exercise in envisioning a
future where the American public is
protected from the potential harms, and can fully enjoy the benefits, of
automated systems. It describes princi
- >-
state of the science of AI measurement and safety today. This document
focuses on risks for which there
is an existing empirical evidence base at the time this profile was
written; for example, speculative risks
that may potentially arise in more advanced, future GAI systems are not
considered. Future updates may
incorporate additional risks or provide further details on the risks
identified below.
- >-
development of automated systems that adhere to and advance their
safety, security and
effectiveness. Multiple NSF programs support research that directly
addresses many of these principles:
the National AI Research Institutes23 support research on all aspects of
safe, trustworthy, fair, and explainable
AI algorithms and systems; the Cyber Physical Systems24 program supports
research on developing safe
- source_sentence: >-
How are AI systems validated for safety and fairness according to NIST
standards?
sentences:
- >-
tion and advises on implementation of the DOE AI Strategy and addresses
issues and/or escalations on the
ethical use and development of AI systems.20 The Department of Defense
has adopted Artificial Intelligence
Ethical Principles, and tenets for Responsible Artificial Intelligence
specifically tailored to its national
security and defense activities.21 Similarly, the U.S. Intelligence
Community (IC) has developed the Principles
- >-
GOVERN 1.1: Legal and regulatory requirements involving AI are
understood, managed, and documented.
Action ID
Suggested Action
GAI Risks
GV-1.1-001 Align GAI development and use with applicable laws and
regulations, including
those related to data privacy, copyright and intellectual property law.
Data Privacy; Harmful Bias and
Homogenization; Intellectual
Property
AI Actor Tasks: Governance and Oversight
- >-
more than a decade, is also helping to fulfill the 2023 Executive Order
on Safe, Secure, and Trustworthy
AI. NIST established the U.S. AI Safety Institute and the companion AI
Safety Institute Consortium to
continue the efforts set in motion by the E.O. to build the science
necessary for safe, secure, and
trustworthy development and use of AI.
Acknowledgments: This report was accomplished with the many helpful
comments and contributions
- source_sentence: How does the AI Bill of Rights protect individual privacy?
sentences:
- >-
match the statistical properties of real-world data without disclosing
personally
identifiable information or contributing to homogenization.
Data Privacy; Intellectual Property;
Information Integrity;
Confabulation; Harmful Bias and
Homogenization
AI Actor Tasks: AI Deployment, AI Impact Assessment, Governance and
Oversight, Operation and Monitoring
MANAGE 2.3: Procedures are followed to respond to and recover from a
previously unknown risk when it is identified.
Action ID
- >-
the principles described in the Blueprint for an AI Bill of Rights may
be necessary to comply with existing law,
conform to the practicalities of a specific use case, or balance
competing public interests. In particular, law
enforcement, and other regulatory contexts may require government actors
to protect civil rights, civil liberties,
and privacy in a manner consistent with, but using alternate mechanisms
to, the specific principles discussed in
- >-
civil rights, civil liberties, and privacy. The Blueprint for an AI Bill
of Rights includes this Foreword, the five
principles, notes on Applying the The Blueprint for an AI Bill of
Rights, and a Technical Companion that gives
concrete steps that can be taken by many kinds of organizations—from
governments at all levels to companies of
all sizes—to uphold these values. Experts from across the private
sector, governments, and international
- source_sentence: How does the AI Bill of Rights protect individual privacy?
sentences:
- >-
57
National Institute of Standards and Technology (2023) AI Risk Management
Framework, Appendix B:
How AI Risks Differ from Traditional Software Risks.
https://airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF/Appendices/Appendix_B
National Institute of Standards and Technology (2023) AI RMF Playbook.
https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
National Institute of Standards and Technology (2023) Framing Risk
- >-
principles for managing information about individuals have been
incorporated into data privacy laws and
policies across the globe.5 The Blueprint for an AI Bill of Rights
embraces elements of the FIPPs that are
particularly relevant to automated systems, without articulating a
specific set of FIPPs or scoping
applicability or the interests served to a single particular domain,
like privacy, civil rights and civil liberties,
- >-
harmful uses. The NIST framework will consider and encompass principles such
as transparency, accountability, and fairness during pre-design, design and
development, deployment, use, and testing and evaluation of AI technologies
and systems. It is expected to be released in the winter of 2022-23.
21
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: val
type: val
metrics:
- type: pearson_cosine
value: 0.6585006489314952
name: Pearson Cosine
- type: spearman_cosine
value: 0.7
name: Spearman Cosine
- type: pearson_manhattan
value: 0.582665729755017
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.6
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.6722783219807118
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.7
name: Spearman Euclidean
- type: pearson_dot
value: 0.6585002582595083
name: Pearson Dot
- type: spearman_dot
value: 0.7
name: Spearman Dot
- type: pearson_max
value: 0.6722783219807118
name: Pearson Max
- type: spearman_max
value: 0.7
name: Spearman Max
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: test
type: test
metrics:
- type: pearson_cosine
value: 0.7463407966146629
name: Pearson Cosine
- type: spearman_cosine
value: 0.7999999999999999
name: Spearman Cosine
- type: pearson_manhattan
value: 0.7475379067038609
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.7999999999999999
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.7592380598802199
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.7999999999999999
name: Spearman Euclidean
- type: pearson_dot
value: 0.7463412670178408
name: Pearson Dot
- type: spearman_dot
value: 0.7999999999999999
name: Spearman Dot
- type: pearson_max
value: 0.7592380598802199
name: Pearson Max
- type: spearman_max
value: 0.7999999999999999
name: Spearman Max
SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
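Because the final Normalize() module returns unit-length embeddings, dot-product and cosine similarity yield the same scores, which is why the pearson_dot and pearson_cosine figures in the evaluation tables are nearly identical. A quick sketch of that check (the query sentence is just an example from this card):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
emb = model.encode(["How does the AI Bill of Rights protect individual privacy?"])
print(np.linalg.norm(emb[0]))  # expected to be ~1.0 because of the Normalize() module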
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
# Run inference
sentences = [
'How does the AI Bill of Rights protect individual privacy?',
'principles for managing information about individuals have been incorporated into data privacy laws and \npolicies across the globe.5 The Blueprint for an AI Bill of Rights embraces elements of the FIPPs that are \nparticularly relevant to automated systems, without articulating a specific set of FIPPs or scoping \napplicability or the interests served to a single particular domain, like privacy, civil rights and civil liberties,',
'harmful \nuses. \nThe \nNIST \nframework \nwill \nconsider \nand \nencompass \nprinciples \nsuch \nas \ntransparency, accountability, and fairness during pre-design, design and development, deployment, use, \nand testing and evaluation of AI technologies and systems. It is expected to be released in the winter of 2022-23. \n21',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
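Beyond pairwise scoring, the same API supports a simple semantic-search pattern. This sketch uses an illustrative query and a two-passage corpus drawn from the examples above, not a real index:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
query = "What role does NIST play in establishing AI standards?"
corpus = [
    "NIST established the U.S. AI Safety Institute and the companion AI Safety Institute Consortium ...",
    "The Blueprint for an AI Bill of Rights includes this Foreword, the five principles ...",
]
query_emb = model.encode([query])
corpus_emb = model.encode(corpus)
scores = model.similarity(query_emb, corpus_emb)  # shape [1, len(corpus)]
best = scores[0].argmax().item()
print(corpus[best], scores[0][best].item())  # highest-scoring passage and its similarity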
Evaluation
Metrics
Semantic Similarity
- Dataset: val
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.6585 |
spearman_cosine | 0.7 |
pearson_manhattan | 0.5827 |
spearman_manhattan | 0.6 |
pearson_euclidean | 0.6723 |
spearman_euclidean | 0.7 |
pearson_dot | 0.6585 |
spearman_dot | 0.7 |
pearson_max | 0.6723 |
spearman_max | 0.7 |
Semantic Similarity
- Dataset: test
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.7463 |
spearman_cosine | 0.8 |
pearson_manhattan | 0.7475 |
spearman_manhattan | 0.8 |
pearson_euclidean | 0.7592 |
spearman_euclidean | 0.8 |
pearson_dot | 0.7463 |
spearman_dot | 0.8 |
pearson_max | 0.7592 |
spearman_max | 0.8 |
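The scores above come from sentence-transformers' EmbeddingSimilarityEvaluator, which embeds each sentence pair and correlates the predicted similarities with gold labels. A hedged sketch of running it yourself; the two pairs and gold scores below are placeholders, not the actual val or test split:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[
        "How does the AI Bill of Rights protect individual privacy?",
        "What role does NIST play in establishing AI standards?",
    ],
    sentences2=[
        "principles for managing information about individuals have been incorporated into data privacy laws ...",
        "NIST established the U.S. AI Safety Institute and the companion AI Safety Institute Consortium ...",
    ],
    scores=[0.7, 0.6],  # placeholder gold similarity labels in [0, 1]
    name="val",
)
print(evaluator(model))  # dict of pearson/spearman values for cosine, dot, manhattan, euclidean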
Training Details
Training Dataset
Unnamed Dataset
- Size: 40 training samples
- Columns: sentence_0, sentence_1, and label
- Approximate statistics based on the first 40 samples:
 | sentence_0 | sentence_1 | label |
---|---|---|---|
type | string | string | float |
details | min: 12 tokens, mean: 14.43 tokens, max: 18 tokens | min: 41 tokens, mean: 80.55 tokens, max: 117 tokens | min: 0.53, mean: 0.61, max: 0.76 |
- Samples:
sentence_0 | sentence_1 | label |
---|---|---|---|
What should business leaders understand about AI risk management? | 57 National Institute of Standards and Technology (2023) AI Risk Management Framework, Appendix B: How AI Risks Differ from Traditional Software Risks. https://airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF/Appendices/Appendix_B National Institute of Standards and Technology (2023) AI RMF Playbook. https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook National Institute of Standards and Technology (2023) Framing Risk | 0.5692041097520776 |
What kind of data protection measures are required under current AI regulations? | GOVERN 1.1: Legal and regulatory requirements involving AI are understood, managed, and documented. Action ID Suggested Action GAI Risks GV-1.1-001 Align GAI development and use with applicable laws and regulations, including those related to data privacy, copyright and intellectual property law. Data Privacy; Harmful Bias and Homogenization; Intellectual Property AI Actor Tasks: Governance and Oversight | 0.5830958798587019 |
What are the implications of AI in decision-making processes? | state of the science of AI measurement and safety today. This document focuses on risks for which there is an existing empirical evidence base at the time this profile was written; for example, speculative risks that may potentially arise in more advanced, future GAI systems are not considered. Future updates may incorporate additional risks or provide further details on the risks identified below. | 0.5317174553776045 |
- Loss: CosineSimilarityLoss with these parameters:
  { "loss_fct": "torch.nn.modules.loss.MSELoss" }
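CosineSimilarityLoss embeds sentence_0 and sentence_1, computes the cosine similarity of the two embeddings, and regresses it onto the float label with the MSELoss shown above. A minimal sketch with an illustrative pair and label (not the actual training data):
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")
loss = CosineSimilarityLoss(model, loss_fct=torch.nn.MSELoss())  # matches the loss_fct above

# What the loss computes, spelled out manually for one pair:
u = model.encode("What are the implications of AI in decision-making processes?", convert_to_tensor=True)
v = model.encode("state of the science of AI measurement and safety today.", convert_to_tensor=True)
label = torch.tensor(0.53)  # placeholder gold similarity
cos = torch.nn.functional.cosine_similarity(u, v, dim=0)
print(torch.nn.functional.mse_loss(cos, label))  # per-pair CosineSimilarityLoss value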
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- multi_dataset_batch_sampler: round_robin
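For readers who want to reproduce a comparable run, here is a hedged sketch that wires the non-default hyperparameters above into SentenceTransformerTrainingArguments and SentenceTransformerTrainer. The output_dir and the tiny inline dataset are placeholders, not the original 40-pair training data:
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")
# Placeholder (sentence_0, sentence_1, label) data in the same column layout as above.
pairs = Dataset.from_dict({
    "sentence_0": ["What are the implications of AI in decision-making processes?"],
    "sentence_1": ["state of the science of AI measurement and safety today."],
    "label": [0.53],
})
args = SentenceTransformerTrainingArguments(
    output_dir="output",                        # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    multi_dataset_batch_sampler="round_robin",
    num_train_epochs=3,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    eval_dataset=pairs,                         # placeholder eval split
    loss=CosineSimilarityLoss(model),
)
trainer.train()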
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | test_spearman_max | val_spearman_max |
---|---|---|---|
1.0 | 3 | - | 0.6 |
2.0 | 6 | - | 0.7 |
3.0 | 9 | 0.8000 | 0.7 |
Framework Versions
- Python: 3.11.9
- Sentence Transformers: 3.1.1
- Transformers: 4.44.2
- PyTorch: 2.2.2
- Accelerate: 0.34.2
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}