spacemanidol committed on
Commit
026a71c
1 Parent(s): c7c6836

Update README.md

Files changed (1)
  1. README.md +25 -25
README.md CHANGED
@@ -7,7 +7,7 @@ tags:
7
  - sentence-similarity
8
  - mteb
9
  - arctic
10
- - arctic-embed
11
  model-index:
12
  - name: snowflake-arctic-m-long
13
  results:
@@ -2823,16 +2823,16 @@ model-index:
2823
  ## News
2824
 
2825
 
2826
- 04/16/2024: Release the ** Arctic-embed ** family of text embedding models. The releases are state-of-the-art for Retrieval quality at each of their representative size profiles. [Technical Report]() is coming shortly. For more details, please refer to our Github: [Arctic-Text-Embed](https://github.com/Snowflake-Labs/arctic-embed).
2827
 
2828
 
2829
  ## Models
2830
 
2831
 
2832
- Arctic-Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2833
 
2834
 
2835
- The `arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
2836
 
2837
 
2838
  The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch; pretraining leverages about 400m samples of a mix of public datasets and proprietary web search data. Following pretraining, models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
@@ -2840,26 +2840,26 @@ The models are trained by leveraging existing open-source text representation mo
2840
 
2841
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2842
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2843
- | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 | 22 | 384 |
2844
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 | 33 | 384 |
2845
- | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 | 110 | 768 |
2846
- | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 | 137 | 768 |
2847
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 | 335 | 1024 |
2848
 
2849
 
2850
- Aside from being great open-source models, the largest model, [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/), can serve as a natural replacement for closed-source embedding, as shown below.
2851
 
2852
 
2853
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2854
  | ------------------------------------------------------------------ | -------------------------------- |
2855
- | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2856
  | Google-gecko-text-embedding | 55.7 |
2857
  | text-embedding-3-large | 55.44 |
2858
  | Cohere-embed-english-v3.0 | 55.00 |
2859
  | bge-large-en-v1.5 | 54.29 |
2860
 
2861
 
2862
- ### [Arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs)
2863
 
2864
 
2865
  This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
@@ -2867,14 +2867,14 @@ This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://h
2867
 
2868
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2869
  | ------------------------------------------------------------------- | -------------------------------- |
2870
- | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 |
2871
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2872
  | gte-tiny | 44.92 |
2873
  | all-MiniLM-L6-v2 | 41.95 |
2874
  | bge-micro-v2 | 42.56 |
2875
 
2876
 
2877
- ### [Arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s)
2878
 
2879
 
2880
  Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-small-unsupervised) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
@@ -2882,14 +2882,14 @@ Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5
2882
 
2883
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2884
  | ------------------------------------------------------------------ | -------------------------------- |
2885
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 |
2886
  | bge-small-en-v1.5 | 51.68 |
2887
  | Cohere-embed-english-light-v3.0 | 51.34 |
2888
  | text-embedding-3-small | 51.08 |
2889
  | e5-small-v2 | 49.04 |
2890
 
2891
 
2892
- ### [Arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/)
2893
 
2894
 
2895
  Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
@@ -2897,13 +2897,13 @@ Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-
2897
 
2898
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2899
  | ------------------------------------------------------------------ | -------------------------------- |
2900
- | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
2901
  | bge-base-en-v1.5 | 53.25 |
2902
  | nomic-embed-text-v1.5 | 53.25 |
2903
  | GIST-Embedding-v0 | 52.31 |
2904
  | gte-base | 52.31 |
2905
 
2906
- ### [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/)
2907
 
2908
 
2909
  Based on the [nomic-ai/nomic-embed-text-v1-unsupervised](https://huggingface.co/nomic-ai/nomic-embed-text-v1-unsupervised) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
@@ -2911,14 +2911,14 @@ Based on the [nomic-ai/nomic-embed-text-v1-unsupervised](https://huggingface.co/
2911
 
2912
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2913
  | ------------------------------------------------------------------ | -------------------------------- |
2914
- | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 |
2915
  | nomic-embed-text-v1.5 | 53.01 |
2916
  | nomic-embed-text-v1 | 52.81 |
2917
 
2918
 
2919
 
2920
 
2921
- ### [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/)
2922
 
2923
 
2924
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this small model does not sacrifice retrieval accuracy for its small size.
@@ -2926,7 +2926,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2926
 
2927
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2928
  | ------------------------------------------------------------------ | -------------------------------- |
2929
- | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2930
  | UAE-Large-V1 | 54.66 |
2931
  | bge-large-en-v1.5 | 54.29 |
2932
  | mxbai-embed-large-v1 | 54.39 |
@@ -2939,7 +2939,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2939
  ### Using Huggingface transformers
2940
 
2941
 
2942
- You can use the transformers package to use an arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
2943
 
2944
 
2945
 
@@ -2947,8 +2947,8 @@ You can use the transformers package to use an arctic-embed model, as shown belo
2947
  import torch
2948
  from transformers import AutoModel, AutoTokenizer
2949
 
2950
- tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-m-long')
2951
- model = AutoModel.from_pretrained('Snowflake/arctic-embed-m-long', add_pooling_layer=False)
2952
  model.eval()
2953
 
2954
  query_prefix = 'Represent this sentence for searching relevant passages: '
@@ -2984,7 +2984,7 @@ If you use the long context model with more than 2048 tokens, ensure that you in
2984
 
2985
 
2986
  ``` py
2987
- model = AutoModel.from_pretrained('Snowflake/arctic-embed-m-long', trust_remote_code=True, rotary_scaling_factor=2)
2988
  ```
2989
 
2990
 
 
7
  - sentence-similarity
8
  - mteb
9
  - arctic
10
+ - snowflake-arctic-embed
11
  model-index:
12
  - name: snowflake-arctic-m-long
13
  results:
 
2823
  ## News
2824
 
2825
 
2826
+ 04/16/2024: Release the ** snowflake-arctic-embed ** family of text embedding models. The releases are state-of-the-art for Retrieval quality at each of their representative size profiles. [Technical Report]() is coming shortly. For more details, please refer to our Github: [Arctic-Text-Embed](https://github.com/Snowflake-Labs/snowflake-arctic-embed).
2827
 
2828
 
2829
  ## Models
2830
 
2831
 
2832
+ snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2833
 
2834
 
2835
+ The `snowflake-arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/snowflake-arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
2836
 
2837
 
2838
  The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch; pretraining leverages about 400m samples of a mix of public datasets and proprietary web search data. Following pretraining, models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
 
2840
 
2841
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2842
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2843
+ | [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/) | 50.15 | 22 | 384 |
2844
+ | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s/) | 51.98 | 33 | 384 |
2845
+ | [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/) | 54.90 | 110 | 768 |
2846
+ | [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/) | 54.83 | 137 | 768 |
2847
+ | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 | 335 | 1024 |
2848
 
2849
 
2850
+ Aside from being great open-source models, the largest model, [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/), can serve as a natural replacement for closed-source embedding, as shown below.
2851
 
2852
 
2853
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2854
  | ------------------------------------------------------------------ | -------------------------------- |
2855
+ | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 |
2856
  | Google-gecko-text-embedding | 55.7 |
2857
  | text-embedding-3-large | 55.44 |
2858
  | Cohere-embed-english-v3.0 | 55.00 |
2859
  | bge-large-en-v1.5 | 54.29 |
2860
 
2861
 
2862
+ ### [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
2863
 
2864
 
2865
  This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
 
2867
 
2868
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2869
  | ------------------------------------------------------------------- | -------------------------------- |
2870
+ | [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/) | 50.15 |
2871
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2872
  | gte-tiny | 44.92 |
2873
  | all-MiniLM-L6-v2 | 41.95 |
2874
  | bge-micro-v2 | 42.56 |
2875
 
2876
 
2877
+ ### [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s)
2878
 
2879
 
2880
  Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-small-unsupervised) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
 
2882
 
2883
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2884
  | ------------------------------------------------------------------ | -------------------------------- |
2885
+ | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s/) | 51.98 |
2886
  | bge-small-en-v1.5 | 51.68 |
2887
  | Cohere-embed-english-light-v3.0 | 51.34 |
2888
  | text-embedding-3-small | 51.08 |
2889
  | e5-small-v2 | 49.04 |
2890
 
2891
 
2892
+ ### [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/)
2893
 
2894
 
2895
  Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
 
2897
 
2898
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2899
  | ------------------------------------------------------------------ | -------------------------------- |
2900
+ | [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/) | 54.90 |
2901
  | bge-base-en-v1.5 | 53.25 |
2902
  | nomic-embed-text-v1.5 | 53.25 |
2903
  | GIST-Embedding-v0 | 52.31 |
2904
  | gte-base | 52.31 |
2905
 
2906
+ ### [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/)
2907
 
2908
 
2909
  Based on the [nomic-ai/nomic-embed-text-v1-unsupervised](https://huggingface.co/nomic-ai/nomic-embed-text-v1-unsupervised) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
 
2911
 
2912
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2913
  | ------------------------------------------------------------------ | -------------------------------- |
2914
+ | [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/) | 54.83 |
2915
  | nomic-embed-text-v1.5 | 53.01 |
2916
  | nomic-embed-text-v1 | 52.81 |
2917
 
2918
 
2919
 
2920
 
2921
+ ### [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/)
2922
 
2923
 
2924
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this small model does not sacrifice retrieval accuracy for its small size.
 
2926
 
2927
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2928
  | ------------------------------------------------------------------ | -------------------------------- |
2929
+ | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 |
2930
  | UAE-Large-V1 | 54.66 |
2931
  | bge-large-en-v1.5 | 54.29 |
2932
  | mxbai-embed-large-v1 | 54.39 |
 
2939
  ### Using Huggingface transformers
2940
 
2941
 
2942
+ You can use the transformers package to use an snowflake-arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
2943
 
2944
 
2945
 
 
2947
  import torch
2948
  from transformers import AutoModel, AutoTokenizer
2949
 
2950
+ tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-m-long')
2951
+ model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-long', add_pooling_layer=False)
2952
  model.eval()
2953
 
2954
  query_prefix = 'Represent this sentence for searching relevant passages: '
 
2984
 
2985
 
2986
  ``` py
2987
+ model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-long', trust_remote_code=True, rotary_scaling_factor=2)
2988
  ```
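
The usage note in this README asks for two things: embed each text with the CLS token, and add the query prefix to queries only. The hunk above cuts off before the rest of the original snippet, so the following is only a minimal pure-Python sketch of that post-processing. The helper names (`prepare_inputs`, `cls_pool`, `rank`), the toy vectors, and the choice of cosine similarity for scoring are assumptions made for illustration, not the model's actual API.

```python
import math

# Query prefix as it appears in the README snippet.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_inputs(queries, documents):
    """Prefix queries only; documents are passed through unchanged."""
    return [QUERY_PREFIX + q for q in queries], list(documents)


def cls_pool(token_states):
    """Use the first token's hidden state (the CLS position) as the embedding.

    `token_states` is a hypothetical list of per-token vectors for one text.
    """
    return token_states[0]


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank(query_emb, doc_embs):
    """Return document indices sorted by cosine similarity, plus the scores."""
    q = l2_normalize(query_emb)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(d)))
        for d in doc_embs
    ]
    order = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy usage with made-up 2-dimensional "hidden states".
queries, docs = prepare_inputs(["what is snowflake?"], ["doc one", "doc two"])
query_emb = cls_pool([[1.0, 0.0], [0.3, 0.7]])
doc_embs = [cls_pool([[0.0, 1.0]]), cls_pool([[2.0, 0.0]])]
order, scores = rank(query_emb, doc_embs)
```

With a real model, the per-token vectors would come from the model's last hidden state, and the same selection of position 0 would apply to both queries and documents.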
2989
 
2990
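
The tables in this README report "MTEB Retrieval Score (NDCG @ 10)". As a reading aid, here is a minimal sketch of NDCG@10 for a single query using the common linear-gain, log2-discount definition. The function names and toy judgments are hypothetical; the linked evaluation scripts remain the authority on how the reported numbers were produced, and the table values appear to be this kind of score averaged over datasets and expressed on a 0 to 100 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        rel / math.log2(position + 2)  # position 0 is rank 1, so discount is log2(rank + 1)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevance_by_doc: judged relevance grade per document id (unjudged = 0).
    """
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy usage: the only relevant document is returned at rank 2.
score = ndcg_at_k(["x", "a"], {"a": 1})
```

A perfect ranking gives 1.0, and pushing a relevant document further down the list lowers the score through the logarithmic discount.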