yaniseuranova committed
Commit
2765549
1 Parent(s): b80fb90

Add SetFit model

1_Pooling/config.json CHANGED
@@ -1,7 +1,7 @@
 {
-  "word_embedding_dimension": 768,
-  "pooling_mode_cls_token": false,
-  "pooling_mode_mean_tokens": true,
+  "word_embedding_dimension": 1024,
+  "pooling_mode_cls_token": true,
+  "pooling_mode_mean_tokens": false,
   "pooling_mode_max_tokens": false,
   "pooling_mode_mean_sqrt_len_tokens": false,
   "pooling_mode_weightedmean_tokens": false,
README.md CHANGED
@@ -1,31 +1,26 @@
1
  ---
 
2
  library_name: setfit
 
 
 
3
  tags:
4
  - setfit
5
  - sentence-transformers
6
  - text-classification
7
  - generated_from_setfit_trainer
8
- base_model: sentence-transformers/all-mpnet-base-v2
9
  widget:
10
- - text: >-
11
- How do companies balance individual creativity with team collaboration to
12
- drive innovation in the work place?
13
- - text: >-
14
- How do the values of a learning organization impact its ability to innovate
15
- and respond to constant change?
16
- - text: >-
17
- What is the primary function of the Domain Name System (DNS) layer in the
18
- Internet Protocol Stack, as defined by ICANN?
19
- - text: >-
20
- What distinguishes a transforming industry from one that merely innovates to
21
- existing practices?
22
- - text: >-
23
- How can artificial intelligence systems balance individual autonomy with
24
- collective responsibility in decision-making processes?
25
- pipeline_tag: text-classification
26
  inference: true
27
  model-index:
28
- - name: SetFit RAG query classificator for hybrid search query routing
29
  results:
30
  - task:
31
  type: text-classification
@@ -36,21 +31,13 @@ model-index:
36
  split: test
37
  metrics:
38
  - type: accuracy
39
- value: 1
40
  name: Accuracy
41
- language:
42
- - en
43
  ---
44
 
45
- # Fast Query Routing for RAG Hybrid Search Using Setfit Tuned Embedding Model.
46
-
47
- The goal of this model is to classify users queries in a RAG pipeline between two classes 'semantic' and 'lexical'. This allow an easy query routing in the context of hybrid search
48
- and alpha tuning for hybrid search. A query is considered 'semantic' if it doesn't contain any particular jargon, proper noun, technical terms, ect.. on the other hand it is considered lexical
49
- if there are precise keywords than can be used to make a lexical search (BM25 for example).
50
-
51
- The model is very small and fast, thus enabling a very cost-effective approach for query routing comparing to use large LLMs such as GPT4 for query routing !
52
 
53
- The model was trained using the [SetFit](https://github.com/huggingface/setfit) method that allows Text Classification model finetuning with a reduced number of human annotated training examples. This SetFit model uses [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
54
 
55
  The model has been trained using an efficient few-shot learning technique that involves:
56
 
@@ -61,9 +48,9 @@ The model has been trained using an efficient few-shot learning technique that i
61
 
62
  ### Model Description
63
  - **Model Type:** SetFit
64
- - **Sentence Transformer body:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
65
  - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
66
- - **Maximum Sequence Length:** 384 tokens
67
  - **Number of Classes:** 2 classes
68
  <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
69
  <!-- - **Language:** Unknown -->
@@ -76,17 +63,17 @@ The model has been trained using an efficient few-shot learning technique that i
76
  - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
77
 
78
  ### Model Labels
79
- | Label | Examples |
80
- |:---------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
81
- | lexical | <ul><li>'What is the primary function of the Apache Kafka distributed streaming platform in Big Data processing?'</li><li>"What is the primary difference between Hadoop's FileSystem-based architecture and Apache Cassandra's distributed, masterlessArchitecture in scale-out design?"</li><li>'What is the main difference between optimistic concurrency control and pessimistic concurrency control in database management systems?'</li></ul> |
82
- | semantic | <ul><li>"How does organizational morale impact the competitiveness of a company in today's fast-paced market?"</li><li>'How do organizations balance individual creativity with collective goal achievement in a dynamic environment?'</li><li>'What is a key challenge faced by managers in sustaining a work culture that encourages creativity, innovation, and critical thinking within the technological industry globally?'</li></ul> |
83
 
84
  ## Evaluation
85
 
86
  ### Metrics
87
  | Label | Accuracy |
88
  |:--------|:---------|
89
- | **all** | 1.0 |
90
 
91
  ## Uses
92
 
@@ -104,9 +91,9 @@ Then you can load this model and run inference.
104
  from setfit import SetFitModel
105
 
106
  # Download from the 🤗 Hub
107
- model = SetFitModel.from_pretrained("yaniseuranova/setfit-paraphrase-mpnet-base-v2-sst2")
108
  # Run inference
109
- preds = model("What distinguishes a transforming industry from one that merely innovates to existing practices?")
110
  ```
111
 
112
  <!--
@@ -138,12 +125,12 @@ preds = model("What distinguishes a transforming industry from one that merely i
138
  ### Training Set Metrics
139
  | Training set | Min | Median | Max |
140
  |:-------------|:----|:--------|:----|
141
- | Word count | 4 | 19.1839 | 42 |
142
 
143
  | Label | Training Sample Count |
144
  |:---------|:----------------------|
145
- | lexical | 43 |
146
- | semantic | 44 |
147
 
148
  ### Training Hyperparameters
149
  - batch_size: (8, 8)
@@ -165,39 +152,130 @@ preds = model("What distinguishes a transforming industry from one that merely i
165
  ### Training Results
166
  | Epoch | Step | Training Loss | Validation Loss |
167
  |:-------:|:--------:|:-------------:|:---------------:|
168
- | 0.0021 | 1 | 0.301 | - |
169
- | 0.1033 | 50 | 0.1244 | - |
170
- | 0.2066 | 100 | 0.0021 | - |
171
- | 0.3099 | 150 | 0.0006 | - |
172
- | 0.4132 | 200 | 0.0002 | - |
173
- | 0.5165 | 250 | 0.0002 | - |
174
- | 0.6198 | 300 | 0.0001 | - |
175
- | 0.7231 | 350 | 0.0001 | - |
176
- | 0.8264 | 400 | 0.0001 | - |
177
- | 0.9298 | 450 | 0.0001 | - |
178
- | 1.0 | 484 | - | 0.0001 |
179
- | 1.0331 | 500 | 0.0001 | - |
180
- | 1.1364 | 550 | 0.0001 | - |
181
- | 1.2397 | 600 | 0.0001 | - |
182
- | 1.3430 | 650 | 0.0 | - |
183
- | 1.4463 | 700 | 0.0001 | - |
184
- | 1.5496 | 750 | 0.0001 | - |
185
- | 1.6529 | 800 | 0.0001 | - |
186
- | 1.7562 | 850 | 0.0001 | - |
187
- | 1.8595 | 900 | 0.0 | - |
188
- | 1.9628 | 950 | 0.0 | - |
189
- | 2.0 | 968 | - | 0.0001 |
190
- | 2.0661 | 1000 | 0.0001 | - |
191
- | 2.1694 | 1050 | 0.0001 | - |
192
- | 2.2727 | 1100 | 0.0 | - |
193
- | 2.3760 | 1150 | 0.0 | - |
194
- | 2.4793 | 1200 | 0.0 | - |
195
- | 2.5826 | 1250 | 0.0 | - |
196
- | 2.6860 | 1300 | 0.0001 | - |
197
- | 2.7893 | 1350 | 0.0 | - |
198
- | 2.8926 | 1400 | 0.0001 | - |
199
- | 2.9959 | 1450 | 0.0 | - |
200
- | **3.0** | **1452** | **-** | **0.0001** |
201
 
202
  * The bold row denotes the saved checkpoint.
203
  ### Framework Versions
 
1
  ---
2
+ base_model: BAAI/bge-m3
3
  library_name: setfit
4
+ metrics:
5
+ - accuracy
6
+ pipeline_tag: text-classification
7
  tags:
8
  - setfit
9
  - sentence-transformers
10
  - text-classification
11
  - generated_from_setfit_trainer
 
12
  widget:
13
+ - text: How does technology impact our daily lives and what benefits can it bring
14
+ to various activities?
15
+ - text: How do organizations effectively deploy and manage machine learning algorithms
16
+ to drive business value?
17
+ - text: What are the key considerations for organizing and managing computer lab resources
18
+ and tracking their status?
19
+ - text: How can batch processing improve the efficiency of data lake operations?
20
+ - text: What is the purpose of setting up a CUPS on a server?
21
  inference: true
22
  model-index:
23
+ - name: SetFit with BAAI/bge-m3
24
  results:
25
  - task:
26
  type: text-classification
 
31
  split: test
32
  metrics:
33
  - type: accuracy
34
+ value: 0.8947368421052632
35
  name: Accuracy
 
 
36
  ---
37
 
38
+ # SetFit with BAAI/bge-m3
39
 
40
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
41
 
42
  The model has been trained using an efficient few-shot learning technique that involves:
43
 
 
48
 
49
  ### Model Description
50
  - **Model Type:** SetFit
51
+ - **Sentence Transformer body:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
52
  - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
53
+ - **Maximum Sequence Length:** 8192 tokens
54
  - **Number of Classes:** 2 classes
55
  <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
56
  <!-- - **Language:** Unknown -->
 
63
  - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
64
 
65
  ### Model Labels
66
+ | Label | Examples |
67
+ |:---------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
68
+ | lexical | <ul><li>"How does Happeo's search AI work to provide answers to user queries?"</li><li>'What are the primary areas of focus in the domain of Data Science and Analysis?'</li><li>'How can one organize a running event in Belgium?'</li></ul> |
69
+ | semantic | <ul><li>'What changes can be made to a channel header?'</li><li>'How can hardware capabilities impact the accuracy of motion and object detections?'</li><li>'Who is responsible for managing guarantees and prolongations?'</li></ul> |
70
 
71
  ## Evaluation
72
 
73
  ### Metrics
74
  | Label | Accuracy |
75
  |:--------|:---------|
76
+ | **all** | 0.8947 |
77
 
78
  ## Uses
79
 
 
91
  from setfit import SetFitModel
92
 
93
  # Download from the 🤗 Hub
94
+ model = SetFitModel.from_pretrained("yaniseuranova/setfit-rag-hybrid-search-query-router")
95
  # Run inference
96
+ preds = model("What is the purpose of setting up a CUPS on a server?")
97
  ```
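As a rough illustration of how the predicted label might drive hybrid search routing, the sketch below maps a label to a blending weight. The `alpha` values, the convention that 0.0 means pure keyword search and 1.0 means pure vector search, and the helper name `alpha_for` are all hypothetical; the sketch also assumes the prediction is available as one of the label strings `lexical` or `semantic`.

```python
# Hypothetical blending weights: 0.0 = keyword-only (BM25-style), 1.0 = vector-only.
ALPHA_BY_LABEL = {"lexical": 0.25, "semantic": 0.75}


def alpha_for(label: str, default: float = 0.5) -> float:
    """Pick a hybrid-search weight for a predicted label, falling back to an even blend."""
    return ALPHA_BY_LABEL.get(label, default)


# Example: a query classified as 'lexical' leans towards keyword search.
lexical_alpha = alpha_for("lexical")
semantic_alpha = alpha_for("semantic")
unknown_alpha = alpha_for("something-else")
```

The weights would need tuning against a real retrieval benchmark; the values here only show the shape of the routing step.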
98
 
99
  <!--
 
125
  ### Training Set Metrics
126
  | Training set | Min | Median | Max |
127
  |:-------------|:----|:--------|:----|
128
+ | Word count | 4 | 13.7407 | 28 |
129
 
130
  | Label | Training Sample Count |
131
  |:---------|:----------------------|
132
+ | lexical | 44 |
133
+ | semantic | 118 |
134
 
135
  ### Training Hyperparameters
136
  - batch_size: (8, 8)
 
152
  ### Training Results
153
  | Epoch | Step | Training Loss | Validation Loss |
154
  |:-------:|:--------:|:-------------:|:---------------:|
155
+ | 0.0005 | 1 | 0.257 | - |
156
+ | 0.0250 | 50 | 0.1944 | - |
157
+ | 0.0499 | 100 | 0.2383 | - |
158
+ | 0.0749 | 150 | 0.1279 | - |
159
+ | 0.0999 | 200 | 0.0033 | - |
160
+ | 0.1248 | 250 | 0.0021 | - |
161
+ | 0.1498 | 300 | 0.0012 | - |
162
+ | 0.1747 | 350 | 0.0008 | - |
163
+ | 0.1997 | 400 | 0.0004 | - |
164
+ | 0.2247 | 450 | 0.0006 | - |
165
+ | 0.2496 | 500 | 0.0005 | - |
166
+ | 0.2746 | 550 | 0.0003 | - |
167
+ | 0.2996 | 600 | 0.0003 | - |
168
+ | 0.3245 | 650 | 0.0003 | - |
169
+ | 0.3495 | 700 | 0.0004 | - |
170
+ | 0.3744 | 750 | 0.0005 | - |
171
+ | 0.3994 | 800 | 0.0003 | - |
172
+ | 0.4244 | 850 | 0.0002 | - |
173
+ | 0.4493 | 900 | 0.0002 | - |
174
+ | 0.4743 | 950 | 0.0002 | - |
175
+ | 0.4993 | 1000 | 0.0001 | - |
176
+ | 0.5242 | 1050 | 0.0001 | - |
177
+ | 0.5492 | 1100 | 0.0001 | - |
178
+ | 0.5741 | 1150 | 0.0002 | - |
179
+ | 0.5991 | 1200 | 0.0001 | - |
180
+ | 0.6241 | 1250 | 0.0003 | - |
181
+ | 0.6490 | 1300 | 0.0002 | - |
182
+ | 0.6740 | 1350 | 0.0001 | - |
183
+ | 0.6990 | 1400 | 0.0003 | - |
184
+ | 0.7239 | 1450 | 0.0001 | - |
185
+ | 0.7489 | 1500 | 0.0002 | - |
186
+ | 0.7738 | 1550 | 0.0001 | - |
187
+ | 0.7988 | 1600 | 0.0002 | - |
188
+ | 0.8238 | 1650 | 0.0002 | - |
189
+ | 0.8487 | 1700 | 0.0002 | - |
190
+ | 0.8737 | 1750 | 0.0002 | - |
191
+ | 0.8987 | 1800 | 0.0003 | - |
192
+ | 0.9236 | 1850 | 0.0001 | - |
193
+ | 0.9486 | 1900 | 0.0001 | - |
194
+ | 0.9735 | 1950 | 0.0001 | - |
195
+ | 0.9985 | 2000 | 0.0001 | - |
196
+ | **1.0** | **2003** | **-** | **0.1735** |
197
+ | 1.0235 | 2050 | 0.0001 | - |
198
+ | 1.0484 | 2100 | 0.0001 | - |
199
+ | 1.0734 | 2150 | 0.0001 | - |
200
+ | 1.0984 | 2200 | 0.0 | - |
201
+ | 1.1233 | 2250 | 0.0001 | - |
202
+ | 1.1483 | 2300 | 0.0001 | - |
203
+ | 1.1732 | 2350 | 0.0001 | - |
204
+ | 1.1982 | 2400 | 0.0002 | - |
205
+ | 1.2232 | 2450 | 0.0001 | - |
206
+ | 1.2481 | 2500 | 0.0 | - |
207
+ | 1.2731 | 2550 | 0.0001 | - |
208
+ | 1.2981 | 2600 | 0.0001 | - |
209
+ | 1.3230 | 2650 | 0.0 | - |
210
+ | 1.3480 | 2700 | 0.0001 | - |
211
+ | 1.3729 | 2750 | 0.0001 | - |
212
+ | 1.3979 | 2800 | 0.0001 | - |
213
+ | 1.4229 | 2850 | 0.0 | - |
214
+ | 1.4478 | 2900 | 0.0001 | - |
215
+ | 1.4728 | 2950 | 0.0001 | - |
216
+ | 1.4978 | 3000 | 0.0001 | - |
217
+ | 1.5227 | 3050 | 0.0001 | - |
218
+ | 1.5477 | 3100 | 0.0 | - |
219
+ | 1.5726 | 3150 | 0.0 | - |
220
+ | 1.5976 | 3200 | 0.0001 | - |
221
+ | 1.6226 | 3250 | 0.0001 | - |
222
+ | 1.6475 | 3300 | 0.0001 | - |
223
+ | 1.6725 | 3350 | 0.0001 | - |
224
+ | 1.6975 | 3400 | 0.0001 | - |
225
+ | 1.7224 | 3450 | 0.0 | - |
226
+ | 1.7474 | 3500 | 0.0002 | - |
227
+ | 1.7723 | 3550 | 0.0001 | - |
228
+ | 1.7973 | 3600 | 0.0 | - |
229
+ | 1.8223 | 3650 | 0.0 | - |
230
+ | 1.8472 | 3700 | 0.0001 | - |
231
+ | 1.8722 | 3750 | 0.0 | - |
232
+ | 1.8972 | 3800 | 0.0001 | - |
233
+ | 1.9221 | 3850 | 0.0 | - |
234
+ | 1.9471 | 3900 | 0.0 | - |
235
+ | 1.9720 | 3950 | 0.0001 | - |
236
+ | 1.9970 | 4000 | 0.0 | - |
237
+ | 2.0 | 4006 | - | 0.2593 |
238
+ | 2.0220 | 4050 | 0.0001 | - |
239
+ | 2.0469 | 4100 | 0.0001 | - |
240
+ | 2.0719 | 4150 | 0.0 | - |
241
+ | 2.0969 | 4200 | 0.0001 | - |
242
+ | 2.1218 | 4250 | 0.0 | - |
243
+ | 2.1468 | 4300 | 0.0001 | - |
244
+ | 2.1717 | 4350 | 0.0001 | - |
245
+ | 2.1967 | 4400 | 0.0001 | - |
246
+ | 2.2217 | 4450 | 0.0001 | - |
247
+ | 2.2466 | 4500 | 0.0001 | - |
248
+ | 2.2716 | 4550 | 0.0 | - |
249
+ | 2.2966 | 4600 | 0.0 | - |
250
+ | 2.3215 | 4650 | 0.0 | - |
251
+ | 2.3465 | 4700 | 0.0001 | - |
252
+ | 2.3714 | 4750 | 0.0001 | - |
253
+ | 2.3964 | 4800 | 0.0002 | - |
254
+ | 2.4214 | 4850 | 0.0001 | - |
255
+ | 2.4463 | 4900 | 0.0001 | - |
256
+ | 2.4713 | 4950 | 0.0 | - |
257
+ | 2.4963 | 5000 | 0.0001 | - |
258
+ | 2.5212 | 5050 | 0.0001 | - |
259
+ | 2.5462 | 5100 | 0.0 | - |
260
+ | 2.5711 | 5150 | 0.0001 | - |
261
+ | 2.5961 | 5200 | 0.0 | - |
262
+ | 2.6211 | 5250 | 0.0 | - |
263
+ | 2.6460 | 5300 | 0.0 | - |
264
+ | 2.6710 | 5350 | 0.0 | - |
265
+ | 2.6960 | 5400 | 0.0 | - |
266
+ | 2.7209 | 5450 | 0.0 | - |
267
+ | 2.7459 | 5500 | 0.0 | - |
268
+ | 2.7708 | 5550 | 0.0 | - |
269
+ | 2.7958 | 5600 | 0.0001 | - |
270
+ | 2.8208 | 5650 | 0.0 | - |
271
+ | 2.8457 | 5700 | 0.0 | - |
272
+ | 2.8707 | 5750 | 0.0 | - |
273
+ | 2.8957 | 5800 | 0.0 | - |
274
+ | 2.9206 | 5850 | 0.0 | - |
275
+ | 2.9456 | 5900 | 0.0001 | - |
276
+ | 2.9705 | 5950 | 0.0 | - |
277
+ | 2.9955 | 6000 | 0.0 | - |
278
+ | 3.0 | 6009 | - | 0.2738 |
279
 
280
  * The bold row denotes the saved checkpoint.
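The bold row is the epoch-end row with the lowest validation loss, and its step matches the `checkpoints/step_2003` path recorded in `config.json` in this commit. A minimal sketch of that selection follows; that the trainer actually picked the checkpoint by lowest validation loss is an assumption, and the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the epoch-end rows of the table above.
validation_losses = [(2003, 0.1735), (4006, 0.2593), (6009, 0.2738)]

# Keep the step whose validation loss is smallest.
best_step, best_loss = min(validation_losses, key=lambda row: row[1])
checkpoint_dir = f"checkpoints/step_{best_step}"
```

Validation loss rises after the first epoch while training loss stays near zero, which is the usual signature of overfitting on a small training set.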
281
  ### Framework Versions
config.json CHANGED
@@ -1,24 +1,28 @@
 {
-  "_name_or_path": "checkpoints/step_1452",
+  "_name_or_path": "checkpoints/step_2003",
   "architectures": [
-    "MPNetModel"
+    "XLMRobertaModel"
   ],
   "attention_probs_dropout_prob": 0.1,
   "bos_token_id": 0,
+  "classifier_dropout": null,
   "eos_token_id": 2,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
+  "hidden_size": 1024,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
+  "intermediate_size": 4096,
   "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "mpnet",
-  "num_attention_heads": 12,
-  "num_hidden_layers": 12,
+  "max_position_embeddings": 8194,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "output_past": true,
   "pad_token_id": 1,
-  "relative_attention_num_buckets": 32,
+  "position_embedding_type": "absolute",
   "torch_dtype": "float32",
   "transformers_version": "4.39.0",
-  "vocab_size": 30527
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
 }
config_sentence_transformers.json CHANGED
@@ -1,8 +1,8 @@
 {
   "__version__": {
-    "sentence_transformers": "2.0.0",
-    "transformers": "4.6.1",
-    "pytorch": "1.8.1"
+    "sentence_transformers": "2.2.2",
+    "transformers": "4.33.0",
+    "pytorch": "2.1.2+cu121"
   },
   "prompts": {},
   "default_prompt_name": null
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e3aecb73df1f4899ecef5a4732119cf7319250902ae704c7a28e0ffc99a99eda
-size 437967672
+oid sha256:3c53fbde2e0a8e51b9e5ba603737ededaa63700e00099ac3a1c04747711b4d59
+size 2271064456
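The size change in the LFS pointer gives a rough sense of how much larger the new encoder is. Since `config.json` declares `torch_dtype` as `float32`, each weight takes 4 bytes, so dividing the file size by 4 approximates the parameter count. This is a back-of-the-envelope sketch that ignores the small safetensors header; the helper name is illustrative.

```python
BYTES_PER_FLOAT32 = 4


def approx_param_count(file_size_bytes: int) -> int:
    """Rough parameter count for a float32 checkpoint, ignoring the file header."""
    return file_size_bytes // BYTES_PER_FLOAT32


old_params = approx_param_count(437_967_672)    # previous model.safetensors
new_params = approx_param_count(2_271_064_456)  # model.safetensors in this commit
growth = new_params / old_params
```

By this estimate the checkpoint grows from about 109 million to about 568 million parameters, a bit more than a fivefold increase.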
model_head.pkl CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:03051f677d184f53e2f1ee6703c4aba2d3bf9688700c5baa2bcd55b8790363d4
-size 7039
+oid sha256:6ca465fc25015364d0a19db7bd6ca20b2f8dfd3fa876e23b415f2fcd1959f980
+size 9087
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
 {
-  "max_seq_length": 384,
+  "max_seq_length": 8192,
   "do_lower_case": false
 }
special_tokens_map.json CHANGED
@@ -42,7 +42,7 @@
     "single_word": false
   },
   "unk_token": {
-    "content": "[UNK]",
+    "content": "<unk>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:46fb1f735006f52c1f9744bb05f7bc1544ec8475955af30396212c7737558d1e
-size 710932
+oid sha256:1af481bd08ed9347cf9d3d07c24e5de75a10983819de076436400609e6705686
+size 17083075
tokenizer_config.json CHANGED
@@ -27,20 +27,12 @@
     "3": {
       "content": "<unk>",
       "lstrip": false,
-      "normalized": true,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "104": {
-      "content": "[UNK]",
-      "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "30526": {
+    "250001": {
       "content": "<mask>",
       "lstrip": true,
       "normalized": false,
@@ -52,21 +44,19 @@
   "bos_token": "<s>",
   "clean_up_tokenization_spaces": true,
   "cls_token": "<s>",
-  "do_lower_case": true,
   "eos_token": "</s>",
   "mask_token": "<mask>",
-  "max_length": 128,
-  "model_max_length": 512,
+  "max_length": 8192,
+  "model_max_length": 8192,
   "pad_to_multiple_of": null,
   "pad_token": "<pad>",
   "pad_token_type_id": 0,
   "padding_side": "right",
   "sep_token": "</s>",
+  "sp_model_kwargs": {},
   "stride": 0,
-  "strip_accents": null,
-  "tokenize_chinese_chars": true,
-  "tokenizer_class": "MPNetTokenizer",
+  "tokenizer_class": "XLMRobertaTokenizer",
   "truncation_side": "right",
   "truncation_strategy": "longest_first",
-  "unk_token": "[UNK]"
+  "unk_token": "<unk>"
 }