cbensimon (HF staff) committed

Commit 68e5edd • 1 Parent(s): 6124176

Add README.md configuration

Files changed (1):
  1. README.md +21 -72
README.md CHANGED
@@ -1,3 +1,13 @@
  # SummVis
 
  SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
@@ -95,14 +105,7 @@ is omitted for copyright reasons). The `preprocessing.py` script can be used for
 
  #### Deanonymize 10 examples:
  ```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/10:cnn_dailymail_1000.validation \
- --n_samples 10
  ```
  This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
  the Datasets library.
@@ -149,48 +152,22 @@ Set the `--n_samples` argument and name the `--processed_dataset_path` output fi
 
  #### Example: Deanonymize 100 examples from CNN / Daily Mail:
  ```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/100:cnn_dailymail_1000.validation \
- --n_samples 100
  ```
 
  #### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
  ```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/full:cnn_dailymail_1000.validation \
- --n_samples 1000
  ```
 
  #### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
  ```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/full:cnn_dailymail.validation
  ```
 
  #### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
  ```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/xsum_1000.validation.anonymized \
- --dataset xsum \
- --split validation \
- --processed_dataset_path data/full:xsum_1000.validation \
- --n_samples 1000
  ```
 
  ### 3. Run SummVis
@@ -244,10 +221,7 @@ You may run `preprocessing.py` to precompute all data required in the interface
 
  1. Run preprocessing script to generate cache file
  ```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl path/to/my_dataset.jsonl \
- --processed_dataset_path path/to/my_cache_file
  ```
  You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
 
@@ -278,20 +252,12 @@ standardized format with columns for `document` and `summary:reference`.
 
  ##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
  ```shell
- python preprocessing.py \
- --standardize \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
  ```
 
  ##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
  ```shell
- python preprocessing.py \
- --standardize \
- --dataset_jsonl path/to/my_dataset.jsonl \
- --save_jsonl_path preprocessing/my_dataset.jsonl
  ```
 
  Expected format of `my_dataset.jsonl`:
@@ -313,17 +279,7 @@ You may also generate your own predictions using this [this script](generation.p
 
  ##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
  ```shell
- python preprocessing.py \
- --join_predictions \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --prediction_jsonls \
- predictions/bart-cnndm.cnndm.validation.results.anonymized \
- predictions/bart-xsum.cnndm.validation.results.anonymized \
- predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
- predictions/pegasus-multinews.cnndm.validation.results.anonymized \
- predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
- predictions/pegasus-xsum.cnndm.validation.results.anonymized \
- --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
  ```
 
  #### 3. Run the preprocessing workflow and save the dataset.
@@ -333,19 +289,12 @@ and stores the processed dataset back to disk.
 
  ##### Example: Autorun with default settings on a few examples to try it.
  ```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --processed_dataset_path data/cnn_dailymail.validation \
- --try_it
  ```
 
  ##### Example: Autorun with default settings on all examples.
  ```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --processed_dataset_path data/cnn_dailymail
  ```
 
+ ---
+ title: Summvis
+ emoji: 📚
+ colorFrom: yellow
+ colorTo: green
+ sdk: streamlit
+ app_file: app.py
+ pinned: false
+ ---
+
  # SummVis
 
  SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
 
 
  #### Deanonymize 10 examples:
  ```shell
+ python preprocessing.py \
+ --deanonymize \
+ --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
+ --dataset cnn_dailymail \
+ --version 3.0.0 \
+ --split validation \
+ --processed_dataset_path data/10:cnn_dailymail_1000.validation \
+ --n_samples 10
  ```
  This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
  the Datasets library.
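
If you want the faster path, you can warm the Hugging Face Datasets cache ahead of time. The following one-liner is a minimal sketch, not part of the README excerpt above; it assumes the `datasets` package is installed and simply downloads and caches the same split that the deanonymization step reads:

```shell
# Warm the cache for CNN/DailyMail 3.0.0 (validation split) so later runs skip the download.
python -c "from datasets import load_dataset; load_dataset('cnn_dailymail', '3.0.0', split='validation')"
```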
 
 
  #### Example: Deanonymize 100 examples from CNN / Daily Mail:
  ```shell
+ python preprocessing.py \
+ --deanonymize \
+ --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
+ --dataset cnn_dailymail \
+ --version 3.0.0 \
+ --split validation \
+ --processed_dataset_path data/100:cnn_dailymail_1000.validation \
+ --n_samples 100
  ```
 
  #### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
  ```shell
+ python preprocessing.py \
+ --deanonymize \
+ --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
+ --dataset cnn_dailymail \
+ --version 3.0.0 \
+ --split validation \
+ --processed_dataset_path data/full:cnn_dailymail_1000.validation \
+ --n_samples 1000
  ```
 
  #### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
  ```shell
+ python preprocessing.py \
+ --deanonymize \
+ --dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
+ --dataset cnn_dailymail \
+ --version 3.0.0 \
+ --split validation \
+ --processed_dataset_path data/full:cnn_dailymail.validation
  ```
 
  #### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
  ```shell
+ python preprocessing.py \
+ --deanonymize \
+ --dataset_rg preprocessing/xsum_1000.validation.anonymized \
+ --dataset xsum \
+ --split validation \
+ --processed_dataset_path data/full:xsum_1000.validation \
+ --n_samples 1000
  ```
 
  ### 3. Run SummVis
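
The launch command itself is outside this excerpt, but given the Streamlit configuration added in the front matter (`sdk: streamlit`, `app_file: app.py`), a local launch would presumably be the standard Streamlit invocation:

```shell
# Assumes the entry point declared in the Space front matter; adjust if the repo uses another script.
streamlit run app.py
```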
 
 
  1. Run preprocessing script to generate cache file
  ```shell
+ python preprocessing.py \
+ --workflow \
+ --dataset_jsonl path/to/my_dataset.jsonl \
+ --processed_dataset_path path/to/my_cache_file
  ```
  You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
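
For instance, a trial run might look like the sketch below; the paths are the same placeholders used above and `10` is just an illustrative sample size:

```shell
# Trial run: cache only the first 10 examples before processing the full dataset.
python preprocessing.py \
  --workflow \
  --dataset_jsonl path/to/my_dataset.jsonl \
  --processed_dataset_path path/to/my_cache_file \
  --n_samples 10
```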
 
 
 
  ##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
  ```shell
+ python preprocessing.py \
+ --standardize \
+ --dataset cnn_dailymail \
+ --version 3.0.0 \
+ --split validation \
+ --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
  ```
 
  ##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
  ```shell
+ python preprocessing.py \
+ --standardize \
+ --dataset_jsonl path/to/my_dataset.jsonl \
+ --save_jsonl_path preprocessing/my_dataset.jsonl
  ```
 
  Expected format of `my_dataset.jsonl`:
 
 
  ##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
  ```shell
+ python preprocessing.py \
+ --join_predictions \
+ --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
+ --prediction_jsonls \
+ predictions/bart-cnndm.cnndm.validation.results.anonymized \
+ predictions/bart-xsum.cnndm.validation.results.anonymized \
+ predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
+ predictions/pegasus-multinews.cnndm.validation.results.anonymized \
+ predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
+ predictions/pegasus-xsum.cnndm.validation.results.anonymized \
+ --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
  ```
 
  #### 3. Run the preprocessing workflow and save the dataset.
 
 
  ##### Example: Autorun with default settings on a few examples to try it.
  ```shell
+ python preprocessing.py \
+ --workflow \
+ --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
+ --processed_dataset_path data/cnn_dailymail.validation \
+ --try_it
  ```
 
  ##### Example: Autorun with default settings on all examples.
  ```shell
+ python preprocessing.py \
+ --workflow \
+ --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
+ --processed_dataset_path data/cnn_dailymail
  ```