Add README.md configuration
README.md
CHANGED
@@ -1,3 +1,13 @@
# SummVis

SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
@@ -95,14 +105,7 @@ is omitted for copyright reasons). The `preprocessing.py` script can be used for
#### Deanonymize 10 examples:
```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/10:cnn_dailymail_1000.validation \
- --n_samples 10
```
This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
the Datasets library.
@@ -149,48 +152,22 @@ Set the `--n_samples` argument and name the `--processed_dataset_path` output fi
#### Example: Deanonymize 100 examples from CNN / Daily Mail:
```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/100:cnn_dailymail_1000.validation \
- --n_samples 100
```

#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/full:cnn_dailymail_1000.validation \
- --n_samples 1000
```

#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --processed_dataset_path data/full:cnn_dailymail.validation
```

#### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
```shell
- python preprocessing.py \
- --deanonymize \
- --dataset_rg preprocessing/xsum_1000.validation.anonymized \
- --dataset xsum \
- --split validation \
- --processed_dataset_path data/full:xsum_1000.validation \
- --n_samples 1000
```

### 3. Run SummVis
@@ -244,10 +221,7 @@ You may run `preprocessing.py` to precompute all data required in the interface
1. Run preprocessing script to generate cache file
```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl path/to/my_dataset.jsonl \
- --processed_dataset_path path/to/my_cache_file
```
You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
@@ -278,20 +252,12 @@ standardized format with columns for `document` and `summary:reference`.
##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
```shell
- python preprocessing.py \
- --standardize \
- --dataset cnn_dailymail \
- --version 3.0.0 \
- --split validation \
- --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```

##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
```shell
- python preprocessing.py \
- --standardize \
- --dataset_jsonl path/to/my_dataset.jsonl \
- --save_jsonl_path preprocessing/my_dataset.jsonl
```

Expected format of `my_dataset.jsonl`:
@@ -313,17 +279,7 @@ You may also generate your own predictions using this [this script](generation.p
##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
```shell
- python preprocessing.py \
- --join_predictions \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --prediction_jsonls \
- predictions/bart-cnndm.cnndm.validation.results.anonymized \
- predictions/bart-xsum.cnndm.validation.results.anonymized \
- predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
- predictions/pegasus-multinews.cnndm.validation.results.anonymized \
- predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
- predictions/pegasus-xsum.cnndm.validation.results.anonymized \
- --save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```

#### 3. Run the preprocessing workflow and save the dataset.
@@ -333,19 +289,12 @@ and stores the processed dataset back to disk.
##### Example: Autorun with default settings on a few examples to try it.
```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --processed_dataset_path data/cnn_dailymail.validation \
- --try_it
```

##### Example: Autorun with default settings on all examples.
```shell
- python preprocessing.py \
- --workflow \
- --dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
- --processed_dataset_path data/cnn_dailymail
```
@@ -1,3 +1,13 @@
+ ---
+ title: Summvis
+ emoji: π
+ colorFrom: yellow
+ colorTo: green
+ sdk: streamlit
+ app_file: app.py
+ pinned: false
+ ---
+
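The added lines form a front-matter header: a block fenced by `---` lines that holds `key: value` settings such as `sdk` and `app_file`. A minimal sketch of reading such a header, assuming only flat `key: value` pairs like the ones in this diff. The function name `read_front_matter` is hypothetical, values are kept as plain strings, and anything nested would need a real YAML parser:

```python
def read_front_matter(text):
    """Return (settings, body) for text that starts with a '---' fenced header."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no header present
    settings = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing fence: everything after it is the README body.
            body = "\n".join(lines[index + 1:])
            return settings, body
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return {}, text  # header never closed, so treat the text as plain body


# Sample input copied from the added lines (emoji field left out).
readme = """---
title: Summvis
colorFrom: yellow
colorTo: green
sdk: streamlit
app_file: app.py
pinned: false
---

# SummVis
"""

settings, body = read_front_matter(readme)
print(settings["sdk"], settings["app_file"])
```

Because values stay as strings, `pinned` comes back as the text `false` rather than a boolean.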
# SummVis

SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
@@ -95,14 +105,7 @@
#### Deanonymize 10 examples:
```shell
+ python preprocessing.py \\n--deanonymize \\n--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \\n--dataset cnn_dailymail \\n--version 3.0.0 \\n--split validation \\n--processed_dataset_path data/10:cnn_dailymail_1000.validation \\n--n_samples 10
```
This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
the Datasets library.
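In the added line the whole command sits on a single line, with the literal text `\\n` where the removed version had one argument per line. A minimal sketch of turning such a line back into a multi-line shell command, assuming each `\\n` stands for a backslash line continuation as the removed lines suggest. The helper name `expand_command` is hypothetical:

```python
def expand_command(collapsed):
    """Rebuild a multi-line shell command from a line with literal markers."""
    # The marker is three literal characters: backslash, backslash, "n".
    marker = "\\\\n"
    parts = [part.strip() for part in collapsed.split(marker)]
    # Rejoin with " \" plus a real newline so the shell reads one command.
    return " \\\n".join(parts)


# A shortened version of the added line, written as a raw string so the
# backslashes are kept exactly as they appear in the diff.
collapsed = r"python preprocessing.py \\n--deanonymize \\n--n_samples 10"
print(expand_command(collapsed))
```

A line without any marker is returned unchanged, so the helper is safe to run over commands that were never collapsed.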
@@ -149,48 +152,22 @@
#### Example: Deanonymize 100 examples from CNN / Daily Mail:
```shell
+ python preprocessing.py \\n--deanonymize \\n--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \\n--dataset cnn_dailymail \\n--version 3.0.0 \\n--split validation \\n--processed_dataset_path data/100:cnn_dailymail_1000.validation \\n--n_samples 100
```

#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
```shell
+ python preprocessing.py \\n--deanonymize \\n--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \\n--dataset cnn_dailymail \\n--version 3.0.0 \\n--split validation \\n--processed_dataset_path data/full:cnn_dailymail_1000.validation \\n--n_samples 1000
```

#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
```shell
+ python preprocessing.py \\n--deanonymize \\n--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \\n--dataset cnn_dailymail \\n--version 3.0.0 \\n--split validation \\n--processed_dataset_path data/full:cnn_dailymail.validation
```

#### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
```shell
+ python preprocessing.py \\n--deanonymize \\n--dataset_rg preprocessing/xsum_1000.validation.anonymized \\n--dataset xsum \\n--split validation \\n--processed_dataset_path data/full:xsum_1000.validation \\n--n_samples 1000
```

### 3. Run SummVis
@@ -244,10 +221,7 @@
1. Run preprocessing script to generate cache file
```shell
+ python preprocessing.py \\n --workflow \\n --dataset_jsonl path/to/my_dataset.jsonl \\n --processed_dataset_path path/to/my_cache_file
```
You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
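A subset run differs from the full run only by the extra `--n_samples` argument. A minimal sketch of assembling both variants as argument lists, using only the flags shown above. The helper name `workflow_command` and the sample count of 50 are illustrative assumptions:

```python
import shlex


def workflow_command(dataset_jsonl, cache_path, n_samples=None):
    """Build the argument list for a preprocessing.py --workflow run."""
    command = [
        "python", "preprocessing.py",
        "--workflow",
        "--dataset_jsonl", dataset_jsonl,
        "--processed_dataset_path", cache_path,
    ]
    if n_samples is not None:
        # Limit the run to a subset of the data for a quick trial.
        command += ["--n_samples", str(n_samples)]
    return command


trial = workflow_command(
    "path/to/my_dataset.jsonl", "path/to/my_cache_file", n_samples=50
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(trial))
```

The returned list can also be handed to `subprocess.run` directly, which avoids shell quoting issues for paths that contain spaces.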
@@ -278,20 +252,12 @@
##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
```shell
+ python preprocessing.py \\n--standardize \\n--dataset cnn_dailymail \\n--version 3.0.0 \\n--split validation \\n--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```

##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
```shell
+ python preprocessing.py \\n--standardize \\n--dataset_jsonl path/to/my_dataset.jsonl \\n--save_jsonl_path preprocessing/my_dataset.jsonl
```

Expected format of `my_dataset.jsonl`:
@@ -313,17 +279,7 @@
##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
```shell
+ python preprocessing.py \\n--join_predictions \\n--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \\n--prediction_jsonls \\npredictions/bart-cnndm.cnndm.validation.results.anonymized \\npredictions/bart-xsum.cnndm.validation.results.anonymized \\npredictions/pegasus-cnndm.cnndm.validation.results.anonymized \\npredictions/pegasus-multinews.cnndm.validation.results.anonymized \\npredictions/pegasus-newsroom.cnndm.validation.results.anonymized \\npredictions/pegasus-xsum.cnndm.validation.results.anonymized \\n--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```

#### 3. Run the preprocessing workflow and save the dataset.
@@ -333,19 +289,12 @@
##### Example: Autorun with default settings on a few examples to try it.
```shell
+ python preprocessing.py \\n--workflow \\n--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \\n--processed_dataset_path data/cnn_dailymail.validation \\n--try_it
```

##### Example: Autorun with default settings on all examples.
```shell
+ python preprocessing.py \\n--workflow \\n--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \\n--processed_dataset_path data/cnn_dailymail
```