Fix a crazy bug and add README
Some tokens become joined into a single token after the punctuation has been removed: for instance, `@.@` becomes `@@`, which we then cannot find in the original string. So switch from using `find` to using the end indices, and keep track of the punctuation we removed from the original string.
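To make the failure mode concrete, here is a minimal sketch of the bug and of the fix's bookkeeping (illustrative values only, not the repository's code):

```python
# Wikitext marks hyphens and decimal points as '@-@' and '@.@'. The baseline
# model expects input with punctuation stripped, so '@.@' collapses into '@@',
# a token that str.find() can no longer locate in the original string.
original = "1 @.@ 9 in"
stripped = original.replace(".", "")  # "1 @@ 9 in"
print(original.find("@@"))           # -1: the joined token is not in the original

# The fix: stop searching for token text. Take the pipeline's 'end' offset into
# the stripped string and shift it past every punctuation character that was
# removed before that point (their indices are recorded at removal time).
removed_indices = [i for i, c in enumerate(original) if c in ".?-:"]  # [3]
offset = 4  # the 'end' offset of the '@@' token within "1 @@ 9 in"
for idx in removed_indices:
    if idx < offset:
        offset += 1
print(offset)  # 5: the matching insertion point in the original "1 @.@ 9 in"
```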
- README.md +97 -7
- commafixer/src/baseline.py +100 -21
- notebooks/evaluation.ipynb +9 -9
- openapi.yaml +3 -4
- setup.py +1 -1
- tests/test_baseline.py +13 -4
README.md
CHANGED
```diff
@@ -10,26 +10,116 @@ pinned: true
 app_port: 8000
 ---
 
+# Comma fixer
+This repository contains a web service for fixing comma placement within a given text, for instance:
+
+`"A sentence however, not quite good correct and sound."` -> `"A sentence, however, not quite good, correct and sound."`
+
+It provides a webpage for testing the functionality, a REST API,
+and Jupyter notebooks for evaluating and training comma fixing models.
+
+A web demo is hosted in the [huggingface spaces](https://huggingface.co/spaces/klasocki/comma-fixer).
+
+## Development setup
+
+Deploying the service for local development can be done by running `docker-compose up` in the root directory.
+Note that you might have to
 `sudo service docker start`
+first.
+
+The application should then be available at http://localhost:8000.
+For the API, see the `openapi.yaml` file.
+Docker-compose mounts a volume and listens to changes in the source code, so the application will be reloaded and
+will reflect them.
+
+We use multi-stage builds to reduce the image size, ensure flexibility in requirements, and ensure that tests run before
+each deployment.
+However, while it does reduce the size by nearly 3GB, the resulting image still contains deep learning libraries and
+pre-downloaded models, and will take around 7GB of disk space.
+
+Alternatively, you can set up a Python environment by hand. It is recommended to use a virtualenv. Inside one, run
+```bash
+pip install -e .[test]
+```
+The `[test]` option makes sure to install test dependencies.
+
+If you intend to perform training and evaluation of deep learning models, also install the `[training]` option.
+
+### Running tests
+To run the tests, execute
+```bash
+docker build -t comma-fixer --target test .
+```
+Or `python -m pytest tests/` if you already have a local Python environment.
 
-`docker log [id]` for logs from the container.
 
-
+### Deploying to huggingface spaces
+In order to deploy the application, one needs to be added as a collaborator to the space and have set up a
+corresponding git remote.
+The application is then continuously deployed on each push.
+```bash
+git remote add hub https://huggingface.co/spaces/klasocki/comma-fixer
+git push hub
+```
 
-
+## Evaluation
 
-
-
+In order to evaluate, run `jupyter notebook notebooks/` or copy the notebooks to a web hosting service with GPUs,
+such as Google Colab or Kaggle,
+and clone this repository there.
 
-
+We use the [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large)
+model as the baseline.
+It is a RoBERTa large model fine-tuned for the task of punctuation restoration on a dataset of political speeches
+in English, German, French and Italian.
+That is, it takes a sentence without any punctuation as input, and predicts the missing punctuation in token
+classification fashion, thanks to which the original token structure stays unchanged.
+We use a subset of its capabilities, focusing solely on commas and leaving other punctuation intact.
+
+The authors report the following token classification F1 scores on commas for different languages on the original
+dataset:
 
 | English | German | French | Italian |
 |---------|--------|--------|---------|
 | 0.819   | 0.945  | 0.831  | 0.798   |
 
-
+The results of our evaluation of the baseline model out of domain on the English wikitext-103-raw-v1 validation
+dataset are as follows:
 
 | precision | recall | F1   | support |
 |-----------|--------|------|---------|
 | 0.79      | 0.71   | 0.75 | 10079   |
 
+We treat each comma as one token instance, as opposed to the original paper, which NER-tags the whole multiple-token
+preceding words as comma-class tokens.
+In our approach, for each comma in the prediction text obtained from the model:
+* If it should be there according to ground truth, it counts as a true positive.
+* If it should not be there, it counts as a false positive.
+* If a comma from ground truth is not predicted, it counts as a false negative.
+
+## Training
+While fine-tuning an encoder BERT-like pre-trained model for NER seems like the best approach to the problem,
+since it preserves the sentence structure and only focuses on commas,
+with limited GPU resources we doubt we could beat the baseline model with a similar approach.
+We could fine-tune the baseline on our data, focusing on commas, and see if it brings any improvement.
+
+However, we thought that trying out pre-trained text-to-text or decoder-only LLMs for this task using PEFT could be
+interesting, and wanted to check if we have enough resources for low-rank adaptation or prefix-tuning.
+
+We adapt the code from [this tutorial](https://www.youtube.com/watch?v=iYr1xZn26R8) in order to fine-tune a
+[bloom LLM](https://huggingface.co/bigscience/bloom-560m) to our task using
+[LoRA](https://arxiv.org/pdf/2106.09685.pdf).
+However, even with the smallest model from the family, we struggled with CUDA memory errors using the free Google
+Colab GPU quotas, and could only train with a batch size of two.
+After a short training, it seems the loss keeps fluctuating and the model is only able to learn to repeat the
+original phrase back.
+
+If time permits, we plan to experiment with seq2seq pre-trained models, increasing gradient accumulation steps, and the
+percentage of data with commas.
+The latter could help, since wikitext contains highly diverse data, with many rows being empty strings,
+headers, or short paragraphs.
```
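As a companion to the evaluation scheme described in the new README section, here is a hypothetical sketch of the per-comma counting (the helper names are ours, not the repository's). It assumes predicted and ground-truth texts differ only in commas, so words can be aligned by position:

```python
def comma_positions(text: str) -> set[int]:
    # Identify each comma by the index of the word it follows.
    return {i for i, word in enumerate(text.split()) if word.endswith(',')}

def comma_precision_recall(predicted: str, ground_truth: str) -> tuple[float, float]:
    pred, gold = comma_positions(predicted), comma_positions(ground_truth)
    tp = len(pred & gold)   # predicted commas confirmed by ground truth
    fp = len(pred - gold)   # predicted commas that should not be there
    fn = len(gold - pred)   # ground-truth commas the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Matches the sanity check in notebooks/evaluation.ipynb: 2 commas predicted
# correctly and 3 missed give 100% precision and 40% recall.
print(comma_precision_recall("One, two three four, five six.",
                             "One, two, three, four, five, six."))  # (1.0, 0.4)
```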
commafixer/src/baseline.py
CHANGED
```diff
@@ -1,52 +1,131 @@
 from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline, NerPipeline
+import re
 
 
 class BaselineCommaFixer:
+    """
+    A wrapper class for the oliverguhr/fullstop-punctuation-multilang-large baseline punctuation restoration model.
+    It adapts the model to perform comma fixing instead of full punctuation restoration, that is, it removes the
+    punctuation, runs the model, and then uses its outputs so that only commas are changed.
+    """
+
     def __init__(self, device=-1):
         self._ner = _create_baseline_pipeline(device=device)
 
     def fix_commas(self, s: str) -> str:
+        """
+        The main method for fixing commas using the baseline model.
+        In the future we should think about batching the calls to it; for now it processes requests string by string.
+        :param s: A string with commas to fix, without length restrictions.
+            Example: comma_fixer.fix_commas("One two thre, and four!")
+        :return: A string with commas fixed, example: "One, two, thre and four!"
+        """
+        s_no_punctuation, punctuation_indices = _remove_punctuation(s)
         return _fix_commas_based_on_pipeline_output(
-            self._ner(
-                s
+            self._ner(s_no_punctuation),
+            s,
+            punctuation_indices
         )
 
 
 def _create_baseline_pipeline(model_name="oliverguhr/fullstop-punctuation-multilang-large", device=-1) -> NerPipeline:
+    """
+    Creates the huggingface pipeline object.
+    Can also be used for pre-downloading the model and the tokenizer.
+    :param model_name: Name of the baseline model on the huggingface hub.
+    :param device: Device to use when running the pipeline; defaults to -1 for CPU, a higher number indicates the id
+        of the GPU to use.
+    :return: A token classification pipeline.
+    """
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForTokenClassification.from_pretrained(model_name)
     return pipeline('ner', model=model, tokenizer=tokenizer, device=device)
 
 
-def _remove_punctuation(s: str) -> str:
-
-
-
-
-
-
-
+def _remove_punctuation(s: str) -> tuple[str, list[int]]:
+    """
+    Removes the punctuation (".,?-:") from the input text, since the baseline model has been trained on data without
+    punctuation. It also keeps track of the indices where we remove it, so that we can restore the original later.
+    Commas are the exception, since we remove them but restore them with the model.
+    Hence we do not keep track of removed comma indices.
+    :param s: For instance, "A short-string: with punctuation, removed."
+    :return: A tuple of a string, for instance
+        "A shortstring with punctuation removed", and a list of indices where punctuation has been removed, in
+        ascending order.
+    """
+    to_remove_regex = r"[\.\?\-:]"
+    # We're not counting commas, since we will remove them later anyway. Only counting removals that will be restored
+    # in the final resulting string.
+    punctuation_indices = [m.start() for m in re.finditer(to_remove_regex, s)]
+    s = re.sub(to_remove_regex, '', s)
+    s = s.replace(',', '')
+    return s, punctuation_indices
+
+
+def _fix_commas_based_on_pipeline_output(pipeline_json: list[dict], original_s: str, punctuation_indices: list[int]) -> str:
+    """
+    Takes the comma-fixing token classification pipeline output and converts it to a string based on the original
+    string and punctuation indices, so that the string contains all the original characters, except commas, intact.
+    :param pipeline_json: Token classification pipeline output.
+        Contains five fields.
+        'entity' is the punctuation that should follow this token.
+        'word' is the token text together with the preceding space, if any.
+        'end' is the end index in the original string (with punctuation removed, in our case!)
+        Example: [{'entity': ':',
+                   'score': 0.90034866,
+                   'index': 1,
+                   'word': '▁Exam',
+                   'start': 0,
+                   'end': 4},
+                  {'entity': ':',
+                   'score': 0.9157294,
+                   'index': 2,
+                   'word': 'ple',
+                   'start': 4,
+                   'end': 7}]
+    :param original_s: The original string, before removing punctuation.
+    :param punctuation_indices: The indices of the removed punctuation, except commas, so that we can correctly keep
+        track of the current offset in the original string.
+    :return: A string with commas fixed, and the other original punctuation from the input string restored.
+    """
     result = original_s.replace(',', '')  # We will fix the commas, but keep everything else intact
-
+
+    commas_inserted_or_punctuation_removed = 0
+    removed_punctuation_index = 0
 
     for i in range(1, len(pipeline_json)):
-        current_offset =
+        current_offset = pipeline_json[i - 1]['end'] + commas_inserted_or_punctuation_removed
+
+        commas_inserted_or_punctuation_removed, current_offset, removed_punctuation_index = (
+            _update_offset_by_the_removed_punctuation(
+                commas_inserted_or_punctuation_removed, current_offset, punctuation_indices, removed_punctuation_index
+            )
+        )
+
         if _should_insert_comma(i, pipeline_json):
             result = result[:current_offset] + ',' + result[current_offset:]
-
+            commas_inserted_or_punctuation_removed += 1
     return result
 
 
-def
-
-
+def _update_offset_by_the_removed_punctuation(
+        commas_inserted_and_punctuation_removed, current_offset, punctuation_indices, removed_punctuation_index
+):
+    # Increase the counters for every punctuation character removed from the original string before the current offset.
+    while (removed_punctuation_index < len(punctuation_indices) and
+           punctuation_indices[removed_punctuation_index] < current_offset):
+        commas_inserted_and_punctuation_removed += 1
+        removed_punctuation_index += 1
+        current_offset += 1
+    return commas_inserted_and_punctuation_removed, current_offset, removed_punctuation_index
 
-
-
-    #
-
-    return current_offset
+
+def _should_insert_comma(i, pipeline_json, new_word_indicator='▁') -> bool:
+    # Only insert commas for the final token of a word, that is, if the next word starts with a space.
+    return pipeline_json[i - 1]['entity'] == ',' and pipeline_json[i]['word'].startswith(new_word_indicator)
 
 
 if __name__ == "__main__":
```
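A quick usage sketch of the fixed module, assuming the package layout matches the file path. The `_remove_punctuation` result follows from its docstring (indices computed by hand); the `fix_commas` output shows the intended behaviour from the README example, though actual model predictions can vary:

```python
from commafixer.src.baseline import BaselineCommaFixer, _remove_punctuation

# The helper now returns the stripped string plus the indices of the removed
# punctuation, which the offset bookkeeping in the diff above relies on.
print(_remove_punctuation("A short-string: with punctuation, removed."))
# -> ('A shortstring with punctuation removed', [7, 14, 41])

fixer = BaselineCommaFixer()  # device=-1 by default, i.e. CPU
print(fixer.fix_commas("A sentence however, not quite good correct and sound."))
# Intended output: "A sentence, however, not quite good, correct and sound."
```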
notebooks/evaluation.ipynb
CHANGED
```diff
@@ -283,15 +283,6 @@
     }
    ]
   },
-  {
-   "cell_type": "markdown",
-   "source": [
-    "We have 2 commas predicted correctly, and 3 missed, so we are expecting 100% precision and 40% recall."
-   ],
-   "metadata": {
-    "id": "NzVo05UcoPlb"
-   }
-  },
   {
    "cell_type": "code",
    "source": [
@@ -346,6 +337,15 @@
     }
    ]
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We have 2 commas predicted correctly, and 3 missed, so we are expecting 100% precision and 40% recall."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
   {
    "cell_type": "code",
    "source": [
```
openapi.yaml
CHANGED
```diff
@@ -32,11 +32,10 @@ paths:
               s:
                 type: string
                 example: 'This is a sentence with wrong commas, at least some.'
-        description: A text with commas fixed, or unchanged if not necessary.
-
-          TODO some other punctuation may be changed as well
+        description: A text with commas fixed, or unchanged if not necessary. Everything other than
+          commas will stay as it was originally.
 
       400:
-        description:
+        description: A required field missing from the POST request body JSON.
 
```
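For reference, a hypothetical client call matching the schema above (the `/fix-commas` path is our placeholder; consult the full openapi.yaml for the actual route):

```python
import requests

# The request body carries the required string field 's'; omitting it should
# trigger the 400 response documented above.
response = requests.post(
    'http://localhost:8000/fix-commas',  # placeholder path, see openapi.yaml
    json={'s': 'This is a sentence with wrong commas, at least some.'}
)
print(response.status_code, response.json())
```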
setup.py
CHANGED
```diff
@@ -21,7 +21,7 @@ setup(
     extras_require={
         'training': [
             'datasets==2.14.4',
-            'seqeval'
+            'seqeval',
             'notebook'
         ],
         'test': [
```
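The one-character setup.py change fixes a real packaging bug: adjacent string literals concatenate implicitly in Python, so without the comma the extras list silently asked pip for a nonexistent package:

```python
# Before the fix: implicit literal concatenation merges the two requirements.
broken = ['datasets==2.14.4', 'seqeval' 'notebook']
print(broken)  # ['datasets==2.14.4', 'seqevalnotebook'] -- only two entries!
```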
tests/test_baseline.py
CHANGED
```diff
@@ -30,7 +30,15 @@ def test_fix_commas_leaves_correct_strings_unchanged(baseline_fixer, test_input)
      ['Even newlines\ntabs\tand others get preserved.',
       'Even newlines,\ntabs\tand others get preserved.'],
      ['I had no Creativity left, therefore, I come here, and write useless examples, for this test.',
-      'I had no Creativity left therefore, I come here and write useless examples for this test.']
+      'I had no Creativity left therefore, I come here and write useless examples for this test.'],
+     [' This is a sentence. With, a lot of, useless punctuation!!??. O.o However we have to insert commas O-O, '
+      'nonetheless or we will fail this test.',
+      ' This is a sentence. With a lot of useless punctuation!!??. O.o However, we have to insert commas O-O '
+      'nonetheless, or we will fail this test.'],
+     [" The ship 's secondary armament consisted of fourteen 45 @-@ calibre 6 @-@ inch ( 152 mm ) quick @-@ firing ( QF ) guns mounted in casemates . Lighter guns consisted of eight 47 @-@ millimetre ( 1 @.@ 9 in ) three @-@ pounder Hotchkiss guns and four 47 @-@ millimetre 2 @.@ 5 @-@ pounder Hotchkiss guns . The ship was also equipped with four submerged 18 @-@ inch torpedo tubes two on each broadside .",
+      " The ship 's secondary armament consisted of fourteen 45 @-@ calibre 6 @-@ inch ( 152 mm ) quick @-@ firing ( QF ) guns mounted in casemates . Lighter guns consisted of eight 47 @-@ millimetre ( 1 @.@ 9 in ), three @-@ pounder Hotchkiss guns and four 47 @-@ millimetre 2 @.@ 5 @-@ pounder Hotchkiss guns . The ship was also equipped with four submerged 18 @-@ inch torpedo tubes, two on each broadside ."]
+     ]
 )
 def test_fix_commas_fixes_incorrect_commas(baseline_fixer, test_input, expected):
     result = baseline_fixer.fix_commas(s=test_input)
@@ -39,10 +47,11 @@ def test_fix_commas_fixes_incorrect_commas(baseline_fixer, test_input, expected)
 
 @pytest.mark.parametrize(
     "test_input, expected",
-    [['', ''],
-     ['
+    [['', ('', [])],
+     [' world...', (' world', [6, 7, 8])],
+     [',,,', ('', [])],
      ['This: test - string should not, have any commas inside it...?',
-      'This test string should not have any commas inside it']]
+      ('This test string should not have any commas inside it', [4, 11, 57, 58, 59, 60])]]
 )
 def test__remove_punctuation(test_input, expected):
     assert _remove_punctuation(test_input) == expected
```