whisperkittools generated README.md
### Quality Evaluation

| Model | WER | QoI (%) | File Size (MB) |
|:------|----:|--------:|---------------:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2) | 2.85 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2) | 3.28 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_1050MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_1050MB) | 3.32 | 95 | 1050 |
| [WhisperKit/openai_whisper-large-v2_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_turbo) | 3.24 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB) | 3.33 | 94.9 | 1022 |
| [WhisperKit/openai_whisper-small](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-small) | 3.98 | 82.9 | 483 |
| [WhisperKit/openai_whisper-base](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-base) | 6.11 | 67.1 | 145 |
| [WhisperKit/openai_whisper-tiny](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-tiny) | 8.94 | 52.4 | 66 |
| [WhisperKit/openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3) | 2.48 | 95.2 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3_turbo) | 2.44 | 95.4 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB) | 2.49 | 94.8 | 1018 |
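The WER column above is the standard word error rate: the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal illustrative implementation (assuming simple whitespace tokenization; real evaluation pipelines also normalize text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quick fox"))        # 0.25 (one deleted word)
```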

### Quality-of-Inference (QoI) Certification
We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
any machine learning model in production. For WhisperKit, we take the following implementations
and benchmark them using consistent evaluation harnesses:

- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text) ($0.36/hour as of 02/29/24, 25MB max file size)
- `WhisperKit`: Argmax's Core ML implementation [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [[Repo]](https://github.com/argmaxinc/WhisperKit)
- `whisper.cpp`: A C++ implementation from ggerganov [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [[Repo]](https://github.com/ggerganov/whisper.cpp)
- `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

`WhisperOpenAIAPI` is the reference and we assume that it is using the equivalent of
[openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) in float16 precision.
In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric compared to dataset average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
where per-example behavior changes after a model update and breaks the user experience even when
the dataset average accuracy improves. In pseudocode:

```
qoi = []
for example in dataset:
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.
```
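The same comparison can be exercised as runnable Python; a minimal sketch in which the per-example WER lists are illustrative stand-ins, not measured numbers:

```python
def qoi_percent(reference_wers, optimized_wers):
    """Percentage of examples where the optimized model's per-example WER
    does not regress past the reference model's WER."""
    assert len(reference_wers) == len(optimized_wers)
    no_regressions = [
        opt <= ref for ref, opt in zip(reference_wers, optimized_wers)
    ]
    return 100.0 * sum(no_regressions) / len(no_regressions)

reference = [0.00, 0.10, 0.05, 0.20]  # reference model per-example WER (illustrative)
optimized = [0.00, 0.12, 0.05, 0.15]  # optimized model per-example WER (illustrative)
print(qoi_percent(reference, optimized))  # 75.0: one of four examples regressed
```

Note that the second example regresses (0.12 > 0.10) even though the dataset-average WER of the optimized model is lower, which is exactly the case `qoi` penalizes and average WER hides.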

We use `librispeech/test.clean` (~5 hours of short English audio clips) and `earnings22` (~120 hours of long English audio clips with various accents).
We anticipate developers who use Whisper (or similar models) in production to have their own Quality Assurance test sets, and whisperkittools offers
the tooling necessary to run the same measurements on such custom test sets. Please see the [Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.

### Reproducing Results
Results on this page are generated by our cluster of Apple Silicon Macs. We use them as self-hosted runners on
GitHub Actions as our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than 1 day to complete the same evaluation.

Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).
- `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the
resulting total file size since this is what matters to developers in production.
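As an illustrative sketch of how a mixed-precision spec maps to a total file size: on-disk size is roughly the sum over modules of parameter count times average bits per weight. The parameter counts below are rough assumed values for a large-v2-class model, not official numbers:

```python
def estimated_size_mb(param_counts, bits_per_weight):
    """Approximate on-disk size in MB given per-module parameter counts
    and average quantization bit-widths. Illustrative only: ignores
    metadata, activations, and container overhead."""
    total_bytes = sum(
        param_counts[name] * bits_per_weight[name] / 8 for name in param_counts
    )
    return total_bytes / 1e6

# Assumed (not official) parameter counts for a large-v2-class model:
params = {"AudioEncoder": 635e6, "TextDecoder": 907e6}
# Bit-widths taken from the example compression spec in the glossary above:
bits = {"AudioEncoder": 5.8, "TextDecoder": 6.1}
print(round(estimated_size_mb(params, bits)))  # 1152 (same order as the ~1050MB variants)
```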