whisperkittools generated README.md
tags:
- coreml
- asr
- quantized
---
# WhisperKit Evaluation Results


## Dataset: `librispeech`
(Short-form Audio (<30s/clip) - 5 hours of English audiobook clips)

| | WER (↓) | QoI (↑) | File Size (MB) |
|:--|--:|--:|--:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech) | 2.35 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 2.04 | 95.2 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech) | 2.03 | 95.4 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB/librispeech) | 1.99 | 94.8 | 1018 |
| [WhisperKit/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech) | 2.77 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_1050MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_1050MB/librispeech) | 2.81 | 95 | 1050 |
| [WhisperKit/openai_whisper-large-v2_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech) | 2.76 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB/librispeech) | 2.66 | 94.9 | 1022 |
| [WhisperKit/openai_whisper-small.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech) | 3.12 | 85.8 | 483 |
| [WhisperKit/openai_whisper-small](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech) | 3.45 | 83 | 483 |
| [WhisperKit/openai_whisper-base.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech) | 3.98 | 75.3 | 145 |
| [WhisperKit/openai_whisper-base](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech) | 4.97 | 67.2 | 145 |
| [WhisperKit/openai_whisper-tiny.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech) | 5.61 | 63.9 | 66 |
| [WhisperKit/openai_whisper-tiny](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech) | 7.47 | 52.5 | 66 |
| [whisper.cpp/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/whisper.cpp/openai_whisper-large-v3/librispeech) | 1.97 | 95.4 | 3100 |

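`WER (↓)` above is the standard word error rate: the word-level edit distance between a hypothesis transcript and the reference transcript, divided by the number of reference words (lower is better). A minimal sketch in Python; the whitespace tokenization and lack of text normalization here are simplifications, and the published numbers come from the evaluation harness described further below:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


# One substituted word out of six reference words -> 1/6 ≈ 0.167 (16.7% WER)
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```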
## Dataset: `earnings22`
(Long-Form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents)

| | WER (↓) | QoI (↑) | File Size (MB) |
|:--|--:|--:|--:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22) | 16.27 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/earnings22) | 15.17 | 58.5 | 3100 |
| [WhisperKit/openai_whisper-base.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/earnings22) | 23.49 | 6.5 | 145 |
| [WhisperKit/openai_whisper-tiny.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/earnings22) | 28.64 | 5.7 | 66 |
| [whisper.cpp/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/whisper.cpp/openai_whisper-large-v3/earnings22) | 33.58 | 6.5 | 3100 |

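Each model name in the tables above links to the raw per-example evaluation artifacts in the `argmaxinc/whisperkit-evals` dataset on the Hugging Face Hub. A sketch of pulling one of those folders locally with `huggingface_hub`; the folder pattern mirrors the link paths above, while the exact file layout inside each folder is not described here and is best confirmed by browsing the links:

```python
from huggingface_hub import snapshot_download

# Fetch only the librispeech results for one model from the evals dataset.
local_dir = snapshot_download(
    repo_id="argmaxinc/whisperkit-evals",
    repo_type="dataset",
    allow_patterns=["WhisperKit/openai_whisper-large-v3/librispeech/*"],
)
print(local_dir)  # local path containing the downloaded evaluation files
```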
We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper
implementations and benchmark them using a consistent evaluation harness:

Server-side:
- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text)
($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:
- `WhisperKit`: Argmax's implementation [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [[Repo]](https://github.com/argmaxinc/WhisperKit)
- `whisper.cpp`: A C++ implementation from ggerganov [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [[Repo]](https://github.com/ggerganov/whisper.cpp)
- `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

(All on-device implementations are available for free under MIT license as of 03/19/2024)

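As a rough illustration of the per-example comparison such a harness enables, here is one way the `QoI (↑)` column above can be read, assuming QoI is the percentage of clips whose WER does not regress relative to the reference implementation on the same clip; the function name and the toy numbers below are illustrative, not the harness's actual code:

```python
def quality_of_inference(reference_wers: list[float], candidate_wers: list[float]) -> float:
    """Percentage of examples where the candidate does not regress vs. the reference."""
    assert reference_wers and len(reference_wers) == len(candidate_wers)
    no_regressions = sum(c <= r for r, c in zip(reference_wers, candidate_wers))
    return 100.0 * no_regressions / len(reference_wers)


# Hypothetical per-clip WERs: the candidate regresses only on the last clip,
# so 2 of 3 clips count as no-regressions. A reference compared against itself
# scores 100, matching the WhisperOpenAIAPI rows in the tables above.
print(quality_of_inference([0.02, 0.05, 0.00], [0.02, 0.04, 0.01]))  # ≈ 66.7
```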
`WhisperOpenAIAPI` sets the reference and we assume that it is using the equivalent of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)
in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below)