---
library_name: keras-hub
license: mit
tags:
- speech-recognition
- keras
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
---

### Model Overview

⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.

A Whisper encoder-decoder network for speech.

This class implements a Transformer-based encoder-decoder model as
described in
["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356).
It includes the embedding lookups and transformer layers, but not the head
for predicting the next token.

The default constructor gives a fully customizable, randomly initialized Whisper
model with any number of layers, heads, and embedding dimensions. To load
preset architectures and weights, use the `from_preset()` constructor.

Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/openai/whisper).

__Arguments__

- __vocabulary_size__: int. The size of the token vocabulary.
- __num_layers__: int. The number of transformer encoder layers and
    transformer decoder layers.
- __num_heads__: int. The number of attention heads for each transformer.
    `hidden_dim` must be divisible by `num_heads`.
- __hidden_dim__: int. The hidden size of the transformer encoder and decoder layers.
- __intermediate_dim__: int. The output dimension of the first Dense layer in
    a two-layer feedforward network for each transformer.
- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
- __dropout__: float. Dropout probability for the Transformer encoder.
- __max_encoder_sequence_length__: int. The maximum sequence length that the
    audio encoder can consume. Since the second convolutional layer in
    the encoder reduces the sequence length by half (stride of 2), we
    use `max_encoder_sequence_length // 2` as the sequence length for the
    positional embedding layer.
- __max_decoder_sequence_length__: int. The maximum sequence length that the
    text decoder can consume.
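
The two shape rules in the argument list (head divisibility and the halved encoder position length) can be checked before building a model. The following is a minimal sketch in plain Python. `summarize_whisper_config` is a hypothetical helper written for illustration, not part of `keras_hub`, and the sample values are arbitrary. The per-head size assumes standard multi-head attention, where the hidden size is split evenly across heads.

```python
def summarize_whisper_config(hidden_dim, num_heads, max_encoder_sequence_length):
    """Hypothetical helper: validate a config and derive sizes from the documented rules."""
    # Documented rule: the hidden size must be divisible by the number of heads.
    if hidden_dim % num_heads != 0:
        raise ValueError(
            f"hidden_dim ({hidden_dim}) must be divisible by num_heads ({num_heads})."
        )
    return {
        # Assumption: standard multi-head attention splits hidden_dim evenly per head.
        "head_dim": hidden_dim // num_heads,
        # Documented rule: the stride-2 convolution halves the encoder sequence length.
        "encoder_position_length": max_encoder_sequence_length // 2,
    }


summary = summarize_whisper_config(
    hidden_dim=256, num_heads=4, max_encoder_sequence_length=128
)
print(summary)
```

A check like this fails early with a readable message instead of surfacing as a shape error inside the attention layers.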

## Example Usage

```python
import keras
import keras_hub
import numpy as np

input_data = {
    # Audio features with shape (batch, frames, num_mels); these are floats.
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
    # Decoder token ids with shape (batch, sequence_length).
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    # 1 marks a real token, 0 marks padding.
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```
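
The `decoder_padding_mask` in the example is written out by hand. For batches of variable-length token sequences it can be derived from the sequences themselves. The sketch below uses only NumPy. `make_decoder_inputs` and the padding id of `0` are assumptions made for illustration, not part of the `keras_hub` API, and the token ids are made up.

```python
import numpy as np


def make_decoder_inputs(token_id_lists, sequence_length, pad_id=0):
    """Hypothetical helper: pad token id lists and build the matching padding mask."""
    batch_size = len(token_id_lists)
    # Start from all-padding arrays, then fill in the real tokens row by row.
    token_ids = np.full((batch_size, sequence_length), pad_id, dtype="int32")
    padding_mask = np.zeros((batch_size, sequence_length), dtype="int32")
    for row, ids in enumerate(token_id_lists):
        ids = ids[:sequence_length]  # Truncate sequences that are too long.
        token_ids[row, : len(ids)] = ids
        padding_mask[row, : len(ids)] = 1  # 1 marks a real token, 0 marks padding.
    return token_ids, padding_mask


# Two made-up sequences of lengths 10 and 3, padded to length 12.
token_ids, padding_mask = make_decoder_inputs(
    [[1] * 10, [5, 6, 7]], sequence_length=12
)
print(padding_mask)
```

The first row of `padding_mask` matches the hand-written mask in the example above: ten ones followed by two zeros.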

## Example Usage with Hugging Face URI

```python
import keras
import keras_hub
import numpy as np

input_data = {
    # Audio features with shape (batch, frames, num_mels); these are floats.
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
    # Decoder token ids with shape (batch, sequence_length).
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    # 1 marks a real token, 0 marks padding.
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```