---
license: mit
tags:
- AudioMAE
- PyTorch
---

# AudioMAE

This model card provides an easy-to-use API for a pretrained AudioMAE [1] whose weights come from [its original repository](https://github.com/facebookresearch/AudioMAE). The provided model is specifically designed to easily obtain learned representations given an input audio file.

The resulting representation $z$ has a dimension of $(d, h, w)$ for a single audio file, where $d$, $h$, and $w$ denote the latent dimension size, the latent frequency dimension, and the latent temporal dimension, respectively. [2] indicates that both the frequency and temporal semantics are preserved in $z$.

# Usage

```python
model = AudioMAEencoder.from_pretrained("hance-ai/audiomae")  # load the pretrained model
z = model.encode('path/audio_fname.wav')  # (768, 8, 64) = (latent_dim_size, latent_freq_dim, latent_temporal_dim)
```

Depending on the task, a different pooling strategy should be employed. For instance, global average pooling can be used for a classification task, while [2] uses an adaptive pooling. A sketch of both options appears below.

# Sanity Check Result

In the following, a spectrogram of an input audio file and the corresponding $z$ are visualized. The input audio is 10 s long, containing baby coughing, hiccuping, and adult sneezing. For visualization, the latent dimension size of $z$ is reduced to 8 using PCA.
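For concreteness, here is a minimal sketch of the pooling options mentioned above and of a PCA reduction like the one used for this sanity check. It assumes the `AudioMAEencoder` API from the Usage section, PyTorch and scikit-learn as dependencies, and illustrative target sizes; it is not part of the released code.

```python
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

model = AudioMAEencoder.from_pretrained("hance-ai/audiomae")
z = model.encode("path/audio_fname.wav")           # (768, 8, 64)
z = torch.as_tensor(z).detach().float()

# Global average pooling over the frequency and temporal axes,
# e.g., to obtain one feature vector per clip for classification.
z_avg = z.mean(dim=(1, 2))                         # (768,)

# Adaptive average pooling to a fixed grid as an alternative
# ([2] uses an adaptive pooling); the target size here is illustrative.
z_adaptive = F.adaptive_avg_pool2d(z.unsqueeze(0), (4, 32)).squeeze(0)  # (768, 4, 32)

# PCA over the latent dimension for visualization, as in the sanity
# check: treat each (freq, time) position as a sample with 768 features.
z_flat = z.reshape(768, -1).T.numpy()              # (8 * 64, 768)
pcs = PCA(n_components=8).fit_transform(z_flat)    # (512, 8)
pc_maps = pcs.T.reshape(8, 8, 64)                  # 8 PC maps of shape (8, 64)
```

Each of the eight `pc_maps` can then be plotted like a spectrogram to inspect which principal components capture the labeled sound events.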
The result shows that the presence of the labeled sounds is clearly captured in the 3rd principal component (PC). While the baby coughing and hiccuping sounds are not well distinguishable up to the 5th PC, they are separated in the 6th PC. This result briefly demonstrates the effectiveness of the pretrained AudioMAE.

# References

[1] Huang, Po-Yao, et al. "Masked autoencoders that listen." Advances in Neural Information Processing Systems 35 (2022): 28708-28720.

[2] Liu, Haohe, et al. "AudioLDM 2: Learning holistic audio generation with self-supervised pretraining." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024).