|
---
license: apache-2.0
language:
- en
tags:
- audio-captioning
- audiocaps
- clotho
- dcase-challenge
- icassp-24
---
|
## Summary |
|
This repository contains the configuration and pretrained weights of the model described in the following paper:
|
- **Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation** |
|
Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe |
|
Int. Conf. on Acoustics, Speech, and Signal Processing (**ICASSP**) 2024 |
|
[[arXiv page](https://arxiv.org/abs/2309.17352)] |
|
## GitHub Repository |
|
To use this model, please refer to our code published at: |
|
- https://github.com/slSeanWU/beats-conformer-bart-audio-captioner |
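
For convenience, the config and checkpoint files hosted here can be fetched with `huggingface_hub` before following the setup and inference instructions in the GitHub repository. The snippet below is only a minimal sketch; the `repo_id` is an assumption, so use the ID shown at the top of this model page if it differs.

```python
# Minimal sketch: download this repo's config & pretrained weights, then
# follow the GitHub README for environment setup and captioning inference.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    # Assumed repo ID -- replace with the ID shown on this model page.
    repo_id="slseanwu/beats-conformer-bart-audio-captioner",
)
print(f"Config & checkpoint downloaded to: {local_dir}")
```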
|
## Training Data |
|
- Pretraining
  - **AudioCaps**: https://github.com/cdjkim/audiocaps/tree/master
  - **ChatGPT mix-ups from Clotho**: https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K (a loading sketch follows this list)
- Finetuning
  - **Clotho (V2)**: https://zenodo.org/records/4783391
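
As a quick sanity check, the ChatGPT mix-up captions can be inspected with the `datasets` library. This is only a sketch and assumes the dataset files are in a format that `datasets` can load automatically; check the dataset card for the exact schema and splits.

```python
# Minimal sketch: peek at the Clotho ChatGPT mix-up dataset used for pretraining.
# Assumes the standard `datasets` auto-loader works for this repo; the splits
# and column names are not documented here, so we just print what is available.
from datasets import load_dataset

mixups = load_dataset("slseanwu/clotho-chatgpt-mixup-50K")
print(mixups)  # shows the available splits, columns, and row counts
```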
|
## BibTeX
|
If you find our model useful, please consider citing our paper. Thanks! |
|
```bibtex
@inproceedings{wu2024improving,
  title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
  author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
```