sino committed on
Commit
d0ec5fd
1 Parent(s): a6aa4e3

Upload README.md

Files changed (1)
  1. README.md +118 -10
README.md CHANGED
@@ -1,13 +1,121 @@
  ---
  tags:
- - music
- - text-generation-inference
  ---
- # Environment:
- ## step 1:
- conda create -name SpectPrompt python=3.9
- ## step 2:
- pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- pip install transformers datasets librosa einops_exts einops mmcls peft ipdb torchlibrosa
- pip install -U openmim
- mim install mmcv==1.7.1
  ---
+ language:
+ - zh
+ - en
  tags:
+ - qwen
+ pipeline_tag: text-generation
+ inference: false
  ---
+
+ # Qwen-Audio
+
+ <br>
+
+ <p align="center">
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/audio_logo.jpg" width="400"/>
+ </p>
+ <br>
+
+ <p align="center">
+ Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a>&nbsp; | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a>&nbsp; | &nbsp;&nbsp;Demo <a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary">🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a>&nbsp;
+ <br>
+ &nbsp;&nbsp;<a href="https://qwen-audio.github.io/Qwen-Audio/">Homepage</a>&nbsp; | &nbsp;<a href="http://arxiv.org/abs/2311.07919">Paper</a> | &nbsp;<a href="https://huggingface.co/papers/2311.07919">🤗</a>
+ </p>
+ <br><br>
+
+ **Qwen-Audio** (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen) proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs and produces text as output. The contributions of Qwen-Audio include:
+
+ - **Fundamental audio models**: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
+ - **Multi-task learning framework for all types of audio**: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.
+ - **Strong Performance**: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
+ - **Flexible multi-turn chat from audio and text input**: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing.
+
+ **Qwen-Audio** is a large-scale audio-language model developed by Alibaba Cloud. Qwen-Audio accepts diverse audio (including human speech, natural sound, music, and song) and text as inputs and produces text as output. The features of the Qwen-Audio series include:
+
+ - **Fundamental audio model**: Qwen-Audio is a high-performing, general-purpose audio understanding model that supports a wide range of tasks, languages, and audio types. Building on Qwen-Audio, we developed Qwen-Audio-Chat through instruction fine-tuning, which supports multi-turn, multilingual dialogue. Both Qwen-Audio and Qwen-Audio-Chat are open-sourced.
+ - **Multi-task learning framework covering diverse, complex audio**: To avoid the one-to-many (audio-to-text) interference caused by differing data sources and task types, we propose a multi-task training framework that shares knowledge across similar tasks while minimizing interference between different ones. With this framework, Qwen-Audio is trained on more than 30 distinct audio tasks;
+ - **Strong performance**: Without any task-specific fine-tuning, Qwen-Audio achieves leading results on a wide range of benchmark tasks. Specifically, it reaches state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound;
+ - **Multi-turn audio and text dialogue across diverse speech scenarios**: Qwen-Audio-Chat supports sound understanding and reasoning, music appreciation, multi-audio analysis, multi-turn interleaved audio-text dialogue, and the use of external speech tools (e.g., speech editing).
+
+
+ We release Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more details about Qwen-Audio, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen-Audio/tree/main). This repository hosts Qwen-Audio.
+ <br>
+
+ At present, we provide two models, Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more information, please follow the [link](https://github.com/QwenLM/Qwen-Audio/tree/main) to the GitHub repository. This repository is for Qwen-Audio.
+
+
+ ## Requirements
+ * Python 3.8 and above
+ * PyTorch 1.12 and above; 2.0 and above are recommended
+ * CUDA 11.4 and above is recommended (for GPU users)
+ * FFmpeg
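+
+ A minimal sanity check for this environment is sketched below (it assumes PyTorch is already installed; nothing here is specific to Qwen-Audio):
+
+ ```python
+ # Quick environment check for the requirements listed above.
+ import shutil
+ import sys
+
+ import torch
+
+ print(sys.version)                # Python 3.8 or above
+ print(torch.__version__)          # PyTorch 1.12 or above (2.0+ recommended)
+ print(torch.version.cuda)         # CUDA version this PyTorch build targets (11.4+ recommended)
+ print(torch.cuda.is_available())  # True if a GPU is visible
+ print(shutil.which("ffmpeg"))     # path to the FFmpeg binary, or None if it is missing
+ ```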
+ <br>
+
+ ## Quickstart
+ Below, we provide simple examples to show how to use Qwen-Audio with 🤗 Transformers.
+
+ Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+ For more details, please refer to the [tutorial](https://github.com/QwenLM/Qwen-Audio).
+
+ #### 🤗 Transformers
+
+ To use Qwen-Audio for inference, all you need to do is input a few lines of code as demonstrated below. However, **please make sure that you are using the latest code.**
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers.generation import GenerationConfig
+ import torch
+ torch.manual_seed(1234)
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
+
+ # use bf16
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
+ # use fp16
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
+ # use cpu only
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
+ # use cuda device
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()
+
+ # Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
+ # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
+ audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"
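+ # The prompt below is assembled from Qwen-Audio's special tokens; the token names indicate the task:
+ # start of transcription, English audio, transcribe, English output, no timestamps, without inverse text normalization.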
+ sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
+ query = f"<audio>{audio_url}</audio>{sp_prompt}"
+ audio_info = tokenizer.process_audio(query)
+ inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
+ inputs = inputs.to(model.device)
+ pred = model.generate(**inputs, audio_info=audio_info)
+ response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
+ print(response)
+ # <audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
+ ```
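+
+ If you want to transcribe several files, the calls above can be wrapped in a small helper. The sketch below simply reuses the `model` and `tokenizer` loaded above; the `transcribe` helper is a hypothetical convenience, not part of the Qwen-Audio API:
+
+ ```python
+ # Hypothetical wrapper around the exact calls shown in the example above.
+ def transcribe(audio_url: str) -> str:
+     sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
+     query = f"<audio>{audio_url}</audio>{sp_prompt}"
+     audio_info = tokenizer.process_audio(query)
+     inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info).to(model.device)
+     pred = model.generate(**inputs, audio_info=audio_info)
+     return tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
+
+ print(transcribe("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"))
+ ```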
+
+
+ ## License Agreement
+ Researchers and developers are free to use the code and model weights of Qwen-Audio. We also allow its commercial use. Check our license at [LICENSE](https://github.com/QwenLM/Qwen-Audio/blob/main/LICENSE.txt) for more details.
+ <br>
+
+ ## Citation
+ If you find our paper and code useful in your research, please consider giving us a star and citing our work.
+
+ ```BibTeX
+ @article{Qwen-Audio,
+   title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
+   author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2311.07919},
+   year={2023}
+ }
+ ```
+ <br>
+
+ ## Contact Us
+
+ If you are interested in leaving a message for either our research team or product team, feel free to send an email to [email protected].
+