---
title: Automatic speech recognition
sdk: gradio
app_file: src/app.py
python_version: 3.11
sdk_version: 4.44.0
app_port: 7860
tags: [asr, stt, speech-to-text, whisper, pyannote, diarization]
pinned: true
emoji: 👂
---

# Automatic speech recognition

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
![Python 3.10](badges/python3_10.svg)
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/tools4eu/asr)

![Screenshot](img/screenshot.jpg)

Automatic speech recognition uses [Whisper](https://github.com/openai/whisper) to transcribe audio files and [pyannote-audio](https://github.com/pyannote/pyannote-audio) to add speaker diarization.

Inference is optimized through batching and scaled dot-product attention (SDPA), or flash attention if available.

> :warning: **Always review transcriptions.** Transcriptions are done using AI models which are never 100% accurate.

The repo contains (or will contain) code to run the software:

- as a command-line tool
- as a graphical interface
- as an inference API

## Installation

### Prerequisites

The host machine must have an Nvidia graphics card with CUDA 12.x installed natively, preferably [CUDA 12.1](https://developer.nvidia.com/cuda-12-1-0-download-archive), even when using Docker.

The graphics card should have at least 12GB VRAM for the largest model.

The host machine must have Docker installed.

For a Linux server, follow [these instructions](https://docs.docker.com/engine/install/).

For a desktop (visual UI available), follow [these instructions](https://www.docker.com/products/docker-desktop/).
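As a quick sanity check before pulling the image, you can verify that both Docker and the NVIDIA driver are available. The sketch below only checks that the relevant commands exist on your PATH; it does not validate the CUDA version or VRAM (run `nvidia-smi` yourself for that):

```python
# Minimal prerequisite check (a sketch): verifies the commands exist on PATH.
import shutil

def check(cmd):
    status = "found" if shutil.which(cmd) else "MISSING"
    print(f"{cmd}: {status}")
    return status

check("docker")      # Docker Engine or Docker Desktop
check("nvidia-smi")  # NVIDIA driver
```

If `nvidia-smi` is found, running it directly also reports the driver's supported CUDA version and the available VRAM.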

### Docker (recommended)

The Docker image is prebuilt and maintained at [tools4eu/docker-asr](https://github.com/tools4eu/docker-asr) and available on Docker Hub as [tools4eu/asr](https://hub.docker.com/repository/docker/tools4eu/asr/general).

Run the Docker image, forwarding port 7860 (used by Gradio) and passing your GPU(s) to the container

`docker run -p 7860:7860 --gpus all tools4eu/asr`

Or in detached mode (in the background)

`docker run -d -p 7860:7860 --gpus all tools4eu/asr`

You can check whether it is running with

`docker ps`

If you want to follow the terminal output of a detached container, you can use

`docker logs -f <first n digits of the container id>`

The first time a transcription is requested, the app downloads the model.
To avoid downloading it on every run, stop and start the same container instead of creating a new one with

`docker run ...`

Reuse the existing container with `docker start <first n digits of the container id>`

You can list all containers, including stopped ones, with

`docker ps -a`

To open the app, open your **browser** and go to `localhost:7860`

### Dev Container

Open the project in Visual Studio Code, press CTRL + SHIFT + P and type "Rebuild and Reopen in Container".

After building, open up a terminal and activate the virtual environment

`source /home/jovyan/venv/bin/activate`

Then run the app

`python src/app.py`

## License

GNU General Public License v3.0 or later

See [COPYING](COPYING) for the full text.