Commit a733f91
Duplicate from anton-l/common_voice_generator
Co-authored-by: Anton Lozhkov <[email protected]>
- .gitattributes +27 -0
- .gitignore +1 -0
- README.md +16 -0
- README.template.md +241 -0
- dataset_script.py +260 -0
- generate_datasets.py +135 -0
- languages.ftl +181 -0
- publish.py +3 -0
- test.py +5 -0
.gitattributes
ADDED
@@ -0,0 +1,27 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1 @@
+common_voice_*
README.md
ADDED
@@ -0,0 +1,16 @@
+---
+duplicated_from: anton-l/common_voice_generator
+---
+## Common Voice release generator
+
+1. Copy the latest release id from the `RELEASES` dict in https://github.com/common-voice/common-voice/blob/main/web/src/components/pages/datasets/releases.ts
+   to the `VERSIONS` variable in `generate_datasets.py`.
+2. Copy the languages from https://github.com/common-voice/common-voice/blob/release-v1.78.0/web/locales/en/messages.ftl
+   (replacing `release-v1.78.0` with the latest version tag) to the `languages.ftl` file.
+3. Run `python generate_datasets.py` to generate the dataset repos.
+4. `cd ..`
+5. `huggingface-cli repo create --type dataset --organization mozilla-foundation common_voice_11_0`
+6. `git clone https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0`
+7. `cd common_voice_11_0`
+8. `cp ../common_voice_generator/common_voice_11_0/* ./`
+9. `git add . && git commit -m "Release" && git push`
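Usage note (not part of this commit): steps 5–9 above can also be scripted. A minimal sketch using the `huggingface_hub` Python API, assuming a recent `huggingface_hub` and a logged-in user; the repo id mirrors step 5:

```python
from huggingface_hub import HfApi

# Hypothetical automation of steps 5-9: create the dataset repo and
# upload the generated files in one commit instead of cloning via git.
api = HfApi()
api.create_repo("mozilla-foundation/common_voice_11_0", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="common_voice_11_0",  # output of generate_datasets.py
    repo_id="mozilla-foundation/common_voice_11_0",
    repo_type="dataset",
    commit_message="Release",
)
```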
README.template.md
ADDED
@@ -0,0 +1,241 @@
+---
+pretty_name: {{NAME}}
+annotations_creators:
+- crowdsourced
+language_creators:
+- crowdsourced
+language_bcp47:
+{{LANGUAGES}}
+license:
+- cc0-1.0
+multilinguality:
+- multilingual
+size_categories:
+{{SIZES}}
+source_datasets:
+- extended|common_voice
+task_categories:
+- speech-processing
+task_ids:
+- automatic-speech-recognition
+paperswithcode_id: common-voice
+extra_gated_prompt: "By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset."
+---
+
+# Dataset Card for {{NAME}}
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+
+## Dataset Description
+
+- **Homepage:** https://commonvoice.mozilla.org/en/datasets
+- **Repository:** https://github.com/common-voice/common-voice
+- **Paper:** https://arxiv.org/abs/1912.06670
+- **Leaderboard:** https://paperswithcode.com/dataset/common-voice
+- **Point of Contact:** [Anton Lozhkov](mailto:[email protected])
+
+### Dataset Summary
+
+The Common Voice dataset consists of a unique MP3 and corresponding text file.
+Many of the {{TOTAL_HRS}} recorded hours in the dataset also include demographic metadata like age, sex, and accent
+that can help improve the accuracy of speech recognition engines.
+
+The dataset currently consists of {{VAL_HRS}} validated hours in {{NUM_LANGS}} languages, but more voices and languages are always added.
+Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.
+
+### Supported Tasks and Leaderboards
+
+The results for models trained on the Common Voice datasets are available via the
+[🤗 Speech Bench](https://huggingface.co/spaces/huggingface/hf-speech-bench).
+
+### Languages
+
+```
+{{LANGUAGES_HUMAN}}
+```
+
+## Dataset Structure
+
+### Data Instances
+
+A typical data point comprises the `path` to the audio file and its `sentence`.
+Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.
+
+```python
+{
+    'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
+    'path': 'et/clips/common_voice_et_18318995.mp3',
+    'audio': {
+        'path': 'et/clips/common_voice_et_18318995.mp3',
+        'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
+        'sampling_rate': 48000
+    },
+    'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
+    'up_votes': 2,
+    'down_votes': 0,
+    'age': 'twenties',
+    'gender': 'male',
+    'accent': '',
+    'locale': 'et',
+    'segment': ''
+}
+```
+
+### Data Fields
+
+`client_id` (`string`): An id for which client (voice) made the recording
+
+`path` (`string`): The path to the audio file
+
+`audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
+
+`sentence` (`string`): The sentence the user was prompted to speak
+
+`up_votes` (`int64`): How many upvotes the audio file has received from reviewers
+
+`down_votes` (`int64`): How many downvotes the audio file has received from reviewers
+
+`age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)
+
+`gender` (`string`): The gender of the speaker
+
+`accent` (`string`): Accent of the speaker
+
+`locale` (`string`): The locale of the speaker
+
+`segment` (`string`): Usually an empty field
+
+### Data Splits
+
+The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.
+
+The validated data is data that has been reviewed and received enough upvotes to be considered of high quality.
+
+The invalidated data is data that has been invalidated by reviewers
+and received downvotes indicating that the data is of low quality.
+
+The reported data is data that has been reported for various reasons.
+
+The other data is data that has not yet been reviewed.
+
+The dev, test and train splits all contain data that has been reviewed and deemed of high quality.
+
+## Data Preprocessing Recommended by Hugging Face
+
+The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them into practice.
+
+Many examples in this dataset have trailing quotation marks, e.g. _“the cat sat on the mat.”_. These trailing quotation marks do not change the actual meaning of the sentence, and it is nearly impossible to infer whether a sentence is a quotation or not from audio data alone. In these cases, it is advised to strip the quotation marks, leaving: _the cat sat on the mat_.
+
+In addition, the majority of training sentences end in punctuation ( . or ? or ! ), whereas just a small proportion do not. In the dev set, **almost all** sentences end in punctuation. Thus, it is recommended to append a full-stop ( . ) to the end of the small number of training examples that do not end in punctuation.
+
+```python
+from datasets import load_dataset
+
+ds = load_dataset("mozilla-foundation/{{DATASET_PATH}}", "en", use_auth_token=True)
+
+def prepare_dataset(batch):
+    """Function to preprocess the dataset with the .map method"""
+    transcription = batch["sentence"]
+
+    if transcription.startswith('"') and transcription.endswith('"'):
+        # we can remove trailing quotation marks as they do not affect the transcription
+        transcription = transcription[1:-1]
+
+    if transcription[-1] not in [".", "?", "!"]:
+        # append a full-stop to sentences that do not end in punctuation
+        transcription = transcription + "."
+
+    batch["sentence"] = transcription
+
+    return batch
+
+ds = ds.map(prepare_dataset, desc="preprocess dataset")
+```
+
+## Dataset Creation
+
+### Curation Rationale
+
+[Needs More Information]
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+[Needs More Information]
+
+#### Who are the source language producers?
+
+[Needs More Information]
+
+### Annotations
+
+#### Annotation process
+
+[Needs More Information]
+
+#### Who are the annotators?
+
+[Needs More Information]
+
+### Personal and Sensitive Information
+
+The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+[More Information Needed]
+
+### Licensing Information
+
+Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
+
+### Citation Information
+
+```
+@inproceedings{commonvoice:2020,
+  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+  title = {Common Voice: A Massively-Multilingual Speech Corpus},
+  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+  pages = {4211--4215},
+  year = 2020
+}
+```
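Usage note (not part of the template): since the `audio` column decodes and resamples lazily, as described under `Data Fields` above, a common follow-up is casting it to the sampling rate a model expects. A minimal sketch, assuming access to the gated repo and a 16 kHz model:

```python
from datasets import Audio, load_dataset

# Hypothetical example: resample on the fly to 16 kHz instead of the native 48 kHz.
ds = load_dataset("mozilla-foundation/common_voice_11_0", "et", split="test", use_auth_token=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0]["audio"]["sampling_rate"])  # 16000; decoding happens at access time
```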
dataset_script.py
ADDED
@@ -0,0 +1,260 @@
+# coding=utf-8
+# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Common Voice Dataset"""
+
+
+import csv
+import os
+import urllib.parse
+
+import datasets
+import requests
+from datasets.utils.py_utils import size_str
+from huggingface_hub import HfApi, HfFolder
+
+from .languages import LANGUAGES
+from .release_stats import STATS
+
+_CITATION = """\
+@inproceedings{commonvoice:2020,
+  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+  title = {Common Voice: A Massively-Multilingual Speech Corpus},
+  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+  pages = {4211--4215},
+  year = 2020
+}
+"""
+
+_HOMEPAGE = "https://commonvoice.mozilla.org/en/datasets"
+
+_LICENSE = "https://creativecommons.org/publicdomain/zero/1.0/"
+
+_API_URL = "https://commonvoice.mozilla.org/api/v1"
+
+
+class CommonVoiceConfig(datasets.BuilderConfig):
+    """BuilderConfig for CommonVoice."""
+
+    def __init__(self, name, version, **kwargs):
+        self.language = kwargs.pop("language", None)
+        self.release_date = kwargs.pop("release_date", None)
+        self.num_clips = kwargs.pop("num_clips", None)
+        self.num_speakers = kwargs.pop("num_speakers", None)
+        self.validated_hr = kwargs.pop("validated_hr", None)
+        self.total_hr = kwargs.pop("total_hr", None)
+        self.size_bytes = kwargs.pop("size_bytes", None)
+        self.size_human = size_str(self.size_bytes)
+        description = (
+            f"Common Voice speech to text dataset in {self.language} released on {self.release_date}. "
+            f"The dataset comprises {self.validated_hr} hours of validated transcribed speech data "
+            f"out of {self.total_hr} hours in total from {self.num_speakers} speakers. "
+            f"The dataset contains {self.num_clips} audio clips and has a size of {self.size_human}."
+        )
+        super(CommonVoiceConfig, self).__init__(
+            name=name,
+            version=datasets.Version(version),
+            description=description,
+            **kwargs,
+        )
+
+
+class CommonVoice(datasets.GeneratorBasedBuilder):
+    DEFAULT_CONFIG_NAME = "en"
+    DEFAULT_WRITER_BATCH_SIZE = 1000
+
+    # One config per locale, populated from the release statistics.
+    BUILDER_CONFIGS = [
+        CommonVoiceConfig(
+            name=lang,
+            version=STATS["version"],
+            language=LANGUAGES[lang],
+            release_date=STATS["date"],
+            num_clips=lang_stats["clips"],
+            num_speakers=lang_stats["users"],
+            validated_hr=float(lang_stats["validHrs"]) if lang_stats["validHrs"] else None,
+            total_hr=float(lang_stats["totalHrs"]) if lang_stats["totalHrs"] else None,
+            size_bytes=int(lang_stats["size"]) if lang_stats["size"] else None,
+        )
+        for lang, lang_stats in STATS["locales"].items()
+    ]
+
+    def _info(self):
+        total_languages = len(STATS["locales"])
+        total_valid_hours = STATS["totalValidHrs"]
+        description = (
+            "Common Voice is Mozilla's initiative to help teach machines how real people speak. "
+            f"The dataset currently consists of {total_valid_hours} validated hours of speech "
+            f"in {total_languages} languages, but more voices and languages are always added."
+        )
+        features = datasets.Features(
+            {
+                "client_id": datasets.Value("string"),
+                "path": datasets.Value("string"),
+                "audio": datasets.features.Audio(sampling_rate=48_000),
+                "sentence": datasets.Value("string"),
+                "up_votes": datasets.Value("int64"),
+                "down_votes": datasets.Value("int64"),
+                "age": datasets.Value("string"),
+                "gender": datasets.Value("string"),
+                "accent": datasets.Value("string"),
+                "locale": datasets.Value("string"),
+                "segment": datasets.Value("string"),
+            }
+        )
+
+        return datasets.DatasetInfo(
+            description=description,
+            features=features,
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+            version=self.config.version,
+            # task_templates=[
+            #     AutomaticSpeechRecognition(audio_file_path_column="path", transcription_column="sentence")
+            # ],
+        )
+
+    def _get_bundle_url(self, locale, url_template):
+        # path = encodeURIComponent(path)
+        path = url_template.replace("{locale}", locale)
+        path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
+        # use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
+        # response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
+        response = requests.get(f"{_API_URL}/bucket/dataset/{path}", timeout=10.0).json()
+        return response["url"]
+
+    def _log_download(self, locale, bundle_version, auth_token):
+        if isinstance(auth_token, bool):
+            auth_token = HfFolder().get_token()
+        whoami = HfApi().whoami(auth_token)
+        email = whoami["email"] if "email" in whoami else ""
+        payload = {"email": email, "locale": locale, "dataset": bundle_version}
+        requests.post(f"{_API_URL}/{locale}/downloaders", json=payload).json()
+
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        hf_auth_token = dl_manager.download_config.use_auth_token
+        if hf_auth_token is None:
+            raise ConnectionError(
+                "Please set use_auth_token=True or use_auth_token='<TOKEN>' to download this dataset"
+            )
+
+        bundle_url_template = STATS["bundleURLTemplate"]
+        bundle_version = bundle_url_template.split("/")[0]
+        dl_manager.download_config.ignore_url_params = True
+
+        self._log_download(self.config.name, bundle_version, hf_auth_token)
+        archive_path = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
+        local_extracted_archive = dl_manager.extract(archive_path) if not dl_manager.is_streaming else None
+
+        if self.config.version < datasets.Version("5.0.0"):
+            path_to_data = ""
+        else:
+            path_to_data = "/".join([bundle_version, self.config.name])
+        path_to_clips = "/".join([path_to_data, "clips"]) if path_to_data else "clips"
+
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "local_extracted_archive": local_extracted_archive,
+                    "archive_iterator": dl_manager.iter_archive(archive_path),
+                    "metadata_filepath": "/".join([path_to_data, "train.tsv"]) if path_to_data else "train.tsv",
+                    "path_to_clips": path_to_clips,
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "local_extracted_archive": local_extracted_archive,
+                    "archive_iterator": dl_manager.iter_archive(archive_path),
+                    "metadata_filepath": "/".join([path_to_data, "test.tsv"]) if path_to_data else "test.tsv",
+                    "path_to_clips": path_to_clips,
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "local_extracted_archive": local_extracted_archive,
+                    "archive_iterator": dl_manager.iter_archive(archive_path),
+                    "metadata_filepath": "/".join([path_to_data, "dev.tsv"]) if path_to_data else "dev.tsv",
+                    "path_to_clips": path_to_clips,
+                },
+            ),
+            datasets.SplitGenerator(
+                name="other",
+                gen_kwargs={
+                    "local_extracted_archive": local_extracted_archive,
+                    "archive_iterator": dl_manager.iter_archive(archive_path),
+                    "metadata_filepath": "/".join([path_to_data, "other.tsv"]) if path_to_data else "other.tsv",
+                    "path_to_clips": path_to_clips,
+                },
+            ),
+            datasets.SplitGenerator(
+                name="invalidated",
+                gen_kwargs={
+                    "local_extracted_archive": local_extracted_archive,
+                    "archive_iterator": dl_manager.iter_archive(archive_path),
+                    "metadata_filepath": "/".join([path_to_data, "invalidated.tsv"])
+                    if path_to_data
+                    else "invalidated.tsv",
+                    "path_to_clips": path_to_clips,
+                },
+            ),
+        ]
+
+    def _generate_examples(
+        self,
+        local_extracted_archive,
+        archive_iterator,
+        metadata_filepath,
+        path_to_clips,
+    ):
+        """Yields examples."""
+        data_fields = list(self._info().features.keys())
+        metadata = {}
+        metadata_found = False
+        # The archive stores the metadata TSV before the clips, so a single pass
+        # works: read the TSV into `metadata` first, then match each clip against it.
+        for path, f in archive_iterator:
+            if path == metadata_filepath:
+                metadata_found = True
+                lines = (line.decode("utf-8") for line in f)
+                reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
+                for row in reader:
+                    # set absolute path for mp3 audio file
+                    if not row["path"].endswith(".mp3"):
+                        row["path"] += ".mp3"
+                    row["path"] = os.path.join(path_to_clips, row["path"])
+                    # accent -> accents in CV 8.0
+                    if "accents" in row:
+                        row["accent"] = row["accents"]
+                        del row["accents"]
+                    # if data is incomplete, fill with empty values
+                    for field in data_fields:
+                        if field not in row:
+                            row[field] = ""
+                    metadata[row["path"]] = row
+            elif path.startswith(path_to_clips):
+                assert metadata_found, "Found audio clips before the metadata TSV file."
+                if not metadata:
+                    break
+                if path in metadata:
+                    result = dict(metadata[path])
+                    # set the audio feature and the path to the extracted file
+                    path = os.path.join(local_extracted_archive, path) if local_extracted_archive else path
+                    result["audio"] = {"path": path, "bytes": f.read()}
+                    # set path to None if the audio file doesn't exist locally (i.e. in streaming mode)
+                    result["path"] = path if local_extracted_archive else None
+
+                    yield path, result
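Usage note (not part of this commit): the generated script is consumed through `datasets` like any hub loader; streaming exercises the `iter_archive` path in `_generate_examples` without extracting the TAR. A sketch, assuming access to the gated repo (the `"et"` config is just an example):

```python
from datasets import load_dataset

# Stream the Estonian training split; _split_generators and _generate_examples
# above then iterate the remote archive without a local extraction step.
ds = load_dataset(
    "mozilla-foundation/common_voice_11_0", "et",
    split="train", streaming=True, use_auth_token=True,
)
print(next(iter(ds))["sentence"])
```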
generate_datasets.py
ADDED
@@ -0,0 +1,135 @@
+import json
+import os
+import shutil
+
+import requests
+
+RELEASE_STATS_URL = "https://commonvoice.mozilla.org/dist/releases/{}.json"
+VERSIONS = [
+    {"semver": "1.0.0", "name": "common_voice_1_0", "release": "cv-corpus-1"},
+    {"semver": "2.0.0", "name": "common_voice_2_0", "release": "cv-corpus-2"},
+    {"semver": "3.0.0", "name": "common_voice_3_0", "release": "cv-corpus-3"},
+    {"semver": "4.0.0", "name": "common_voice_4_0", "release": "cv-corpus-4-2019-12-10"},
+    {"semver": "5.0.0", "name": "common_voice_5_0", "release": "cv-corpus-5-2020-06-22"},
+    {"semver": "5.1.0", "name": "common_voice_5_1", "release": "cv-corpus-5.1-2020-06-22"},
+    {"semver": "6.0.0", "name": "common_voice_6_0", "release": "cv-corpus-6.0-2020-12-11"},
+    {"semver": "6.1.0", "name": "common_voice_6_1", "release": "cv-corpus-6.1-2020-12-11"},
+    {"semver": "7.0.0", "name": "common_voice_7_0", "release": "cv-corpus-7.0-2021-07-21"},
+    {"semver": "8.0.0", "name": "common_voice_8_0", "release": "cv-corpus-8.0-2022-01-19"},
+    {"semver": "9.0.0", "name": "common_voice_9_0", "release": "cv-corpus-9.0-2022-04-27"},
+    {"semver": "10.0.0", "name": "common_voice_10_0", "release": "cv-corpus-10.0-2022-07-04"},
+    {"semver": "11.0.0", "name": "common_voice_11_0", "release": "cv-corpus-11.0-2022-09-21"},
+]
+
+
+def num_to_size(num: int):
+    if num < 1000:
+        return "n<1K"
+    elif num < 10_000:
+        return "1K<n<10K"
+    elif num < 100_000:
+        return "10K<n<100K"
+    elif num < 1_000_000:
+        return "100K<n<1M"
+    elif num < 10_000_000:
+        return "1M<n<10M"
+    elif num < 100_000_000:
+        return "10M<n<100M"
+    elif num < 1_000_000_000:
+        return "100M<n<1B"
+
+
+def get_language_names():
+    # source: https://github.com/common-voice/common-voice/blob/release-v1.71.0/web/locales/en/messages.ftl
+    languages = {}
+    with open("languages.ftl") as fin:
+        for line in fin:
+            lang_code, lang_name = line.strip().split(" = ")
+            languages[lang_code] = lang_name
+
+    return languages
+
+
+def main():
+    language_names = get_language_names()
+
+    for version in VERSIONS:
+        stats_url = RELEASE_STATS_URL.format(version["release"])
+        release_stats = requests.get(stats_url).text
+        release_stats = json.loads(release_stats)
+        release_stats["version"] = version["semver"]
+
+        dataset_path = version["name"]
+        os.makedirs(dataset_path, exist_ok=True)
+        with open(f"{dataset_path}/release_stats.py", "w") as fout:
+            fout.write("STATS = " + str(release_stats))
+
+        with open("README.template.md", "r") as fin:
+            readme = fin.read()
+        readme = readme.replace("{{NAME}}", release_stats["name"])
+        readme = readme.replace("{{DATASET_PATH}}", version["name"])
+
+        locales = sorted(release_stats["locales"].keys())
+        languages = [f"- {loc}" for loc in locales]
+        readme = readme.replace("{{LANGUAGES}}", "\n".join(languages))
+
+        sizes = [f" {loc}:\n - {num_to_size(release_stats['locales'][loc]['clips'])}" for loc in locales]
+        readme = readme.replace("{{SIZES}}", "\n".join(sizes))
+
+        languages_human = sorted([language_names[loc] for loc in locales])
+        readme = readme.replace("{{LANGUAGES_HUMAN}}", ", ".join(languages_human))
+
+        readme = readme.replace("{{TOTAL_HRS}}", str(release_stats["totalHrs"]))
+        readme = readme.replace("{{VAL_HRS}}", str(release_stats["totalValidHrs"]))
+        readme = readme.replace("{{NUM_LANGS}}", str(len(locales)))
+
+        with open(f"{dataset_path}/README.md", "w") as fout:
+            fout.write(readme)
+        with open(f"{dataset_path}/languages.py", "w") as fout:
+            fout.write("LANGUAGES = " + str(language_names))
+
+        shutil.copy("dataset_script.py", f"{dataset_path}/{dataset_path}.py")
+
+
+if __name__ == "__main__":
+    main()
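An illustration (not part of this commit) of how `num_to_size` buckets clip counts into the hub's `size_categories` values; boundary checks only, assuming the function is importable from the script above:

```python
from generate_datasets import num_to_size

# Each boundary value falls into the next-larger bucket.
assert num_to_size(999) == "n<1K"
assert num_to_size(1_000) == "1K<n<10K"
assert num_to_size(99_999) == "10K<n<100K"
assert num_to_size(100_000) == "100K<n<1M"
assert num_to_size(9_999_999) == "1M<n<10M"
```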
languages.ftl
ADDED
@@ -0,0 +1,181 @@
+ab = Abkhaz
+ace = Acehnese
+ady = Adyghe
+af = Afrikaans
+am = Amharic
+an = Aragonese
+ar = Arabic
+arn = Mapudungun
+as = Assamese
+ast = Asturian
+az = Azerbaijani
+ba = Bashkir
+bas = Basaa
+be = Belarusian
+bg = Bulgarian
+bn = Bengali
+br = Breton
+bs = Bosnian
+bxr = Buryat
+ca = Catalan
+cak = Kaqchikel
+ckb = Central Kurdish
+cnh = Hakha Chin
+co = Corsican
+cs = Czech
+cv = Chuvash
+cy = Welsh
+da = Danish
+de = German
+dsb = Sorbian, Lower
+dv = Dhivehi
+dyu = Dioula
+el = Greek
+en = English
+eo = Esperanto
+es = Spanish
+et = Estonian
+eu = Basque
+fa = Persian
+ff = Fulah
+fi = Finnish
+fo = Faroese
+fr = French
+fy-NL = Frisian
+ga-IE = Irish
+gl = Galician
+gn = Guarani
+gom = Goan Konkani
+ha = Hausa
+he = Hebrew
+hi = Hindi
+hil = Hiligaynon
+hr = Croatian
+hsb = Sorbian, Upper
+ht = Haitian
+hu = Hungarian
+hy-AM = Armenian
+hyw = Armenian Western
+ia = Interlingua
+id = Indonesian
+ie = Interlingue
+ig = Igbo
+is = Icelandic
+it = Italian
+izh = Izhorian
+ja = Japanese
+jbo = Lojban
+ka = Georgian
+kaa = Karakalpak
+kab = Kabyle
+kbd = Kabardian
+ki = Kikuyu
+kk = Kazakh
+km = Khmer
+kmr = Kurmanji Kurdish
+kn = Kannada
+knn = Konkani (Devanagari)
+ko = Korean
+kpv = Komi-Zyrian
+kw = Cornish
+ky = Kyrgyz
+lb = Luxembourgish
+lg = Luganda
+lij = Ligurian
+ln = Lingala
+lo = Lao
+lt = Lithuanian
+lv = Latvian
+mai = Maithili
+mdf = Moksha
+mg = Malagasy
+mhr = Meadow Mari
+mk = Macedonian
+ml = Malayalam
+mn = Mongolian
+mni = Meetei Lon
+mos = Mossi
+mr = Marathi
+mrj = Hill Mari
+ms = Malay
+mt = Maltese
+my = Burmese
+myv = Erzya
+nan-tw = Taiwanese (Minnan)
+nb-NO = Norwegian Bokmål
+nd = IsiNdebele (North)
+ne-NP = Nepali
+nia = Nias
+nl = Dutch
+nn-NO = Norwegian Nynorsk
+nr = IsiNdebele (South)
+nso = Northern Sotho
+nyn = Runyankole
+oc = Occitan
+om = Afaan Ormoo
+or = Odia
+pa-IN = Punjabi
+pap-AW = Papiamento (Aruba)
+pl = Polish
+ps = Pashto
+pt = Portuguese
+quc = K'iche'
+quy = Quechua Chanka
+rm-sursilv = Romansh Sursilvan
+rm-vallader = Romansh Vallader
+ro = Romanian
+ru = Russian
+rw = Kinyarwanda
+sah = Sakha
+sat = Santali (Ol Chiki)
+sc = Sardinian
+scn = Sicilian
+sdh = Southern Kurdish
+shi = Shilha
+si = Sinhala
+sk = Slovak
+skr = Saraiki
+sl = Slovenian
+snk = Soninke
+so = Somali
+sq = Albanian
+sr = Serbian
+ss = Siswati
+st = Southern Sotho
+sv-SE = Swedish
+sw = Swahili
+syr = Syriac
+ta = Tamil
+te = Telugu
+tg = Tajik
+th = Thai
+ti = Tigrinya
+tig = Tigre
+tk = Turkmen
+tl = Tagalog
+tn = Setswana
+tok = Toki Pona
+tr = Turkish
+ts = Xitsonga
+tt = Tatar
+tw = Twi
+ty = Tahitian
+uby = Ubykh
+udm = Udmurt
+ug = Uyghur
+uk = Ukrainian
+ur = Urdu
+uz = Uzbek
+ve = Tshivenda
+vec = Venetian
+vi = Vietnamese
+vot = Votic
+xh = Xhosa
+yi = Yiddish
+yo = Yoruba
+yue = Cantonese
+zgh = Tamazight
+zh-CN = Chinese (China)
+zh-HK = Chinese (Hong Kong)
+zh-TW = Chinese (Taiwan)
+zu = Zulu
publish.py
ADDED
@@ -0,0 +1,3 @@
+from huggingface_hub import create_repo
+
+create_repo("mozilla-foundation/common_voice_10_0", repo_type="dataset")
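A side note (not part of this commit): `create_repo` raises if the repository already exists, so for re-runs of the release flow an idempotent variant may be preferable. A sketch, assuming a `huggingface_hub` version that supports `exist_ok`:

```python
from huggingface_hub import create_repo

# exist_ok=True makes the call a no-op on re-runs instead of raising.
create_repo("mozilla-foundation/common_voice_10_0", repo_type="dataset", exist_ok=True)
```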
test.py
ADDED
@@ -0,0 +1,5 @@
+from datasets import load_dataset
+
+dataset = load_dataset("./common_voice_11_0", "et", split="test", use_auth_token=True)
+print(dataset)
+print(dataset[100])