Introducing the YouTube-Commons Dataset
Overview: The YouTube Commons Dataset is a comprehensive collection of 30 billion words from 15,112,121 original and automatically translated transcripts, drawn from 2,063,066 videos on YouTube.
License: All videos are shared under the CC-BY license, with the majority (71%) in English.
Applications: This dataset is well suited to training AI models for speech-to-text conversion (ASR) and for translation.
Utilization: The text can be used for model training and may be republished for reproducibility purposes.
Collaboration: This dataset is the result of a collaboration between state start-up LANGU:IA, the French Ministry of Culture, and DINUM. It will be expanded in the coming months.
Explore the dataset here: https://lnkd.in/d_paWKFE
#YouTubeCommons #AIResearch #MachineLearning #OpenData #ArtificialIntelligence #NLP #Dataset #TechCollaboration #Innovation #DigitalTransformation