nyuuzyou

https://ducks.party/donate

Recent Activity

New activity 3 days ago

nyuuzyou/tamago:[bot] Conversion to Parquet

nyuuzyou's activity

Reacted to AkimfromParis's post with ❤️ 3 days ago

🇯🇵 The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace 🤗 was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: https://huggingface.co/spaces/llm-jp/open-japanese-llm-leaderboard

🌍 The leaderboard is available in both Japanese and English
📚 Based on the evaluation tool llm-jp-eval, with more than 20 datasets for Japanese LLMs
📊 The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
💻 For the comfort of users, we chose a horizontal UI, and implemented it in a light and dark theme on Gradio
🔬 The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform, MDX, so please be patient!
⚡ LLMs larger than 70B will be evaluated soon…

How do you say “GPUs Go Brrr” in Japanese? -> GPUがブンブン～! (pronounced "GPU ga bunbun!") 🔥

4 replies

posted an update 3 days ago

🎵 Introducing Tamago Music Dataset - https://huggingface.co/datasets/nyuuzyou/tamago

A collection of 1,567 music tracks featuring:

- Complete metadata with audio files and cover artwork
- Rich track information including titles, descriptions, and genres
- User engagement metrics like play counts and reactions
- English language content from independent artists
- Released under Creative Commons Zero (CC0) license

Dataset structure includes:
- Track metadata (titles, descriptions, genres, tags)
- Associated media (audio files, cover images)
- Artist information and engagement metrics

Particularly valuable for:
- Music generation model training
- Cross-modal analysis
- Audio classification tasks
- Music style and genre analysis

replied to their post 4 days ago

Thanks! I license almost all of my datasets under CC0, with different modalities and tasks. Maybe somebody can find something else interesting for them in my profile 😉

posted an update 5 days ago

🖼️ Introducing Public Domain Pictures Dataset - nyuuzyou/publicdomainpictures

Dataset highlights:
- 644,412 public domain images with comprehensive metadata from publicdomainpictures.net
- English language metadata including titles, descriptions, and keywords
- Each entry contains rich metadata including:
  - Unique image ID and full-size image URLs
  - Detailed titles and descriptions
  - Keyword/tag collections
  - Creator attribution
- Released to the public domain under Creative Commons Zero (CC0) license

2 replies

posted an update 12 days ago

🎵 Introducing Suno Music Generation Dataset - nyuuzyou/suno

Dataset highlights:

- 659,788 AI-generated music samples with comprehensive metadata from suno.com
- Multilingual content with English as primary language, including Japanese and other languages
- Each entry contains rich metadata including:
  - Unique song ID, audio/video URLs, and thumbnail images
  - AI model version and generation parameters
  - Song metadata (tags, prompts, duration)
  - Creator information and engagement metrics
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset structure includes detailed information about each generated piece, from technical parameters to user engagement metrics, making it particularly valuable for:
- Music generation model training
- Cross-modal analysis (text-to-audio relationships)
- User engagement studies
- Audio classification tasks
- Music style and genre analysis

posted an update 19 days ago

🎓 Introducing Kompy.info Uzbek Educational Dataset - nyuuzyou/kompy

Dataset highlights:
- 584,648 pages of educational content extracted from kompy.info, a comprehensive educational resource website
- Content exclusively in the Uzbek language, focusing on technical and scientific topics
- Each entry contains: URL, page title, and extracted main text content
- Data extracted using the trafilatura HTML extraction tool
- Covers a wide range of academic and educational materials
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset presents a valuable resource for natural language processing tasks in the Uzbek language, particularly in educational and technical domains. It can be used for text classification, topic modeling, and content analysis of educational materials. The large-scale collection of Uzbek-language academic content makes it especially useful for developing educational technology applications and studying pedagogical approaches in Uzbek-language instruction. The dataset's monolingual nature provides a focused corpus for understanding technical and scientific terminology in Uzbek educational contexts.

Reacted to m-ric's post with 🔥 21 days ago

> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

⚡️ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

⚙️ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
⚡️ Generates 20 FPS, vs 0.2 FPS for other DiT-based video models
‣ The specialized Sohu hardware developed by Etched can handle 10x more players than an H100

🎮 Features real game mechanics
‣ Movement, jumping, item management
‣ Physics and lighting
‣ Procedurally generated worlds

⚠️ Current limitations
‣ Blurry graphics at a distance
‣ Objects sometimes change appearance
‣ Memory issues in long sessions

Try it yourself, the playable demo is impressive! 👉 https://oasis.decart.ai/welcome
Code 👉 https://github.com/etched-ai/open-oasis
Read it in full 👉 https://oasis-model.github.io/
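
The speed figures quoted in this post can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and the inputs are the numbers stated in the post (20 FPS for Oasis, one frame every 10-20 seconds or 0.2 FPS for other models), not independent measurements.

```python
from fractions import Fraction

# Frames per second for Oasis, as quoted in the post.
OASIS_FPS = 20


def speedup_from_seconds_per_frame(seconds_per_frame):
    """Speedup over a model that needs `seconds_per_frame` seconds per frame."""
    baseline_fps = Fraction(1, seconds_per_frame)
    return OASIS_FPS / baseline_fps


def speedup_from_fps(baseline_fps):
    """Speedup over a model running at `baseline_fps` (given as a string)."""
    return OASIS_FPS / Fraction(baseline_fps)


# One frame every 10 to 20 seconds (the range given in the post).
fast_end = speedup_from_seconds_per_frame(10)
slow_end = speedup_from_seconds_per_frame(20)

# The 0.2 FPS figure from the "Key insights" list.
from_key_insights = speedup_from_fps("0.2")

print(fast_end, slow_end, from_key_insights)
```

Under these inputs, "200 times faster" matches the 10-seconds-per-frame end of the quoted range, the 20-seconds end gives 400x, and the 0.2 FPS figure gives 100x, so the post's numbers are best read as rough orders of magnitude.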

Reacted to Muhammadreza's post with ❤️ 21 days ago

Hey guys.
This is my first post here on huggingface. I'm glad to be a part of this amazing community!

2 replies

posted an update 24 days ago

🎓 Introducing PPT4Web Educational Materials Dataset - nyuuzyou/ppt4web

Dataset highlights:
- 182,405 presentations from ppt4web.ru, a platform for storing and viewing presentations covering a wide range of educational materials
- Primarily in Russian, with content in English, Kazakh, Ukrainian, and Belarusian
- Each entry includes: URL, title, download URL, and filepath
- Contains original PPTX files (converted from PPT for consistency) in addition to metadata
- Data covers a broad spectrum of educational topics and subjects
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in multiple languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings, providing insights into the diverse range of subjects and teaching approaches.

posted an update about 1 month ago

🌐 Introducing Websim.ai User Projects Dataset - nyuuzyou/websim

Dataset highlights:
- 137,452 user projects from Websim.ai, a service for creating small sites using Large Language Models (LLMs)
- Primarily in English, with potential for multilingual content in generated websites
- Each entry includes: project metadata, user information, and generated HTML content
- Contains detailed information about project revisions, site generation, and user interactions
- Data covers a wide range of user-generated website projects created through AI assistance
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing AI-assisted web development trends, studying user behavior in LLM-powered creative tools, and exploring the capabilities of language models in web design.

posted an update about 1 month ago

🎓 Introducing Ukr-lit.com.ua Presentations Dataset - nyuuzyou/ukr-lit

Dataset highlights:
- 18,001 presentations from ukr-lit.com.ua, a platform for storing and viewing presentations covering a wide range of subjects in Ukrainian school education
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a broad spectrum of educational topics and subjects taught in Ukrainian schools
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in Ukrainian school education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings, providing insights into the diverse range of subjects and teaching approaches in the Ukrainian school system.

Reacted to erinys's post with 🚀 about 1 month ago

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080 !

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

Reacted to davidberenstein1957's post with ➕ about 1 month ago

You can now build a custom text classifier without days of human labeling!

👍 LLMs work reasonably well as text classifiers.
👎 They are expensive to run at scale and their performance drops in specialized domains.

👍 Purpose-built classifiers have low latency and can potentially run on CPU.
👎 They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
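
As a rough illustration of the workflow described in this post, here is a minimal, dependency-free sketch. Everything in it is an assumption made for the example: `llm_label` is a hypothetical keyword rule standing in for a real LLM call, the human corrections are a plain dictionary, and `TinyNaiveBayes` is a toy bag-of-words classifier, not the tooling used in the linked blog.

```python
import math
from collections import Counter, defaultdict


def llm_label(text):
    """Hypothetical stand-in for an LLM labeling call (simple keyword rule)."""
    lowered = text.lower()
    return "positive" if "good" in lowered or "great" in lowered else "negative"


def merge_labels(texts, human_fixes):
    """Use a human correction when one exists, otherwise the LLM suggestion."""
    return [(t, human_fixes.get(t, llm_label(t))) for t in texts]


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing."""

    def fit(self, labeled):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for text, label in labeled:
            self.class_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


texts = [
    "great battery and good screen",
    "good value overall",
    "terrible support experience",
    "not good at all",  # the keyword rule mislabels this one
    "slow and terrible",
]
# A human reviewer corrects the automatic label for one example.
human_fixes = {"not good at all": "negative"}

train = merge_labels(texts, human_fixes)
model = TinyNaiveBayes().fit(train)
print(model.predict("terrible and slow"), model.predict("great screen"))
```

The point of the sketch is the division of labor: the automatic labeler covers most examples cheaply, the human only fixes the ones it gets wrong, and the small model trained on the merged labels is what runs at inference time.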

posted an update about 1 month ago

Today I found out about utter-project/EuroLLM-1.7B-Instruct, and unexpectedly it is really good. I think it's a very underrated model. Give it a try: nyuuzyou/EuroLLM-1.7B-Instruct

replied to clem's post about 1 month ago

So why isn't OpenAI on this list? Are they not supporting open AI? ¯_(ツ)_/¯

posted an update about 1 month ago

🎙 Introducing LiveATC Recordings (Partial 2024-08-26) Dataset - nyuuzyou/liveatc

Dataset highlights:

- 21,172 air traffic control audio recordings from LiveATC.net for August 26, 2024
- Multilingual content, primarily in English with potential for other languages
- Each entry includes: audio file, ICAO airport code, facility type, date, and time
- Contains original MP3 files stored in .tar.zst archives, organized by ICAO airport code
- Data covers various airports and ATC facilities worldwide
- Subject to LiveATC.net's Terms of Use for personal, non-commercial use only

The dataset can be used for audio classification, automatic speech recognition, and analysis of air traffic control communications. The inclusion of recordings from multiple airports allows for comparative analysis across different locations and facility types.

posted an update about 1 month ago

🎓 Introducing Svitppt.com.ua Presentations Dataset - nyuuzyou/svitppt

Dataset highlights:
- 18,001 presentations from svitppt.com.ua, a platform for storing and viewing presentations for Ukrainian school students
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Ukrainian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings.

Reacted to takeraparterer's post with 🚀 about 1 month ago

Check this out: I trained an AI on Hugging Face posts! All of these are AI-generated:
----------
Hello!

I'm excited to share that my colleague @felipeebert and I have released the largest Spanish LLM benchmark to date.

We've developed the Spanish LLM Evaluation Benchmark (SLAB), a set of benchmarks designed to evaluate the ability of language models to understand, generate and translate in Spanish.

SLAB includes five different benchmarks:
- Sentiment Analysis: evaluate models' ability to detect and describe sentiment in natural language
- Fact Checking: evaluate models' ability to detect and refute factual errors in text
- Question Answering: evaluate models' ability to answer questions in Spanish
- Open-ended Questions: evaluate models' ability to generate coherent responses in Spanish
- Translation: evaluate models' ability to translate in Spanish

SLAB is aligned with the latest Spanish LLM industry developments and includes the most recent models available on the market. We aim to keep our benchmarks up-to-date and relevant to the Spanish language ecosystem.

SLAB is available at: https://huggingface.co/datasets/argilla/SLAB.

If you would like to collaborate on building additional Spanish LLM benchmarks, let's discuss in the comments.

🔗 SLAB Blog Post: https://argilla.com/blog/slab
----------
Hello everyone,

I'm thrilled to announce the release of

https://huggingface.co/01-AI/01AI-GPT-4o -

A new family of models that brings the power of transformer AI to the masses.

This model is designed to be accessible and easy to use, while still offering high-quality results.

Key features:
- Small model size: only 23M parameters
- Supports text generation, image generation, and text-to-image tasks
- Data-efficient training with a lightweight tokenizer
- Optimized for efficient on-device usage
- Uses the powerful transformer architecture to deliver high-quality results

Excited to see what you all think!

https://huggingface.co/01-AI/01AI-GPT-4o

4 replies

Reacted to huggingface0's post with 🤯 about 1 month ago

1+2=3

2 replies

posted an update about 1 month ago

🎓 Introducing Bigslide.ru Presentations Dataset - nyuuzyou/bigslide

Dataset highlights:
- 50,872 presentations from bigslide.ru, a platform for storing and viewing presentations for school students
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Russian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Russian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings.