12 36 25

Mohammed Hamdy

mmhamdy

AI & ML interests

NLP | Reinforcement Learning

Recent Activity

upvoted a collection 2 days ago

Models for dataset curation

liked a dataset 3 days ago

HuggingFaceTB/smoltalk

Reacted to andito's post with 👀 4 days ago

Hugging face presents FineVideo 🎥! Unlocking the next generation of Video understanding 🚀 🤯3400 hours of annotated Creative Common videos with rich character descriptions, scene splits, mood, and content descriptions per scene as well as QA pairs. 🔥 @mfarre processed over 2M videos of Youtube-CC to make this incredibly powerful selection. Very psyched to fine-tune idefics on this dataset. ⚡️ Explore the videos: https://huggingface.co/spaces/HuggingFaceFV/FineVideo-Explorer

View all activity

Organizations

mmhamdy's activity

upvoted a collection 2 days ago

Models for dataset curation

Collection

7 items • Updated 2 days ago • 17

liked a dataset 3 days ago

HuggingFaceTB/smoltalk

Viewer • Updated 3 days ago • 2.2M • 469 • 100

Reacted to andito's post with 👀 4 days ago

Post

1036

Hugging face presents FineVideo 🎥! Unlocking the next generation of Video understanding 🚀

🤯3400 hours of annotated Creative Common videos with rich character descriptions, scene splits, mood, and content descriptions per scene as well as QA pairs.
🔥
@mfarre processed over 2M videos of Youtube-CC to make this incredibly powerful selection.

Very psyched to fine-tune idefics on this dataset. ⚡️
Explore the videos: HuggingFaceFV/FineVideo-Explorer

Reacted to cfahlgren1's post with ❤️ 4 days ago

Post

2850

You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned

1 reply

Reacted to davanstrien's post with ❤️ 5 days ago

Post

1237

huggingface.co/DIBT is dead!

Long live https://huggingface.co/data-is-better-together!

We're working on some very cool projects so we're doing a bit of tidying of the Data is Better Together Hub org 🤓

Reacted to LukeNeumann's post with 🔥 8 days ago

Post

1840

Hello Hugging Face community!

I wanted to introduce myself and my company @Overlaiapp . We are a collective of filmmakers, photographers, and AI engineers working on high resolution (8K+) training data.

We plan to share a lot of our datasets with the community and are kicking things off with two curated datasets:

- Overlaiai/OregonCoastin4K

- Overlaiai/SubArcticPolarBear

Overlai.ai Dataset Features

🎥 Oversampled: Every clip is captured in stunning 8K resolution, delivering rich detail ideal for fine tuning scenic landscapes and ocean dynamics.

📸 Variance: Includes close-up details, slow-motion footage of crashing waves, sweeping landscapes, and wildlife shots.

📋 Detailed Metadata: Every clip is paired with structured metadata, including creative descriptions, precise camera movements, lens information, field of view calculations, and shot settings, ensuring AI models can fully understand and replicate real-world cinematography with accuracy.

⚙️ Consistency: Re-thinking training data at the point of capture by "overshooting" a subject, enabling models to learn more nuanced relationships and views across scenes.

🌅 Light: Shot during early morning and sunset light for optimal color contrast and dynamic range, maximizing visual quality for color and lighting-sensitive tasks.

🔍 Curation: Curated specifically for machine learning, providing clean, high-quality data for next generation model training.

upvoted 2 collections 13 days ago

Florence

Collection

9 items • Updated Jul 11 • 160

Biomedical

Collection

Models for biomedical research applications, such as radiology report generation and biomedical language understanding. • 9 items • Updated 23 days ago • 4

upvoted a paper 13 days ago

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Paper • 2411.04872 • Published 17 days ago • 4

upvoted a collection 24 days ago

SmolLM2

Collection

State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 10 items • Updated 3 days ago • 177

Reacted to m-ric's post with 🚀 about 2 months ago

Post

2264

💥 𝐋-𝐌𝐮𝐥: 𝐀𝐝𝐝𝐢𝐭𝐢𝐨𝐧-𝐎𝐧𝐥𝐲 𝐌𝐮𝐥𝐭𝐢𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐜𝐚𝐧 𝐬𝐥𝐚𝐬𝐡 𝐜𝐨𝐦𝐩𝐮𝐭𝐚𝐭𝐢𝐨𝐧𝐚𝐥 𝐜𝐨𝐬𝐭𝐬 𝐛𝐲 𝟖𝟎%!

Microsoft researchers dropped a groundbreaking technique that could slash the energy use in transformer computations : their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.

💡 Quick reminder on how floats are coded on 8 bits (FP8):
In the e4m3 FP8 standard, you encode a number as:
Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits)
Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625)
Calculation: you add one to the mantissa, and multiply it by 2 power (the exponent - a bias term which is 7 for e4m3):

➡️ You get (1 + 0.625) × 2^(8-7) = 3.25

Now back to the paper. 𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:

⚡️ Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, multiplication (3.7 pJ) uses 37x more energy than addition (0.1 pJ)!

🧮 Traditional floating-point multiplication go like (noting xm the mantissa and xe the exponent): Mul(x,y) = (1 + xm) · 2^xe · (1 + ym) · 2^ye = (1 + xm + ym + xm · ym) · 2^(xe+ye)

💡 L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^-l(m)) · 2^(xe+ye), eliminating the costly xm · ym term

🔧 l(m) term is adaptively set based on mantissa size for optimal accuracy

📊 Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision

💬 Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."

This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but it’s a really exciting path for more sustainable AI! 🌱

Read the paper here 👉 Addition is All You Need for Energy-efficient Language Models (2410.00907)

posted an update about 2 months ago

Post

1832

🔗 Evaluating Long Context #1: Long Range Arena (LRA)

Accurately evaluating how well language models handle long contexts is crucial, but it's also quite challenging to do well. In this series of posts, we're going to examine the various benchmarks that were proposed to assess long context understanding, starting with Long Range Arens (LRA)

Introduced in 2020, Long Range Arens (LRA) is one of the earliest benchmarks designed to tackle the challenge of long context evaluation.

📌 Key Features of LRA

1️⃣ Diverse Tasks: The LRA benchmark consists of a suite of tasks designed to evaluate model performance on long sequences ranging from 1,000 to 16,000 tokens. These tasks encompass different data types and modalities: Text, Natural and Synthetic Images, and Mathematical Expressions.

2️⃣ Synthetic and Real-world Tasks: LRA is comprised of both synthetic probing tasks and real-world tasks.

3️⃣ Open-Source and Extensible: Implemented in Python using Jax and Flax, the LRA benchmark code is publicly available, making it easy to extend.

📌 Tasks

1️⃣ Long ListOps

2️⃣ Byte-level Text Classification and Document Retrieval

3️⃣ Image Classification

4️⃣ Pathfinder and Pathfinder-X (Long-range spatial dependency)

👨‍💻 Long Range Arena (LRA) Github Repository: https://github.com/google-research/long-range-arena

📄 Long Range Arena (LRA) paper: Long Range Arena: A Benchmark for Efficient Transformers (2011.04006)

liked a dataset 2 months ago

KbsdJames/Omni-MATH

Viewer • Updated Oct 12 • 4.43k • 585 • 58

Reacted to davidberenstein1957's post with ❤️🔥🚀 2 months ago

Post

2146

🎉 Exciting News: Argilla 2.2.0 is Here! 🚀

We're thrilled to announce the release of Argilla 2.2.0, packed with powerful new features to enhance your data annotation and LLM workflow:

🗨️ ChatField: Work with text conversations natively in Argilla. Perfect for building datasets for conversational LLMs!
⚙️ Adjustable Task Distribution: Modify settings on the fly and automatically recalculate completed and pending records.
📊 Progress Tracking: Monitor annotation progress directly from the SDK, including user-specific metrics.
🧠 Automatic Settings Inference: Importing datasets from Hugging Face Hub just got easier with automatic settings detection.
📋 Task Templates: Jump-start your projects with pre-built templates for common dataset types.
🔧 Background Jobs Support: Improved performance for long-running tasks (requires Redis).

Upgrade now and supercharge your data workflows!

Check out our full changelog for more details: https://github.com/argilla-io/argilla/compare/v2.1.0...v2.2.0

upvoted a collection 2 months ago

LLM Reasoning Papers

Collection

Papers to improve reasoning capabilities of LLMs • 15 items • Updated 22 days ago • 76

upvoted a collection 3 months ago

"Physics of Language Models" series

Collection

6 items • Updated Aug 30 • 37

upvoted a paper 3 months ago

inftyBench: Extending Long Context Evaluation Beyond 100K Tokens

Paper • 2402.13718 • Published Feb 21 • 1

commented a paper 3 months ago

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

Paper • 2402.13718 • Published Feb 21 • 1 •