Crispin Almodovar (calmodovar)
1 follower · 3 following

AI & ML interests: NLP, log anomaly detection, cyber intelligence
Recent Activity

- Upvoted an article 22 days ago: "Visually Multilingual: Introducing mcdse-2b"
- Reacted with 👀 to singhsidhukuldeep's post about 1 month ago:
While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse attention is all you need." The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It applies GroupNorm to each attention head independently.
- It is compatible with FlashAttention for efficient computation.

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show the DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters. This innovative architecture could be a game-changer for the next generation of LLMs.

What are your thoughts on the DIFF Transformer's potential impact?
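The core subtraction described in the "How?" list above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the function and weight names are made up for the example, λ is passed as a fixed scalar rather than learned, and the per-head GroupNorm and final rescaling are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention sketch for one head.

    Two separate softmax attention maps are computed from two
    sets of query/key projections; the second map, scaled by the
    balancing scalar lam, is subtracted from the first to cancel
    common-mode "noise" before attending over the values.
    """
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    A = A1 - lam * A2  # differential (noise-canceling) attention map
    return A @ (X @ Wv)

# Toy usage: 4 tokens with hidden size 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Ws = [rng.standard_normal((8, 8)) for _ in range(5)]
out = diff_attention(X, *Ws, lam=0.5)
```

Since each softmax map's rows sum to 1, the rows of the differential map sum to 1 − λ, which is one reason the real architecture rescales the output and normalizes each head.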
calmodovar's activity
Liked 5 models (7 months ago):
- tenyx/Llama3-TenyxChat-70B · Text Generation · Updated May 8 · 3.16k downloads · 63 likes
- NousResearch/Hermes-2-Pro-Llama-3-8B · Text Generation · Updated Sep 14 · 32.7k downloads · 407 likes
- NousResearch/Genstruct-7B · Text Generation · Updated Mar 7 · 293 downloads · 371 likes
- gradientai/Llama-3-8B-Instruct-262k · Text Generation · Updated 26 days ago · 9.19k downloads · 252 likes
- gradientai/Llama-3-8B-Instruct-Gradient-1048k · Text Generation · Updated 26 days ago · 19.4k downloads · 674 likes
Liked a dataset (10 months ago):

- LDJnr/Capybara · Viewer · Updated Jun 7 · 16k rows · 422 downloads · 226 likes