David Berenstein

davidberenstein1957

AI & ML interests

Everything NLP and knowledge graphs

Articles

Organizations

davidberenstein1957's activity

replied to m-ric's post 1 day ago
reacted to m-ric's post with 👀❤️🔥 1 day ago
๐—ง๐—ต๐—ฒ ๐—ป๐—ฒ๐˜…๐˜ ๐—ฏ๐—ถ๐—ด ๐˜€๐—ผ๐—ฐ๐—ถ๐—ฎ๐—น ๐—ป๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ ๐—ถ๐˜€ ๐—ป๐—ผ๐˜ ๐Ÿฆ‹, ๐—ถ๐˜'๐˜€ ๐—›๐˜‚๐—ฏ ๐—ฃ๐—ผ๐˜€๐˜๐˜€! [INSERT STONKS MEME WITH LASER EYES]

See below: I got 105k impressions since regularly posting Hub Posts, coming close to my 275k on Twitter!

โš™๏ธ Computed with the great dataset maxiw/hf-posts
โš™๏ธ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!

cc @merve, who's far ahead of me
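
(For reference, a query in this spirit can be run with DuckDB directly against the Hub. The sketch below is illustrative only: the Parquet path and the column names (`author`, `impressions`) are assumptions about the maxiw/hf-posts schema, not its documented layout.)

```python
# Illustrative sketch only: total impressions per author from maxiw/hf-posts.
# The schema (an `author` struct with a `name` field, an `impressions` column)
# is assumed for illustration; adjust to the dataset's real columns.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # the httpfs extension enables hf:// paths
con.execute("LOAD httpfs;")

df = con.execute("""
    SELECT
        author.name AS author_name,          -- dot notation reads a struct/dict field
        SUM(impressions) AS total_impressions
    FROM 'hf://datasets/maxiw/hf-posts/**/*.parquet'
    GROUP BY author_name
    ORDER BY total_impressions DESC
    LIMIT 10
""").df()
print(df)
```

If the dataset exposes Parquet files, DuckDB can scan them remotely like this without downloading the whole dataset first.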
reacted to their post with 👀 13 days ago
reacted to their post with ➕❤️ 13 days ago
posted an update 13 days ago
Import any dataset from the Hub and configure your labeling tasks without needing any code!

Really excited about extending the Hugging Face Hub integration with many more streamlined features and workflows. We'd love to hear your feedback and ideas, so don't be shy and reach out 🫶🏽

https://huggingface.co/blog/argilla-ui-hub
reacted to their post with 👀🚀🤗 13 days ago
posted an update 13 days ago
Vector Search (most) datasets on the Hugging Face Hub 🔦

Powered by: Polars, DuckDB, Gradio and model2vec (lightning-fast embeddings by Stéphan Tulkens).

Should work fast enough for datasets up to 100K rows.

davidberenstein1957/vectorsearch-hub-datasets
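
The Space's code isn't reproduced here, but a minimal sketch of the underlying idea looks roughly like this: embed a text column with model2vec and run a brute-force cosine-similarity lookup. The dataset, column name and checkpoint below are placeholders, not what the Space actually uses.

```python
# Minimal sketch (not the Space's implementation): model2vec embeddings +
# brute-force cosine similarity over a Hub dataset. Dataset, column and
# checkpoint names are placeholders.
import numpy as np
from datasets import load_dataset
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")  # example static-embedding model

ds = load_dataset("ag_news", split="train[:1000]")  # placeholder dataset with a "text" column
corpus = ds["text"]
corpus_emb = model.encode(corpus)                   # (n_rows, dim) numpy array

def search(query: str, k: int = 5):
    q = model.encode([query])[0]
    sims = corpus_emb @ q / (
        np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(q) + 1e-9
    )
    top = np.argsort(-sims)[:k]
    return [(float(sims[i]), corpus[i]) for i in top]

print(search("stock markets rally after earnings"))
```

A brute-force scan like this is plausible for corpora in the ~100K-row range mentioned above; much larger datasets would typically call for an approximate-nearest-neighbour index.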
posted an update 19 days ago
โšก๏ธ LLMs do a good job at NER, but don't you want to do learn how to do more with less?

Go from 🐢 -> 🐇

If you want a small model to perform well on your problem, you need to fine-tune it.

Bootstrap with a teacher model.

Correct potential mistakes to get high-quality data.

Fine-tune your student model.

End up with a model that is more accurate and more efficient.

Free signup: https://lu.ma/zx2t7irs
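
As a rough illustration of the "bootstrap with a teacher model" step (GLiNER is used here purely as an example teacher; the session may use a different stack), pre-annotations can be generated like this and then corrected by a human before fine-tuning the student:

```python
# Illustrative only: use a zero-shot NER teacher (here GLiNER, an assumed
# example) to pre-annotate texts; humans then correct these suggestions
# before a small student model is fine-tuned on the cleaned spans.
from gliner import GLiNER

teacher = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")  # example checkpoint
labels = ["person", "organization", "location"]

texts = ["Tim Cook announced new products at Apple headquarters in Cupertino."]

pre_annotations = []
for text in texts:
    entities = teacher.predict_entities(text, labels, threshold=0.5)
    # keep spans as suggestions: (surface form, label, char start, char end)
    pre_annotations.append([(e["text"], e["label"], e["start"], e["end"]) for e in entities])

print(pre_annotations[0])
```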
posted an update about 1 month ago
You can now build a custom text classifier without days of human labeling!

๐Ÿ‘ LLMs work reasonably well as text classifiers.
๐Ÿ‘Ž They are expensive to run at scale and their performance drops in specialized domains.

๐Ÿ‘ Purpose-built classifiers have low latency and can potentially run on CPU.
๐Ÿ‘Ž They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
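
The blog walks through the full workflow; as a hedged sketch of the first half (LLM-generated labels that humans then review), something like the following works with a hosted chat model. The model id, label set and prompt are illustrative rather than the blog's exact setup, and an HF token with access to the model is assumed.

```python
# Sketch of the "LLM as labeler" step: ask a hosted chat model for a label per
# text, store the answers as suggestions for human review, then train a small
# classifier on the curated result. Model id and prompt are illustrative.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")  # example model, may require access

def llm_label(text: str) -> str:
    prompt = (
        "Classify the following customer message as 'positive' or 'negative'. "
        "Answer with a single word.\n\n" + text
    )
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return out.choices[0].message.content.strip().lower()

print(llm_label("The delivery was late and the box arrived damaged."))
```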
reacted to nroggendorff's post with 😎 about 1 month ago
100 followers? When did that happen?
reacted to m-ric's post with 👀 about 1 month ago
By far the coolest release of the day!
> The Open LLM Leaderboard, the most comprehensive suite for comparing open LLMs on many benchmarks, just released a comparator tool that lets you dig into the details of the differences between any two models.

Here's me checking how the new Llama-3.1-Nemotron-70B that we've heard so much about compares to the original Llama-3.1-70B. 🤔🔎

Try it out here 👉 open-llm-leaderboard/comparator
posted an update about 1 month ago
The Synthetic Data Generator now directly integrates with Argilla, so you can generate and curate your own high-quality datasets from pure natural language!

Up next -> dataset generation for text classification.
Other suggestions? Let us know.

Space: argilla/synthetic-data-generator
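
To give a feel for the Argilla side of that integration, here is a minimal sketch (assuming an Argilla instance, e.g. on a Hugging Face Space, and a generated dataset already on the Hub; the URL, API key, names and label set are placeholders): push the generated examples into an Argilla dataset so they can be reviewed before training.

```python
# Minimal sketch: log generated examples into Argilla for human curation.
# URL, API key, dataset names, fields and labels are all placeholders.
import argilla as rg
from datasets import load_dataset

client = rg.Argilla(api_url="https://<your-argilla-space>.hf.space", api_key="<api-key>")

settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="label", labels=["good", "bad"])],
)
dataset = rg.Dataset(name="synthetic-data-review", settings=settings, client=client)
dataset.create()

generated = load_dataset("<user>/<generated-dataset>", split="train")  # placeholder
dataset.records.log([{"text": row["text"]} for row in generated])
```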


posted an update about 1 month ago
Don't use an LLM when you can use a much cheaper model.

The problem is that no one tells you how to actually do it.

Just picking a pre-trained model (e.g., BERT) and throwing it at your problem won't work!

If you want a small model to perform well on your problem, you need to fine-tune it.

And to fine-tune it, you need data.

The good news is that you don't need a lot of data, just high-quality data for your specific problem.

In the latest livestream, I showed you guys how to get started with Argilla on the Hub! Hope to see you at the next one.

https://www.youtube.com/watch?v=BEe7shiG3rY
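
As a sketch of the "much cheaper model" route (placeholders throughout: the dataset stands in for your curated examples, and the base checkpoint is just one common choice), a small SetFit classifier can be fine-tuned on a handful of curated examples and then served without any LLM calls:

```python
# Sketch: fine-tune a small SetFit classifier on curated examples so inference
# no longer needs an LLM. The dataset and base checkpoint are placeholders.
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# stands in for your own curated dataset with "text" and "label" columns
train_ds = load_dataset("imdb", split="train").shuffle(seed=0).select(range(100))

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

print(model.predict(["a gripping, well-acted thriller", "tedious and overlong"]))
```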
posted an update about 1 month ago
On Thursday 10 October at 17:00 CEST, I will show a good way to get started with a text classification project on the Hugging Face Hub with Argilla and SetFit.

Sign up here: https://lu.ma/31mecp34
reacted to their post with 🔥 about 2 months ago