Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face ๐Ÿค— LLMs, Agents, RAG, Multimodal.

Articles

Organizations

Posts 15

view post
Post
385
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we went to beat the ๐—š๐—”๐—œ๐—” ๐—น๐—ฒ๐—ฎ๐—ฑ๐—ฒ๐—ฟ๐—ฏ๐—ผ๐—ฎ๐—ฟ๐—ฑ, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

๐Ÿ› ๏ธ ๐—ฅ๐—ฒ๐—พ๐˜‚๐—ถ๐—ฟ๐—ฒ ๐˜‚๐˜€๐—ถ๐—ป๐—ด ๐˜๐—ผ๐—ผ๐—น๐˜€, at least a web browser
๐Ÿ”ข ๐—ฅ๐—ถ๐—ด๐—ผ๐—ฟ๐—ผ๐˜‚๐˜€ ๐—น๐—ผ๐—ด๐—ถ๐—ฐ, many questions having strong math aspects
๐Ÿ–ผ๏ธ ๐— ๐˜‚๐—น๐˜๐—ถ๐—บ๐—ผ๐—ฑ๐—ฎ๐—น, the agent had to handle all file types: ๐Ÿ”Š, ๐Ÿ–ผ๏ธ, ๐ŸŽฌ...
๐Ÿ‘ฃ ๐— ๐˜‚๐—น๐˜๐—ถ-๐˜€๐˜๐—ฒ๐—ฝ, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard ๐Ÿ˜ณ
> "In NASAโ€™s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(๐˜ฏ๐˜ฐ ๐˜ง๐˜ช๐˜ญ๐˜ฆ ๐˜ข๐˜ต๐˜ต๐˜ข๐˜ค๐˜ฉ๐˜ฆ๐˜ฅ ๐˜ฐ๐˜ง ๐˜ค๐˜ฐ๐˜ถ๐˜ณ๐˜ด๐˜ฆ, ๐˜ต๐˜ฉ๐˜ฆ ๐˜ข๐˜จ๐˜ฆ๐˜ฏ๐˜ต ๐˜ฉ๐˜ข๐˜ด ๐˜ต๐˜ฐ ๐˜ง๐˜ช๐˜ฏ๐˜ฅ ๐˜ข๐˜ญ๐˜ญ ๐˜ต๐˜ฉ๐˜ฆ ๐˜ช๐˜ฏ๐˜ง๐˜ฐ)

โžก๏ธ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

๐™‚๐™ค ๐™˜๐™๐™š๐™˜๐™  ๐™ค๐™ช๐™ฉ ๐™ฉ๐™๐™š ๐™ก๐™š๐™–๐™™๐™š๐™ง๐™—๐™ค๐™–๐™ง๐™™ ๐Ÿ‘‰ gaia-benchmark/leaderboard
view post
Post
2971
๐Ÿ’ฐ ๐—š๐—ฒ๐˜ ๐˜๐—ต๐—ฒ ๐—ฝ๐—ฟ๐—ถ๐—ฐ๐—ฒ ๐—ผ๐—ณ ๐—ฎ๐—ป๐˜† ๐—Ÿ๐—Ÿ๐—  ๐—”๐—ฃ๐—œ ๐—ฟ๐—ฒ๐—พ๐˜‚๐—ฒ๐˜€๐˜ โ‡’ ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐—ฐ๐—ผ๐˜€๐˜

I've just found out about ๐™ฐ๐š๐šŽ๐š—๐š๐™พ๐š™๐šœ-๐™ฐ๐™ธ/๐š๐š˜๐š”๐šŽ๐š—๐šŒ๐š˜๐šœ๐š (https://github.com/AgentOps-AI/tokencost).
๐—ง๐—ต๐—ถ๐˜€ ๐—น๐—ถ๐—ฏ๐—ฟ๐—ฎ๐—ฟ๐˜† ๐—ด๐—ถ๐˜ƒ๐—ฒ๐˜€ ๐˜†๐—ผ๐˜‚ ๐˜๐—ต๐—ฒ ๐—ฝ๐—ฟ๐—ถ๐—ฐ๐—ฒ ๐—ผ๐—ณ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฐ๐—ฎ๐—น๐—น๐˜€ ๐˜๐—ผ ๐—ฎ๐—ป๐˜† ๐—Ÿ๐—Ÿ๐—  ๐—”๐—ฃ๐—œ: OpenAI, Anthropic, Mistral, AWS or Databricks...

For any model, you can use as input either string prompts or messages, and get as outputs either the price or token count.

Congrats to the AgentOps-AI team: this will be very useful when trying to get a ballpark estimate of a project's price, to compare APIs, or for precise monitoring of usage!

โœจ Daily reminder: ๐—ฟ๐˜‚๐—ป๐—ป๐—ถ๐—ป๐—ด ๐—ฎ๐—ป ๐—”๐Ÿญ๐Ÿฌ๐Ÿฌ ๐—ฐ๐—ผ๐˜€๐˜๐˜€ ๐˜†๐—ผ๐˜‚ ๐—ฒ๐˜…๐—ฎ๐—ฐ๐˜๐—น๐˜† $๐Ÿฌ.๐Ÿฌ๐Ÿฌ/๐—ต๐—ผ๐˜‚๐—ฟ (or 0.00โ‚ฌ in current exchange rates) on a HF space with ZeroGPU!
Learn more on ZeroGPU ๐Ÿ‘‰ https://www.datacenterdynamics.com/en/news/hugging-face-launches-zerogpu-project-to-democratize-ai-gives-away-10-million-worth-of-compute/

models

None public yet