Adina Yakefu

AdinaY

AI & ML interests

None yet

Recent Activity

View all activity

Articles

Organizations

AdinaY's activity

Reacted to lin-tan's post with 🔥 1 day ago
view post
Post
708
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard https://lt-asset.github.io/REPOCOD/
- Dataset: lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are
- General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8).

Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)

Introducing hashtag #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod with:
- 67 repository-level, 67 file-level, and 66 self-contains tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: lt-asset/REPOCOD_Lite

#LLM4code #LLM #CodeGeneration #Security
  • 1 reply
·
Reacted to merve's post with 🔥 1 day ago
view post
Post
1280
What a week! A recap for everything you missed ❄️
merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI
released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2

LLMs 🦙
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs

Image Generation 🖼️
> Black Forest Labs released Flux 1. tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations

Lastly Hugging Face released a new library Observers: a lightweight SDK for monitoring interactions with AI APIs and easily store and browse them 📚
$ pip install observers
Reacted to m-ric's post with ❤️ 1 day ago
view post
Post
618
Made a new app to visualize the LLM race ⇒ 𝗡𝗼 𝗘𝘂𝗿𝗼𝗽𝗲𝗮𝗻 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 𝗶𝗻 𝘁𝗵𝗲 𝘁𝗼𝗽 𝟭𝟬 🇪🇺❌

See the app here 👉 m-ric/llm-race-to-the-top

I've adapted an app by @andrewrreed that tracks progress of LLMs ( andrewrreed/closed-vs-open-arena-elo), on the Chatbot Arena leaderboard, to compare companies from different countries.

The outcome is quite sad, as a Frenchman and European.

The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.

American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬

⚠️ Caution: This Chatbot Arena ELO ranking is not the most accurate, especially at high scores like this, because LLM makers can game it to some extent.
Reacted to THUdyh's post with 🔥👀 1 day ago
posted an update 1 day ago