Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard: https://lt-asset.github.io/REPOCOD/
- Dataset: lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are:
- General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8).
Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with:
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)
Introducing #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod, with:
- 67 repository-level, 67 file-level, and 66 self-contained tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: lt-asset/REPOCOD_Lite
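For readers unfamiliar with the pass@1 numbers quoted above, here is a minimal sketch of the commonly used unbiased pass@k estimator. The function names and the `(n, c)` input format are illustrative assumptions, not taken from the RepoCod evaluation scripts; with k=1 the estimate reduces to the fraction of generated samples that pass all tests.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass every test
    k: budget of attempts being scored (1 for pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when fewer than k samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int = 1) -> float:
    """Average pass@k over tasks; `results` is a list of (n, c) pairs (hypothetical format)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Toy data: four tasks, one sample each, only the first one passes.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))
```

A benchmark-level score below 30% therefore means that, on average, fewer than 3 in 10 single attempts produce a function that passes the task's full test suite.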
What a week! A recap of everything you missed ❄️ merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of the o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2
LLMs 🦙
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: the synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs
Image Generation 🖼️
> Black Forest Labs released FLUX.1 Tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations
Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
The outcome is quite sad, as a Frenchman and European.
The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.
American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬
⚠️ Caution: This Chatbot Arena Elo ranking is not the most accurate, especially at high scores like this, because LLM makers can game it to some extent.
✨Fine-tuned with CoT data (open-source + synthetic).
✨Expands solution space with MCTS, guided by model confidence.
✨Novel reasoning strategies & self-reflection enhance complex problem-solving.
✨Pioneers LRM in multilingual machine translation.
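The post does not say how "model confidence" is computed, so the following is only a sketch of one plausible scheme and not the authors' method. It assumes each generated token's log-probability is normalized against its top alternatives, and that the per-token scores are averaged into a reward for one MCTS rollout. All function names and the input format are hypothetical.

```python
import math


def token_confidence(chosen_logprob: float, top_logprobs: list[float]) -> float:
    """Confidence of one token: its probability renormalized over the top
    candidates (assumption: `top_logprobs` includes the chosen token)."""
    denom = sum(math.exp(lp) for lp in top_logprobs)
    return math.exp(chosen_logprob) / denom


def rollout_confidence(steps: list[tuple[float, list[float]]]) -> float:
    """Average token confidence over a rollout; usable as a search reward.

    steps: one (chosen_logprob, top_logprobs) pair per generated token.
    """
    scores = [token_confidence(chosen, top) for chosen, top in steps]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy rollout of two tokens with made-up log-probabilities.
    steps = [
        (math.log(0.5), [math.log(0.5), math.log(0.25), math.log(0.25)]),
        (math.log(0.9), [math.log(0.9), math.log(0.1)]),
    ]
    print(rollout_confidence(steps))
```

In a tree search, a score like this could rank candidate reasoning paths so that branches the model is more certain about are expanded first.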