Can language models replace developers? #RepoCod says "Not Yet", because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks (see the pass@1 sketch below).
- Leaderboard: https://lt-asset.github.io/REPOCOD/
- Dataset: lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks:
- are general code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues;
- come with 2.6X more tests per task (313.5 on average vs. SWE-Bench's 120.8).
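For context, pass@1 gives credit only when a sampled solution passes all tests. A minimal sketch of the standard unbiased estimator from the Codex paper (RepoCod's exact evaluation harness may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: number of samples that pass all tests
    k: evaluation budget (k=1 for pass@1)
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 2 pass all tests -> pass@1 = 0.2
print(pass_at_k(n=10, c=2, k=1))
```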
Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with:
- whole-function generation,
- repository-level context,
- validation with test cases, and
- real-world complex tasks: the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00).
Introducing #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod, with:
- 67 repository-level, 67 file-level, and 66 self-contained tasks
- detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have <10% accuracy/pass@1 on RepoCod-Lite tasks
- Dataset: lt-asset/REPOCOD_Lite
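To poke at the tasks yourself, both datasets are on the Hub under the IDs above. A minimal sketch with the `datasets` library (the `train` split name is an assumption; check the dataset cards):

```python
from datasets import load_dataset

# Dataset IDs from the posts above; split names are assumptions.
repocod = load_dataset("lt-asset/REPOCOD", split="train")
repocod_lite = load_dataset("lt-asset/REPOCOD_Lite", split="train")

print(len(repocod), len(repocod_lite))  # expected: 980 and 200 instances
print(repocod[0].keys())                # inspect the task fields
```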
What a week! A recap of everything you missed ❄️ merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of the o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina released Jina-CLIP-v2: 0.98B multilingual multimodal embeddings
> Apple released AIMv2: new SotA vision encoders
LLMs 🦙
> AllenAI dropped a huge release of models, datasets, and scripts for Tülu, a family of models based on Llama 3.1, aligned with SFT, DPO, and a new technique they developed called RLVR
> Jina released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: the synthetic dataset used to align SmolLM2 via supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs
Image Generation 🖼️
> Black Forest Labs released FLUX.1 Tools: four new models for different image modifications and two LoRAs for image conditioning and better steering of generations
Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
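A minimal usage sketch, based on the library's README at the time; treat the `wrap_openai` import path as an assumption if the API has since moved:

```python
from openai import OpenAI
from observers.observers import wrap_openai

# Wrap a standard OpenAI client; interactions are then recorded
# by observers' default store for later browsing.
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```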
As a Frenchman and a European, I find the outcome quite sad.
The top 10 consists exclusively of US 🇺🇸 and Chinese 🇨🇳 companies (after the great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.
American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬
⚠️ Caution: this Chatbot Arena Elo ranking is not the most accurate, especially at high scores like these, because LLM makers can game it to some extent.
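For reference, Arena-style leaderboards aggregate pairwise human votes into ratings (Chatbot Arena has since moved to a Bradley-Terry fit, but the classic Elo update conveys the idea, and why selectively collecting favorable battles can shift scores). A toy sketch:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a battle; score_a is 1 (A wins), 0.5 (tie), 0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# A 1300-rated model beating a 1200-rated one gains only ~11 points,
# so at the top of the leaderboard small vote effects matter a lot.
print(elo_update(1300.0, 1200.0, score_a=1.0))
```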
✨ Fine-tuned with CoT data (open-source + synthetic).
✨ Expands the solution space with MCTS, guided by model confidence.
✨ Novel reasoning strategies & self-reflection enhance complex problem-solving.
✨ Pioneers LRMs (large reasoning models) in multilingual machine translation.
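The post doesn't spell out the mechanics, but "guided by model confidence" plausibly means scoring MCTS rollouts with the model's own token probabilities. A hypothetical sketch of such a node-value function (illustrative, not the authors' code):

```python
import math

def confidence_value(token_logprobs: list[float]) -> float:
    """Score a reasoning rollout by its average token probability.

    token_logprobs: log-probabilities the model assigned to the tokens
    it generated along this rollout. A higher average probability is
    read as higher confidence and used as the MCTS node value/reward.
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

# Example: a confident rollout vs. a hesitant one
print(confidence_value([-0.1, -0.2, -0.05]))  # ~0.89
print(confidence_value([-1.5, -2.0, -0.9]))   # ~0.25
```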
🌍 The leaderboard is available in both Japanese and English
📚 Based on the evaluation tool llm-jp-eval, with more than 20 datasets for Japanese LLMs
📊 The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
💻 For users' comfort, we chose a horizontal UI and implemented light and dark themes in Gradio
🔬 The radar chart provides a very interesting visualization of the metrics!
🌱 We are using the Japanese research platform MDX, so please be patient!
⚡ LLMs bigger than 70B will be evaluated soon…
How do you say "GPUs Go Brrr" in Japanese? -> GPUがブンブン~! (pronounced "GPU ga bunbun!") 🔥
✨ Unified 3D generation & text understanding.
✨ 3D meshes as plain text for seamless LLM integration.
✨ High-quality 3D outputs rivaling specialized models.
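"3D meshes as plain text" means serializing vertices and faces in an OBJ-style format the LLM can read and emit as ordinary tokens. A toy illustration (not the model's exact serialization):

```python
# A unit triangle as OBJ-style plain text: "v" lines are vertices,
# "f" lines are faces indexing into the vertex list (1-based).
mesh_as_text = """\
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
f 1 2 3
"""

# Because it is ordinary text, it can be dropped straight into a prompt:
prompt = f"Describe this 3D object:\n{mesh_as_text}"
print(prompt)
```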
Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B that excels at chat + function calling/JSON/agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc
Orca Agent Instruct by Microsoft - 1 million instruction pairs covering text editing, creative writing, coding, reading comprehension, etc. - permissively licensed
microsoft/orca-agentinstruct-1M-v1
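A quick way to peek at the data without downloading all 1M pairs is streaming with `datasets`; the per-task split name below is an assumption, so check the dataset card:

```python
from datasets import load_dataset

# Stream a few examples; the dataset is organized into per-task
# splits on the Hub, and this split name is an assumption.
ds = load_dataset(
    "microsoft/orca-agentinstruct-1M-v1",
    split="creative_content",
    streaming=True,
)
for example in ds.take(3):
    print(example)
```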
Models
💻 Coding: the Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction-SFT'd versions and their datasets! 💗
🖼️ Image/Video Gen: Alibaba's vision lab released In-Context LoRA: 10 LoRA models on different themes based on Flux. Also, Mochi, the SotA video generation model with an Apache 2.0 license, now comes natively supported in diffusers 👏
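Native diffusers support means generation is a few lines. A sketch assuming the `genmo/mochi-1-preview` checkpoint and the `MochiPipeline` class from recent diffusers releases (needs a large GPU):

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit on a single large GPU

frames = pipe(
    "a corgi surfing a wave at sunset",
    num_frames=84,
).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```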
🖼️ VLMs/Multimodal: NexaAIDev released Omnivision 968M, a new vision language model aligned with DPO to reduce hallucinations; it also comes with GGUF checkpoints 👏. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search.
🎮 AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay 🤯
Datasets
Common Corpus: a 2T-token text dataset with a permissive license for EN/FR, drawn from various sources: code, science, finance, culture 📖
Lotus 🪷 is a new foundation model for monocular depth estimation ✨ Compared to previous diffusion-based MDE models, Lotus is modified for dense prediction tasks. The authors also released a model for normal prediction 🤗 Find everything in this collection: merve/lotus-6718fb957dc1c85a47ca1210
🌍 I’ve always had a dream of making AI accessible to everyone, regardless of location or language. However, current open MLLMs often respond in English, even to non-English queries!
🚀 Introducing Pangea: A Fully Open Multilingual Multimodal LLM supporting 39 languages! 🌐✨
The Pangea family includes three major components:
🔥 Pangea-7B: A state-of-the-art multilingual multimodal LLM capable of 39 languages! Not only does it excel in multilingual scenarios, but it also matches or surpasses English-centric models like Llama 3.2, Molmo, and LLaVA-OneVision in English performance (see the usage sketch after this list).
📝 PangeaIns: A 6M multilingual multimodal instruction tuning dataset across 39 languages. 🗂️ With 40% English instructions and 60% multilingual instructions, it spans various domains, including 1M culturally-relevant images sourced from LAION-Multi. 🎨
🏆 PangeaBench: A comprehensive evaluation benchmark featuring 14 datasets in 47 languages. Evaluation can be tricky, so we carefully curated existing benchmarks and introduced two new datasets: xChatBench (human-annotated wild queries with fine-grained evaluation criteria) and xMMMU (a meticulously machine-translated version of MMMU).
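A hypothetical usage sketch for Pangea-7B, assuming a transformers-compatible LLaVA-Next-style checkpoint; the model ID, prompt template, and processor class here are assumptions, so defer to the model card:

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Model ID and prompt format are assumptions; check the model card.
model_id = "neulab/Pangea-7B-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")
prompt = "<image>\n这张图片里有什么？"  # query in Chinese: "What is in this picture?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# A multilingual model should answer in the query's language, not English.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```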