There's a HUGE drop in popular knowledge from v2 to v2.5.
Qwen2 72b scored 73.9 on my popular knowledge test (movies, songs, games, sports...) compared to 77.9 for Llama 3.1 70b, and between 62-64 for Llama 3.1 8b & Gemma2 9b.
However, Qwen2.5 72b scored about the same as Qwen2 7b (~50). I did a vibe check with Qwen2.5 32b & 7b and the same thing is happening (there's a huge drop in general knowledge compared to v2). Thanks for the v2.5 family, but I fail to see how hallucinating like crazy about what people care most about (popular knowledge) in order to achieve small bumps in test scores (e.g. MMLU) is a reasonable compromise.
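For context on how these scores are produced: it's just blind recall graded against reference facts. Here's a rough sketch of the idea; the question set, the `ask_model` backend, and the simple containment-based grading are illustrative stand-ins, not my actual harness:

```python
# Illustrative pop-knowledge scorer: ask each question blind, then grade
# the free-form answer against the reference facts it must contain.

QUESTIONS = [
    ("Who sang the modern hit song Dear Future Husband? "
     "What album is it from? And what year was it released?",
     ["Meghan Trainor", "Title", "2015"]),
    # ...hundreds more across movies, songs, games, sports, etc.
]

def ask_model(prompt: str) -> str:
    """Stand-in for whatever local (GGUF) or hosted backend is being tested."""
    raise NotImplementedError

def score(questions=QUESTIONS) -> float:
    per_question = []
    for prompt, facts in questions:
        answer = ask_model(prompt).lower()
        # Simple containment check; a real grader needs fuzzier matching
        # and partial credit for near-misses.
        hits = sum(fact.lower() in answer for fact in facts)
        per_question.append(100 * hits / len(facts))
    return sum(per_question) / len(per_question)  # 0-100, like the scores above
```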
Example: "Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?"
Even small LLMs get this one right, including tiny little Gemma 2 2b (Meghan Trainor; the album is Title; the year is 2015) and your 10x smaller Qwen2 7b.
But all Qwen2.5s, including the 72b version (below), got it wrong, along with most other similar questions about movies, shows, sports, games, and other areas of non-academic popular knowledge. I tested locally w/ GGUFs, but it still got it wrong on LMsys, so it's unlikely I configured something incorrectly. Qwen2.5 72b has about the same amount of popular knowledge as Gemma 2 2b despite being ~36x bigger.
Response: "The song "Dear Future Husband" was performed by Mandy Moore. It's from her album Silver Lining, which was released in 2003.
However, it's worth noting that while this version of the song is well-known, a more recent and widely popular version was covered by the band Fifth Harmony. Their version appeared on their album 7/27 and was released in 2016..."
It's crazy how censored this model is. Even censorship lovers from California didn't go that far.
I have not done a ton of testing but this model is awesome. I am super interested in the 32B Coder model. The 7B is very nice already.
@gefit I'm sure the coding versions are good at coding, and the math versions are good at math.
However, a general purpose instruct LLM needs to answer random questions about very popular domains of knowledge that random people come across, such as about top movies, games, music, TV shows, sports, celebrities, popular literature, and so on.
And it's not just about Q&A. How are you supposed to chat with an LLM that's so profoundly ignorant about what the majority of the population cares the most about (it's called pop culture for a reason, it's popular)? How are you supposed to write stories if you don't know the basic facts about what you're writing about?
I really wish LLM makers would stop chasing test scores and bragging about them (e.g. 'we're only 72b parameters but are achieving test scores comparable to a 405b model'). Did you train on better data than Meta? Did you train for longer with more compute? Did you do anything special? Better? Or did you just focus on the data that overlaps with the standardized LLM tests at the expense of the rest of humanity's knowledge? Qwen2.5 72b is still too error prone and low IQ for academic use, so the small boost in its academic performance really doesn't make it any more usable.
You may have boosted the test scores of Qwen Instruct, but it's now effectively useless as a general purpose LLM to anybody with half a brain and diverse interests.
I can report the same: it doesn't have as good world knowledge. But its translation is the best I've seen from an open source model, really unrivalled except maybe by Llama 405B, which I haven't tried. The question I always ask is "Who is Victor Tsoi and what songs did he release?" The output quality on this question is about as good as WizardLM-2-8x22B: it gets a lot of the songs right but makes up a good chunk of them.
@nonetrix Qwen2.5 72b does very well with the top 0.01% of pop culture (e.g. the cast of the TV show Friends), but there's a sudden drop-off, and Viktor is arguably the most famous Russian singer, so I'm not surprised Qwen2.5 does well on this question, although it only appeared to get 4 of the 8 songs it listed correct. I tested Llama 3.1 70b Q4 and it got 7/7 correct (it initially listed only 4, so I asked for a few more), but it did make a couple of translation errors (e.g. Quiet vs Calm Night).
Llama 3.1 70b: "Viktor Robertovich Tsoi was a Soviet musician, singer, and songwriter who co-founded the Russian rock band Kino. He is widely regarded as one of the most influential and iconic musicians in Russian history.
Some of his most popular songs include:
- "Khochu Peremen" ("I Want Changes")
- "Zvezda Po Imeni Solntse" ("A Star Called Sun")
- "Gruppa Krovi" ("Blood Type")
- "Posledniy Geroi" ("The Last Hero")
These songs are considered some of the greatest hits of Kino and Viktor Tsoi's solo career."
Llama 3.1 70b (list a few more): "..."Pachka Sigaret" ("Pack of Cigarettes")
"Sledi Za Soboi" ("Traces Behind")
"Spokoynaya Noch" ("Quiet Night")..."
Regardless, I ask a large number of these types of questions and even Llama 3.1 8b does far better than Qwen2.5 72b (64.9 vs ~52), which is very odd because Qwen2 72b did far better than L3.1 8b (73.9). There's simply no excuse for a 7b parameter LLM, let alone a 72b one, to be so profoundly ignorant about the lion's share of pop culture.
> However, a general purpose instruct LLM needs to answer random questions about very popular domains of knowledge that random people come across, such as about top movies, games, music, TV shows, sports, celebrities, popular literature, and so on.
> And it's not just about Q&A. How are you supposed to chat with an LLM that's so profoundly ignorant about what the majority of the population cares the most about (it's called pop culture for a reason, it's popular)? How are you supposed to write stories if you don't know the basic facts about what you're writing about?
Can't echo this enough; it's a shame how much data like this has been disregarded in recent LLM releases... Surely it's possible to include such things without any great detriment to reasoning/coding - and even then, I'd be able to put up with that given the existence of dedicated math/code models.
It would be a great differentiator b/w Qwen and other open models if this were to change in the future.
The previous Qwen needed pre-fill and other hints to write normally. I'm not sure this one is more censored. It wrote like a channer in the demo no problem, and seemed to know who Genshin characters and vtubers were. Then again, I can't argue with your benchmark tests.
It took finetuning into turbocat/magnum/etc. to make 2.0 shine and produce decent prose. As released it was similarly meh.
@jackboot Massively popular cultural information (the top ~0.01%) can be recovered from Qwen2.5 72b with perfect accuracy, such as the entire casts, and their respective character names, from the TV shows Friends and The Big Bang Theory (most watched shows globally). Same goes for Genshin (wildly popular in China), music legends like Madonna & Michael Jackson, and so on.
Normally LLMs have a smooth knowledge slope, with hallucinations gradually increasing as you ask about progressively less popular culture. This is the case with Qwen2 72b, Llama 3.1 70b, Gemma 2, the Mistrals (e.g. Small & Mixtral), and so on. (A quick way to probe this slope yourself is sketched after the cast lists below.)
In contrast, some models use a highly curated corpus to maximize test scores at a given size, most notably Phi 3, Yi, and InternLM. All three have the highest HF scores in their size ranges and can perfectly recall the top ~0.01% of cultural information, but then suddenly show a huge spike in hallucinations, with their ~8b LLMs scoring only 35-40 on my pop culture test (Llama 3.1 & Gemma 2 ~8b score 62+). I pasted Yi1.5-9b's Corner Gas cast output below as an example of how far off they are when it comes to reasonably popular culture (a top 5 show in Canada). But again, it got the entire casts of Friends and The Big Bang Theory right without a single error. Even Yi1.5-34b only scores 52.2 (not much better than Yi 9b despite being much larger).
Qwen2.5 is now doing the same. They've decided to sacrifice the large bulk of popular culture information in order to add, and train longer on, tokens that have a higher probability of showing up on standardized LLM tests like the MMLU.
- Hank Pomerleau - Percy Blain
- Norm MacDonald - Norm Udstrand
- Edie Clifton - Lorna Wilson
- Carla Pomerleau - Krysten Henderson
- Arthur Pratt - Paul Hermann Goldsmith (He was a voice actor for Arthur's character)
- Gordon Bell - Andrew Millar
Should be...
Brent Leroy (Brent Butt) - Main Character
Lacey Burrows (Gabrielle Miller) - Restaurant Owner
Hank Yarbo (Fred Ewanuick) - Friend
Oscar Leroy (Eric Peterson) - Father
Emma Leroy (Janet Wright) - Mother
Davis Quinton (Lorne Cardinal) - Cop
Karen Pelly (Tara Spencer-Nairn) - Cop
Wanda Dollard (Nancy Robertson) - Employee
Fitzy (Cavan Cunningham) - Mayor
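If you want to see the slope vs. the cliff for yourself, the probe is simple: bucket recall questions by the popularity of their subject and check accuracy per bucket. A minimal sketch; the tiers, the example prompts, and the `grade` backend are hypothetical placeholders, not my actual test:

```python
# Hypothetical probe of the "knowledge slope": accuracy per popularity tier.
TIERS = {
    "top ~0.01% (Friends cast)":   ["List the six main Friends characters and who played them."],
    "popular (Corner Gas cast)":   ["List the main cast of Corner Gas and their character names."],
    "long tail (smaller vtubers)": ["Who is senzawa and what content are they known for?"],
}

def grade(prompt: str) -> float:
    """Stand-in grader: query the model, compare against reference facts,
    return 0.0-1.0 with partial credit."""
    raise NotImplementedError

def knowledge_slope() -> None:
    for tier, prompts in TIERS.items():
        accuracy = sum(grade(p) for p in prompts) / len(prompts)
        print(f"{tier}: {accuracy:.0%}")

# A model with a normal slope (Qwen2, Llama 3.1, Gemma 2) degrades gradually
# down the tiers; a benchmark-tuned one (Phi, Yi, now Qwen2.5) aces the top
# tier and then falls off a cliff.
```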
That makes sense. It didn't know senzawa or fallen shadow (less popular vtubers) and hallucinated their descriptions. Sad, this kind of data is a gimme and replacing it with repetitive benchmark slop is worthless. People need to stop chasing benchmarks, their credibility is ruined anyway.
I'd like to take a contrarian view on this. Does it really matter? The model is very, very good at RAG/Tools/Agents and you can use this to fill in the blanks on almost anything with the latest information.
A lot of this popular knowledge in most LLMs is out of date by months (sometimes >1yr). Feeding it more pop culture just makes that information more sticky, and I find that it prevents the LLM from realizing that it doesn't know and needs to use an agent.
For example, I built a simple search/web-enabled agent and connected it to both the Qwen2.5 32b/72b instruct models. I then asked question after question about pop culture, for example the new Shogun TV series. It was able to find high-level answers to everything using Wikipedia/YouTube/IMDB/etc... What other shows is X in? When was that show released? How many awards did it win at the Emmys? Which awards? What did the winner say? etc... and it answered using the tools perfectly, with up-to-the-minute data.
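To be clear about what "simple" means here: conceptually the agent is just a loop like the sketch below. The `chat`/`web_search` stubs and the TOOL:/ANSWER: line protocol are illustrative placeholders, not my exact setup or Qwen's native function-calling format:

```python
# Minimal sketch of a search-augmented agent loop.
SYSTEM = ("Answer the user's question. If you don't know, or the info may be "
          "stale, reply with exactly: TOOL: search(<query>). Once you have "
          "enough info, reply with: ANSWER: <final answer>.")

def chat(messages: list[dict]) -> str:
    """Stand-in: send the conversation to a Qwen2.5 instruct endpoint."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Stand-in: hit a search API and return result snippets."""
    raise NotImplementedError

def agent(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = chat(messages)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("TOOL: search(") and reply.endswith(")"):
            # Run the requested search and feed the results back in.
            query = reply[len("TOOL: search("):-1]
            messages.append({"role": "user",
                             "content": f"Search results: {web_search(query)}"})
        else:
            messages.append({"role": "user", "content": "Use TOOL: or ANSWER:."})
    return "Gave up after max_steps."
```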
That said, I also noticed this knowledge gap vs. Qwen2-72b when I first tried it out, so I totally get the concern.
@CED6688 I agree that for expert users RAG is a good option for Q&A. But the vast majority of users will not be configuring RAG, nor want the latency or dependency on the internet.
Also, RAG has HUGE drawbacks. For example, the output isn't nearly as coherent. This might be OK for simple Q&A, but when doing things like chatting and writing stories, the relevant information simply needs to be contained within the LLM itself.
Also, it's not like they compromised a little world knowledge. Qwen2.5 72b has vastly less than Qwen2 72b, which was already lacking compared to Llama 3.1 70b. And there are countless millions, even billions, of people who care about this lost popular information. And for what? A tiny boost in math, coding... ability. You can get a much bigger boost by using a dedicated math, coder... LLM.
A general purpose instruct LLM simply MUST (I stress MUST) contain the most popular world knowledge to produce coherent outputs across all major use cases, including chatting and story telling. Nothing else, including RAG, will work.
And again, the boost in academic knowledge (e.g. MMLU), coding ability (e.g. HumanEval), math (e.g. GSM8k), IQ (e.g. Arc) etc. was relatively small. It wasn't remotely a reasonable trade-off for a general purpose instruct LLM. Who would you rather interact with, a normal person or a hermit that knows ~10x less about humanity, but has an IQ that's ~5 points higher?
Gonna throw my hat in the ring here and say that, as far as I'm concerned, Q&A is not where enthusiast-sized local models really shine anyway, simply by virtue of size limitations, at least not on their own.
> Also, it's not like they compromised a little world knowledge. Qwen2.5 72b has vastly less than Qwen2 72b, which was already lacking compared to Llama 3.1 70b. And there are countless millions, even billions, of people who care about this lost popular information. And for what? A tiny boost in math, coding... ability. You can get a much bigger boost by using a dedicated math, coder... LLM.
> A general purpose instruct LLM simply MUST (I stress MUST) contain the most popular world knowledge to produce coherent outputs across all major use cases, including chatting and story telling. Nothing else, including RAG, will work.
Well, good: we can use Llama 3 for pop questions and Qwen for technical tasks. I mean, I don't see the problem with a model optimized for certain kinds of tasks and skills, as long as we know what they are.
> I agree that for expert users RAG is a good option for Q&A. But the vast majority of users will not be configuring RAG, nor want the latency or dependency on the internet.
The majority of users use ChatGPT. Plus front ends are a thing; RAG can be made accessible. People running 72b models at home who want neither an internet dependency nor a front end are really marginal, and as stated before, they can use Llama 3.1 if it's better.
@Handgun1773 , I honestly think the points you're making, along with others' like @Simplepotat's, are valid and largely correct. For one thing, nearly all casual non-technical users aren't going to use open source AI models when models like Sonnet 3.5 & GPT4x are a web link away, have user-friendly interfaces, and offer a free tier.
However, I'm team open source for various reasons. Most notably, I don't want companies to act as the gatekeepers of information, refusing to do this or that (unless it's clearly illegal, like making meth), and I certainly don't want to be lectured for using very popular colloquialisms because they aren't appropriate for little kids. Using the popular proprietary models makes me feel like a little kid interacting with a nanny.
However, I still stand by my assessment that Qwen2.5 72b Instruct is a MAJOR regression. The relatively minor boost in STEM abilities (much less than the test scores indicate) doesn't come close to justifying the massive loss in world knowledge, including the inability to shoot the breeze, write stories... because it doesn't know shit about most popular things.
Also, you suggest just using Llama 3.1 70b, but the open source AI field is competitive and keeps bragging about test scores (e.g. 'we're only 72b yet match Llama 3.1 405b's MMLU'), so there's a good chance Meta, Google, and Mistral will start suppressing non-STEM popular world knowledge in order to catch up.
Sorry about the long rant, but I honestly see this as the beginning of the end of the open source AI community, reducing it to little more than autistic code helpers bragging about matching the test scores of proprietary models at a tiny fraction of their sizes while in reality being grossly inferior and vastly more ignorant about most things. And again, this makes no sense to me since open source code- and math-specific AI models are out there, including Qwen2.5 versions, so why turn the Qwen2.5 Instruct models, especially a massive 72b one, into a calculator and coder at the expense of nearly everything else, including the large bulk of humanity's world knowledge?
@phil111 I get your point, but I think you are over-reacting, in the sense that the world-knowledge use case is real, so I don't see all open source models shifting to maxing code benchmarks. Enterprises (probably Alibaba themselves) want to use them for different chatbot applications, alongside agent/tool calling. Diversity of models is good, and I don't think it will decrease.
Now, with this release, we know there might be a compromise to make between world knowledge and reasoning capabilities at a given parameter count. That's a good thing; it furthers our understanding of LLMs.
And for free, by the way: it's Meta/Alibaba/... that pour millions into these models and release them more or less open source. I am grateful for this.
As for my use-case, I'm really happy with qwen2.5: SOTA small coding/tooling LLM, very good multilingual fine-tuning base.
I also think you're making a good contribution by raising this point about the new Qwen models; it's valuable for further model development. But I don't agree with the dramatic twist.
@Handgun1773 , I stand by my overly dramatic assessment.
Dramatically spiking hallucinations across most popular domains of world knowledge for relatively small boosts in STEM scores (e.g. the MMLU), which frankly have an almost imperceptible impact on real-world performance (it still makes nearly the same number of obvious math & logic errors as Qwen2 72b), is in my opinion not even a remotely reasonable trade-off, especially when they also released SOTA math and coding models which are notably better at such tasks.
And since the Qwen series is held in high regard within the AI community, their decision to strongly favor STEM above all other popular domains of human knowledge puts other open source LLM makers in a very difficult position.
I think your belief that users & other companies (e.g. Meta, Google, and Mistral) will simply accept Qwen's dominance on test scores at a given parameter count is naive. There really are only two options: (1) they also suppress most of humanity's popular knowledge and train primarily on test-overlapping STEM knowledge in order to catch up, or (2) users are made aware of the fact that Qwen achieved its higher test scores by sacrificing tons of world knowledge and general purpose use cases. I'm hoping for the latter.
A major consequence of training near-exclusively on data that overlaps standardized LLM tests (STEM data), beyond being profoundly ignorant of the rest of humanity's non-STEM popular knowledge, is failing to fully utilize the parameters of larger models. For example, the 72b version of Qwen2.5 is almost identical to the 32b in my testing. The same goes for HF's testing of their base models (across all tests, not just the average).
One of the biggest problems with the AI early adopter community is that it's comprised nearly exclusively of high-IQ autistic coders. This is pushing the open source AI community away from wider adoption as makers start to get rewarded for turning models into little more than coding helpers while all other functionality drifts into a low-quality mess of hallucinations and incoherence.
@phil111 you argue about the absence of singer knowledge in Qwen2.5, but for me that's actually an advantage. If I were an architect on Qwen's development, I would do the same: cut the garbage that brings no real progress to humans and fill the training with more real knowledge. If you're sad about it, try Suno or your favorite (censored!) Wikipedia. Anyway, in a year or two Suno will displace all these useless idols who use their mouths to make millions and billions of dollars.
@fuckfuckyou11 I'm sorry, but Qwen2.5 is without a doubt a regression, and not just with pop culture. Let me explain.
After it scored much lower than Qwen2 72b on my pop test (68.4 vs 85.9, and that's after redoing parts of the test to make it easier), I decided to test academic STEM knowledge retrieval, which also dropped, though by a smaller amount.
The reason Qwen2.5 scores higher on multiple-choice STEM questions (MMLU) than Qwen2 is that it weakly holds more STEM info: enough to pick the answer out of a given list of options, but not enough to retrieve it accurately, or at all.
For example, when I asked it about the "Thorne–Żytkow Object (TZO)" it returned "thorne-żytowski star", adding "ski" to the name and calling it a star instead of an object, resulting in a partial score reduction in my more nuanced evaluation. However, when given multiple-choice options it still correctly chooses TZO, since multiple choice is a much easier nearest-match test. So in the end Qwen2.5 72b can score higher than Qwen2 72b on multiple-choice tests but notably lower on direct recollection (which is more realistic, and how nearly everyone asks an LLM for information).
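To make the gap concrete: a weakly held, garbled memory is still enough to win a nearest-match comparison against distractors, even while blind recall fails. A toy illustration using only stdlib string similarity (the distractor options are mine, and the names are ASCII-simplified):

```python
from difflib import SequenceMatcher

def pick_choice(recalled: str, options: list[str]) -> str:
    # Multiple choice is a nearest-match test: the garbled memory only has
    # to be closer to the right option than to the distractors.
    return max(options, key=lambda opt:
               SequenceMatcher(None, recalled.lower(), opt.lower()).ratio())

options = ["Thorne-Zytkow object", "Wolf-Rayet star",
           "Bok globule", "Herbig Ae/Be star"]
recalled = "thorne-zytowski star"  # roughly what Qwen2.5 retrieved blind

print(pick_choice(recalled, options))              # Thorne-Zytkow object: MC point scored
print(recalled == "Thorne-Zytkow object".lower())  # False: blind recall still fails
```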
In conclusion, what the Qwen team did was continue training Qwen2 on tons of data that overlaps the academic LLM tests (e.g. MMLU & MATH), as well as the very narrow and repetitive interests of the like-minded people who test LLMs on LMsys, boosting Qwen2.5's scores on said tests while significantly reducing the accuracy of blind information retrieval (how almost everyone asks LLMs for information; people simply never provide the LLM with options to choose from).
It has much more synthetic data, especially STEM synthetic data; maybe that's the reason for the general regression.
@sparsh35 I think you're right. Meta also trained on synthetic data when making Llama 3.1 70b from Llama 3 70b, and it lost a little world knowledge in the process. But Qwen2.5 lost far more of Qwen2's world knowledge. I think they just trained too long on too many additional STEM tokens, or were unable to protect existing knowledge the way Meta did.
Also, it's more difficult to create synthetic, diverse world-knowledge data, as you don't have verifiers for it the way you do in math through code and code execution. Meta has a much bigger dataset budget than Qwen, I think.
How did Qwen2.5 get higher results in tests then?
More tokens and a bigger SFT.
> How did Qwen2.5 get higher results in tests then?
High-quality 'logical' data seems to have a greater impact on model intelligence than low-quality human knowledge, even if that knowledge is accurate.
Phi showed this with the "Textbooks Are All You Need" paper.
I think @teknium is a big advocate of synthetic data, and he has produced some of the most functional finetunes. The Guanaco days are over; LIMA only applies when you don't have access to a lot of (synthetic) high-quality data.
@Handgun1773 Microsoft was wrong. Textbooks are not all you need.
They're all you need if you want high STEM test scores relative to parameter count on simple multiple-choice LLM tests, which don't remotely correlate with real-world performance.
For example, if you just ask a Phi model like 3.5 to write a story on a given topic (e.g. fishing), it will do an OK job, but it's basically just regurgitating a story it was trained on. However, if you prompt it to write an original story, the writing quality not only plummets, but it regularly writes absurd, contradictory things, often so absurd it's like a 3-year-old wrote it.
This is partly due to its extreme censorship (even to the point of removing contentious tokens from the corpus), making it profoundly ignorant of ubiquitous parts of humanity found in any PG-13 movie.
Anyways, textbooks and synthetic quality data aren't all you need unless they're balanced to reflect humanity's most popular knowledge, which isn't even remotely the case (they're ~95% STEM, especially coding and math). Additionally, synthetic data obsessively repeats the same phrases and words over and over again, making responses far less creative and adaptable. The entire Phi series is less than useless to >95% of the world's population (it vomits hallucinations, or otherwise errs, about the things they care most about), and Qwen2.5 is a HUGE regression from Qwen2 for >95% of the world's population. Most here can't see that because they're too narrowly focused on coding & STEM.