Concerns regarding design decisions based on purely academic benchmarks.

deleted

Firstly, thanks, this is a solid model.

However, I'm concerned because LLMs are starting to focus too heavily on academic knowledge that overlaps with the MMLU. This doesn't just include Qwen; it also includes others like Yi and Phi-3, though not Llama or Mistral/Mixtral.

For example, when you ask about very popular things known to countless millions, the hallucinations are beyond belief, such as in the example below for the cast of the popular Canadian TV show Corner Gas. This isn't isolated. The less likely information is to be included on the MMLU, no matter how popular said knowledge is, the greater the spike in hallucinations.

"1. Brent McCallum - played by Kevin Heffernan
2. Emma Jansen - played by Lisa Merchant
3. Lorne Munroe - played by Scott Thompson
4. Ruth-Anne Davis - played by Janet Wright
5. Hank Kimble - played by Graham Greene
6. Emma's Father - played by Paul Gross"

The true information is below.

Brent Leroy (Brent Butt) - Main Character
Lacey Burrows (Gabrielle Miller) - Restaurant Owner
Hank Yarbo (Fred Ewanuick) - Friend
Oscar Leroy (Eric Peterson) - Father
Emma Leroy (Janet Wright) - Mother
Davis Quinton (Lorne Cardinal) - Cop
Karen Pelly (Tara Spencer-Nairn) - Cop
Wanda Dollard (Nancy Robertson) - Employee
Fitzy (Cavan Cunningham) - Mayor

deleted changed discussion status to closed
deleted

To be perfectly clear, I'm not trying to suggest that the MMLU results aren't accurate, but only that the knowledge of 3 major LLM series (Phi, Yi, and Qwen) spikes when it comes to what overlaps with the MMLU (academia and STEM), and plummets when it comes to all other popular knowledge.

While this may be a reasonable design choice, it's nothing to brag about. Any company, like Meta and Mistral, could have easily focused on the subset of popular information that overlaps with the MMLU to significantly boost their MMLU score with the same amount of training and parameters.

The reason I personally feel it's a bad design choice is that the vast majority of users want very popular information from LLMs, and relatively large Phis, Yis, and Qwens (e.g. 34b) have far less of it than 7b Mistral and 8b Llama 3.

Another example is with music. One of my simple questions is "Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?". Most of my questions are deliberately confusing, but this is a freebie. It's a very popular song (even has a music video) with a distinct and fully given title, and with a couple of qualifiers (modern & hit). Mistral 7b and Llama 3 8b get all such questions correct.

But Qwen outputs things like "The modern hit song "Dear Future Husband" was sung by American singer Kehlani. The song is featured on her second studio album, "SweetSexySavage," which was released in 2017."

And this attempt to redirect to the limited knowledge it has keeps popping up. For example, when I asked about the cast of The Fifth Element, it said I must mean The Sixth Sense, then gave me that cast with lots of errors.

But again, Qwen2 is a strong model. It just knows very little unless it overlaps with the MMLU. So although its MMLU is 70.4, its overall knowledge is only comparable to a Mistral or Llama 3 with an MMLU of ~55.

Honestly, I think you can keep the discussion open for once; you're describing the disadvantages of certain design choices, which provides great feedback for the people pretraining these LLMs. And because taking that feedback into account is hard and expensive, being able to talk about it and having the details figured out is important. I think it's unfair to always assume people won't listen, and there is literally no drawback to always giving them the option to discuss. The community tab isn't meant to only be a comment section, after all.

Just maybe change the title to "Concerns regarding design decisions based on purely academic benchmarks." The MMLU isn't misleading if you only want to measure academic performance; it's only misleading if you want to use it as an indication of general knowledge or intelligence.

deleted

@nlpguy I like your title recommendation so I'm going to steal it.

The primary reason I close discussions is because I'm a fish out of water. I'm not a programmer beyond simple scripts, not an LLM fine-tuner, and not even a competent user (many of my complaints ended up being because I configured something wrong).

It's just that I'm becoming progressively more concerned with LLMs like Phi, Yi, and Qwen that know practically nothing about very popular information just because it's not academic. I realize such information doesn't improve reasoning or language skills like highbrow academic text does, but anything that's popular is high value simply because countless millions care about it.

deleted changed discussion title from The MMLU is misleading. to Concerns regarding design decisions based on purely academic benchmarks.

And I agree with your statement. I've been working on an LLM pretraining mixture myself and have been trying to include popular and cultural information. But that doesn't mean I have any idea what I'm saying. It's not important how much skill anybody has or how much they know here, in my opinion, because nobody knows everything. Instead, it's important to discuss things here, present theories, make observations, and be welcoming rather than dismissive, even when something is wrong from time to time. Then we can work together and solve these things step by step.

deleted

The result of my non-MMLU popular knowledge test for Qwen2 7b was 50.8.

For comparison, Mistral 7b v0.3 got 64.4, Llama 3 8b got 69.4, and Mixtral 8x7b v0.1 got 74.8, so +13.6, +18.6, and +24 points, respectively.

Also, it knocks the easy ones out of the park. For example, it got every character, and the actors who portrayed them, correct for the most-watched TV shows Friends and The Big Bang Theory. But it did progressively worse as popularity dropped.

For example, for the movie Four Rooms it didn't get anything right.

"1. Mr. Babadook - Tom Selleck
2. The Kid - Ted Raimi
3. The Old Man - Giancarlo Giannini
4. The Girl - Jenna Elfman
5. The Woman - Lolita Davidovich
6. The Waitress - Rosanna Arquette"

And for a Two and a Half Men question about Alan's ex-wives it didn't even answer with characters or actors from the show.

"The two actresses who played the roles of the ex-wives of Alan Harper (played by Jon Cryer) in "Two and a Half Men" were Constance Zimmer as Dr. Karen Espinosa and Sarah Shahi as Jamie Buchman. "

The fact that Qwen2 performs perfectly (not one hallucination) on a handful of extremely popular shows and movies seen by countless billions, but suddenly does vastly worse on moderately less popular movies, shows, singers, celebrities..., often returning 100% hallucinations, clearly shows that, like Yi and Phi, it not only aggressively filtered out data from web rips, but also from Wikipedia.

deleted changed discussion status to open

How does it do on Chinese pop culture? Same problem?

deleted

@jackboot That's a good point. I'd be interested if anyone can answer that question. All my prompts are in English, and primarily about English popular culture from the US, UK, Canada, Australia..., with a little from other cultures, such as the Japanese movie "Ghost in the Shell".

It's understandable that a Chinese model wouldn't focus on English. However, it is an English/Chinese LLM, it got a high English MMLU score of 70 (along with high scores on other English tests like WinoGrande), and English pop culture dominates globally far more than any other, including the global movie and music industries, as well as the bulk of the internet. Point being, English speakers are going to see the high English MMLU, WinoGrande... (higher than L3 8b) and make the wrong assumption about how well it will perform with the most popular English information.

I'll leave it to Chinese speakers to determine if Chinese popular culture knowledge sees the same sharp drop off relative to the Chinese MMLU.

They probably filtered out, downsampled, or lowered the weight of this "irrelevant" knowledge. However, for a 7B model I would argue this is a sensible design choice.
7B models have such limited capacity that it is unreasonable to expect them to be good at general world knowledge. Doing well in one area often means sacrificing performance in other areas. So it might be wise to focus on commonsense and textbook knowledge that helps the model build a sane worldview and reasoning ability. Popular cultural knowledge, while also being important to many users, is of lower priority.
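
To make "lowering the weight" concrete, here is a minimal sketch of what per-domain sampling weights in a pretraining mixture can look like. This is purely an illustration of the general technique, not anything from the Qwen report; the domain names and numbers are made up.

```python
# Hypothetical per-domain sampling weights for a pretraining mixture.
# Up-weighting textbook-style sources and down-sampling entertainment pages
# is one way pop-culture knowledge can quietly disappear from a model.
import random

mixture_weights = {
    "textbooks_and_papers": 3.0,      # up-weighted: MMLU-style academic knowledge
    "code": 2.0,
    "wikipedia": 1.0,
    "web_crawl_entertainment": 0.2,   # down-sampled: TV, music, celebrity pages
}

def sample_domain(weights):
    """Pick the domain of the next training document, proportional to its weight."""
    domains = list(weights)
    return random.choices(domains, weights=[weights[d] for d in domains], k=1)[0]

# With these numbers, only ~3% of training documents would come from the
# entertainment crawl, so that knowledge gets far fewer gradient updates.
print(sample_domain(mixture_weights))
```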
When I use these small models I keep reminding myself that they all have a very high risk of hallucination when it comes to any knowledge that is even slightly less well-known. Below is llama3-8B believing Brahms never wrote his fourth symphony :)

brahms_failure.png

By the way, the Chinese community is actually even harsher about this. Many Chinese users test models on traditional culture questions, like writing a poem in ancient Chinese style or explaining historical stories, etc. This probably puts even more strain on the knowledge capacity of the small Qwen models.

@liyichao97 If Qwen had to make a choice between Chinese and English pop culture, I'm glad they chose Chinese, since English already has Llama 3 8b and Mistral 7b.

But a couple of points. Firstly, there apparently is enough capacity if you restrict yourself to one dominant language, since L3 8b has a lot more pop culture knowledge. A score of 70 vs 50 is just as notable as the MMLU academic knowledge difference between Mistral 7b and Mixtral 8x22b, or GPT3.5 and GPT4. It's an overwhelming difference. Secondly, I did test L3 70b and Qwen2 72b (briefly, since less than half a token per second), and the same disparity in pop culture knowledge still exists, so I don't think they're using a different corpus.

And what's really baffling me is that not only is the pop culture knowledge similar to Qwen1.5's despite a much lower MMLU (60 vs 70), but the really hard academic questions which normally rise and fall with the MMLU are also the same, as confirmed by their GPQA scores (actually lower with Qwen2). A 10-point boost in MMLU is a HUGE jump (Mixtral 8x7b vs Mixtral 8x22b), yet all other knowledge remained the same, including more difficult academic questions (GPQA). This confuses me.

@liyichao97 And in reference to the L3 8b hallucination, please bear in mind that nearly all quantizations are flawed. Please try the Q4_0 provided by GPT4All. I ran your question with temp 0 using it and got the following.

Johannes Brahms' Symphony No. 4 in E minor, Op. 98 is indeed a masterpiece. Composed between 1885-1889, it's considered one of the composer's most significant and beloved works.

The symphony consists of four movements:

  1. Allegro non troppo - The first movement sets the tone for the entire work, with a lively and energetic opening theme.
  2. Andante moderato - The second movement features a beautiful, lyrical melody that showcases Brahms' mastery of melodic writing.
  3. Allegro giocoso - The third movement is a playful, whimsical scherzo that adds a touch of humor to the overall piece.
  4. Allegro energico e passionato - The final movement is a dramatic and intense conclusion, featuring a soaring melody and a sense of resolution.

Throughout the symphony, Brahms employs his characteristic use of counterpoint, polyrhythms, and rich harmonies, creating a complex yet cohesive work. The fourth symphony has been praised for its unity, balance, and sheer musicality.

Would you like to know more about Brahms or his other works?

Edit: I also strongly recommend that when using temperature you turn everything else off, such as top P and top K, and set temp to 0.3 and min P to 0.6. The recommended default settings for every LLM I tested are insanely low performing. Not only do hallucinations spike, but so do other errors like story contradictions, rhyming errors in poems, and so on. Temp 0.3 with min P 0.6 is enough to write drastically different stories in response to the same prompt, but without going off the rails.
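
For what it's worth, here is a minimal sketch of those sampler settings using llama-cpp-python (just my illustration, assuming a recent version that exposes min_p; the model file name is a placeholder, and the parameter names are that library's, not GPT4All's UI):

```python
# Sketch: low temperature plus an aggressive min-p cutoff, with top-p/top-k disabled.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # placeholder local GGUF

out = llm(
    "Who sang the modern hit song Dear Future Husband? What album is it from?",
    temperature=0.3,  # low but nonzero, so stories still vary between runs
    min_p=0.6,        # the min-p cutoff recommended above
    top_p=1.0,        # 1.0 effectively disables top-p
    top_k=0,          # 0 disables top-k in llama.cpp
    max_tokens=128,
)
print(out["choices"][0]["text"])
```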

> And what's really baffling me is that not only is the pop culture knowledge similar to Qwen1.5's despite a much lower MMLU (60 vs 70), but the really hard academic questions which normally rise and fall with the MMLU are also the same, as confirmed by their GPQA scores (actually lower with Qwen2). A 10-point boost in MMLU is a HUGE jump (Mixtral 8x7b vs Mixtral 8x22b), yet all other knowledge remained the same, including more difficult academic questions (GPQA). This confuses me.

Oh, I didn't notice the GPQA score declined for the 7B. That's truly confusing. Also, that's quite a bit lower than llama3-8B.
I think they just downsampled and filtered their pretraining data a bit too much, and while that boosts reasoning and understanding of textbook knowledge, it comes with some consequences. Personally I am satisfied with Qwen2's much improved math and programming ability, but perhaps I should watch whether its domain knowledge becomes a problem in the long run, especially for the larger models. The name Qwen comes from "qianwen" or "千问" in Chinese, which literally means "ask me a thousand questions", and, perhaps not a coincidence, the deepest impression Qwen-Max made on me (back last December) was actually its incredibly broad domain knowledge. I really hope they can keep up the good work and not decline in their upcoming larger and proprietary models.

As for the Brahms question, I asked that on Poe, and I assume that model was not quantized. Temperature could be a problem, since I don't know what temp Poe was using. That was just one of the bad cases I encountered.

My experiences with Qwen have been interesting; it seems to have okay reasoning sometimes, then it goes awry.
From cannibalism
image.png
To rejection
image.png

@saishf Thanks for sharing. That's a perfect example. Riddles and jokes are also part of pop culture and are mostly missing from Qwen, Yi, and Phi. AI models, including GPT4, rarely (if ever) solve such riddles without being clued in, and since Llama 3 8b was trained on a full rip of the web it had seen this one, hence it responded with the following...

"This is a classic lateral thinking puzzle. The answer lies in the fact that the woman is a photographer.

Here's what happened:

  1. She "shoots" her husband (takes his picture).
  2. Then, she develops the film and "holds him underwater" for five minutes (the chemical process of developing black-and-white film requires the prints to be submerged in a bath for a certain amount of time).
  3. After that, she "hangs him" (hangs the developed print up to dry).
  4. Finally, they enjoy a lovely dinner, probably celebrating the successful photo shoot!

What do you think? Did I get it right?"

I tested the "Canadian TV show Corner Gas" prompt on both Llama3-70b-instruct and Qwen2-72b-instruct. No hallucinations were found in either response.

Possibly this issue is caused by a small amount of English pre-training data. Following your "Dear Future Husband" question, I tested Llama3-70b-instruct and Qwen2-72b-instruct on a Chinese pop song question.
llama3-70b-vs-qwen2-72b.png
Here, Llama3-70b-instruct's answer is a full hallucination (singer name wrong, album name wrong, release date wrong), while Qwen2-72b-instruct does much better: it gets the correct singer and album, and a partially correct release date (correct year but incorrect month and day). I suppose this example demonstrates that such performance differences are caused by the proportion of Chinese (for Llama3) or English (for Qwen2) data in each model's pre-training data.
