Regarding Concerns about MMLU Scores
There's a tight correlation between MMLU (diverse knowledge) and parameter count because the additional information has to be stored somewhere, and in LLMs that means more parameters.
For example, this is why LLM families like Llama 2 (7b, 13b, 34b & 70b) show a smooth, predictable MMLU-to-parameter-count gradient that's absent from nearly all other tests, such as WinoGrande.
Anyway, it's easy to get a feel for the true MMLU score of any LLM by asking a series of fringe esoteric questions, and Yi-34 dense is no more knowledgeable than Mixtral 48 sparse (~70 MMLU for both). Yi-34's 77 MMLU score is without a doubt due to test contamination in the foundational model. And when I tested this LLM it got no more fringe esoteric questions right, so it too has a true MMLU of ~70.
If you were going to deliberately cheat you obviously wouldn't add it all to MMLU, so something odd happened to boost the MMLU from 70 to 85.6. Perhaps, since Yi-34 is already highly contaminated with MMLU data (a 7-point boost), you did something that brought out even more of that contamination.
Regardless, the only way for a modern dense LLM (e.g. Mistral or Yi) to achieve an MMLU score of 85 is to have ~270 billion parameters (a ~5-point MMLU increase for every doubling of parameters). It's simply not theoretically possible to store that much information in a current-generation 34b dense LLM.
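For concreteness, here is a minimal sketch of that arithmetic in Python, taking the thread's own assumptions as given (a ~5-point MMLU gain per doubling of dense parameters, and a ~70 true baseline for a 34b model); the rule of thumb itself is the claim under debate, not an established law.

```python
# Parameter count implied by the "5 MMLU points per doubling" rule of thumb.
# All constants here are the thread's assumptions, not measured values.

def params_for_mmlu(base_params_b: float, base_mmlu: float,
                    target_mmlu: float, points_per_doubling: float = 5.0) -> float:
    """Dense parameter count (in billions) needed to reach target_mmlu."""
    doublings = (target_mmlu - base_mmlu) / points_per_doubling
    return base_params_b * 2 ** doublings

# Three doublings from a 34b model at MMLU ~70:
print(params_for_mmlu(34, 70, 85))  # 272.0, i.e. the ~270B figure above
```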
You are making groundless accusations. Especially when you compare Yi-34 dense to Mixtral: I think the former has a larger activated parameter count. At the very least, I hope you won't make inappropriate remarks about other models here.
But I need to emphasize that the MMLU score does not mean anything special, nor is it strictly a function of parameter count; it is more data-oriented, which is consistent with our approach. In an updated version, we used more than 50M pieces of data crawled from the web and comprehensively rewritten using multiple web pages and long-context LLMs, of which more than 10M pieces were produced by GPT-4-level models.
Of course, the model in this repo is not the final version. However, changes in MMLU score show up quickly, which indirectly supports the theory that an LLM is a compression model. By simulating the compression results of larger models on a wider range of data, we can obtain corresponding improvements in compression capability.
Contamination detection on MMLU is also provided in the repo, and it is at a safe level. The only potential concern is that we did not actively filter test-set content out of the web-crawled data.
In fact, you are not the first person to make this claim, but it makes me lose confidence in releasing the whole dataset (I have released a 1M+ subset) and the new models. I spent more than on any synthetic dataset you can see here, dozens of times larger than the openhermes dataset, and what I got in return was subjective conjecture.
@JosephusCheung Contamination testing is very unreliable, which is partly why people have all but stopped flagging models on HF. And with a dataset as large as OH (~1M), and you claiming a much larger one, there's no way to avoid notable contamination. Part of why detection is unreliable is that contamination still boosts scores even when the data is reworded or stored in a different language, in this case most likely Chinese. This is almost certainly why Yi-34 has an MMLU score of 77, yet a true score of around 70.
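To make that failure mode concrete, here is a hypothetical minimal string-matching check in Python; the function and its threshold are illustrative, not any specific leaderboard's detector. It catches verbatim overlap but scores a paraphrased or translated copy of a test item at zero, which is exactly why this style of detection misses reworded contamination.

```python
# Hypothetical n-gram contamination check (illustrative only).
# It flags verbatim overlap between training text and a benchmark item,
# but a paraphrase or a Chinese translation of the same item shares no
# word-level 8-grams with the original and sails through undetected.

def ngram_overlap(train_text: str, test_question: str, n: int = 8) -> float:
    """Fraction of the test question's word-level n-grams found verbatim in the training text."""
    def ngrams(words):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    train = ngrams(train_text.lower().split())
    test = ngrams(test_question.lower().split())
    return len(test & train) / len(test) if test else 0.0
```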
I'm honestly not making accusations of deliberate contamination, but I am saying with 100% certainty that the true broad knowledge of this LLM (MMLU score) is around 70 and nowhere near 85.
And if it were, people would be dying to figure out what you did. Of the thousands of fine-tunes, not just of Yi-34 but of all other models, yours is one of only a few with a notable MMLU bump.
Finally, I read a lot of papers on this, including information provided by Meta, MistralAI and other makers of foundational models. And MMLU is not something you can ever notably change with fine-tuning. It's locked to the parameter count of a given technology. For example, by using far more sophisticated data filtering, Mistral was able to increase MMLU by ~5 points at a given parameter count over Llama 2's internet-dump approach. And Microsoft, by using nothing but low-redundancy textbook-quality data, plus avoiding data like pop culture that's largely excluded from the MMLU, was able to add another ~5 points at a given parameter count with Phi-2. This is currently the limit.
Take a look at the test questions. Nearly all don't require a high IQ or advanced language skills to get right. They primarily just require extensive knowledge of various fields of study. And a 5-point gain on MMLU requires about twice the world knowledge, hence the required doubling of parameter count. Again, Yi-34 has a true MMLU score of 70 (their score of 77 is due to contamination). And I tested this LLM and it has no more knowledge than any other Yi, so it too has an MMLU score of around 70. Something other than an increase in knowledge/data is accounting for the score of 85. And like I said, I suspect you stumbled on a way to bring out the contamination that's already in Yi (likely buried in non-English languages like Chinese, making traditional contamination testing all but useless).
Please consult this experimental model: https://huggingface.co/itsliupeng/llama2_7b_mmlu
The MMLU of llama2 7B went from 46.87 to 60.04 on such a small dataset.
As I mentioned in other answers, this is largely a matter of cumulative effects and dataset bias. Although I did not use such extreme reverse-retrieval data (I use web crawlers of my own instead), I do think unintentional contamination may exist, but within a reasonable range; see the contamination test.
As for your other remarks, for example your claim that Yi's reported score is much higher than its actual level: I hope you will not make such irresponsible remarks, and I also do not want you to criticize other people's work here. These are all conjectures based on your subjective experience. I think the performance of the model should be consistent on new test data that is isomorphic to MMLU but has different content, and this was also confirmed in my communication with the Yi developers.
And I think it is difficult to get the same results by directly cheating in training without damaging the general performance of the model. Even if you train directly on MMLU answers, you may not be able to get such an MMLU score, let alone with the model still working correctly on other tasks. However, I still do not believe the increase in MMLU score actually translates to improved performance on downstream tasks; I believe we are still far from OpenAI GPT-4, and any victories on narrow subdomains are not very meaningful. As such, I'm hesitant to tout this high MMLU score, as I don't see any tangible benefits it brings; I would rather present evidence of non-subjective contamination to avoid unfounded accusations and debates.
@JosephusCheung I have more than my fair share of blind spots, but what I do know is that this model, like Yi, didn't perform any better on esoteric knowledge than other models such as Mixtral that scored 70 on MMLU. And since even superhuman IQ and language skills can't bring the score up to 85 on a test like MMLU of expert knowledge across various domains, there's simply no way to notably increase MMLU with traditional fine-tuning. Perhaps there's a technique that can access the data better (e.g. LASER), but none of those techniques has so far made any real difference.
MMLU is currently baked into foundational models. You simply ain't going to boost it with fine-tuning, at least not with any fine-tuning techniques currently available.
And it's not that Yi cheated to gain 7 points. It takes a careful data-pruning effort across all languages to remove contamination from the corpus, and they clearly didn't do a thorough job.
I don't think MMLU is IQ or any other broad ability; it only reflects the model's ability to compress large-scale knowledge. And it is difficult for me to agree with the theory that Mixtral is better than Yi, whether on the score or on the parameters you mentioned (Yi's activated parameter count is much larger than Mixtral's and comparable to the sparse MoE's total parameter count, so it should theoretically be better at the same training scale).
The opinions you raise are mostly based on your personal experience and have nothing to do with MMLU. It's off topic.
Your comment makes me regret releasing this model.
@JosephusCheung It sounds like you agree that an MMLU score of 85 doesn't equate to any real-world difference. The vast depth of knowledge in models like GPT-4 and Opus becomes overwhelmingly apparent. Yi-34b doesn't come close. It's far smaller, so it shouldn't. If this LLM genuinely had the ability to achieve 85 on the MMLU, its knowledge would also become overwhelmingly apparent, yet it doesn't.
And that's my only point. Every ~5-point gain on the MMLU reflects a very noticeable jump in knowledge (about twice as much). So any LLM scoring 85 would be unmistakably different in my "personal experience", or in the experience of any other user.
And yes, Yi-34 is dense vs the sparse Mixtral-48b, so all things being equal it should have more net knowledge despite having fewer parameters, but not by much, especially considering its strong focus on both English and Chinese. Plus, I'm sorry, but Mixtral has superior build quality.
- The MMLU score is genuine, and the same goes for Yi.
- MMLU is correlated with model performance, but the relationship is not causal.
- This improvement can be achieved on different pre-trained models utilizing the same dataset. It is essentially highly compressed continual pre-training, and is reproducible.
The improvement is also reproducible when fine-tuning cohere command-r 35B with the same datasets.
In fact, the improvement on GSM8K is more significant, even though the dataset does not contain any task-oriented fine-tuning data like MetaMathQA, and the same should hold for long-context retrieval. I'm holding off on releasing future models due to the potential for more controversy over contamination.
I did some testing on this model today:
EQ-Bench
86.05 gpt-4-1106-preview
75.52 dolphin-2_2-yi-34b
72.68 Nous-Hermes-2-Yi-34B
72.68 CausalLM/34b-beta
71.62 Yi-34B-Chat
MAGI (extra-hard subset of MMLU + AGIEval)
79.42 CausalLM/34b-beta
77.85 gpt-4-0613
63.03 Nous-Hermes-2-Yi-34B
60.66 dolphin-2_2-yi-34b
57.1 Yi-34B-Chat
The outsize performance increase on the hardest questions in MMLU + AGIEval indicates contamination. And of course the fact that it beats GPT-4.
The MAGI subset is more sensitive to parameter size than the original full test sets. It tends to highlight contamination better than full MMLU because models that have memorised answers will have an outsize advantage on extra-hard questions that most models get wrong.
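As a sketch of that heuristic, assuming hypothetical inputs (a boolean per-question correctness array and a per-question difficulty score, e.g. the fraction of reference models that miss it); this is not MAGI's actual code:

```python
import numpy as np

def difficulty_curve(correct: np.ndarray, difficulty: np.ndarray,
                     n_buckets: int = 5) -> np.ndarray:
    """Per-bucket accuracy, ordered from the easiest to the hardest questions."""
    order = np.argsort(difficulty)                        # sort questions by difficulty
    buckets = np.array_split(correct[order].astype(float), n_buckets)
    return np.array([b.mean() for b in buckets])

# A clean model's accuracy falls off steeply in the hardest bucket; a model
# that has memorised test items stays comparatively flat there, which is the
# outsize-advantage signal described above.
```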
We don't see the same (or any, really) corresponding performance improvement on EQ-Bench over the base.
These are just results & my interpretation of them. FWIW I don't think you were cheating, and I don't think there's any utility to making accusations of cheating. It's hard making large training sets, and contamination is a fact of life. Perhaps you could consider releasing the full training set and the community can help you with decontam.
I hope you can tell me why a high MMLU score should imply a high EQ-Bench score. I can't follow the logic, and I think this is somewhat off-topic.
We don't see the same (or any, really) corresponding performance improvement on EQ-Bench over the base.
I have no idea why there should be.
I think the improvement is not balanced, which leads to drastic changes in narrow areas. I will write more here later.
EQ-Bench seems far from the target tasks we focus on. Therefore, compared with other 34B fine-tunes, the absence of an obvious decline is better than expected.
We all know that the author of dolphin deliberately introduced a lot of "humanized", purpose-designed corpus, so this is also to be expected.
And you can see that dolphin-2_2-yi-34b and Nous-Hermes-2-Yi-34B are also inconsistent across the two benchmarks.
Also, I think this is irrelevant, because our main work is continual recall of the pre-training corpus; we believe the expected goals are well achieved, and the theory of LLM as text compression has been borne out in practice. I need to point out that MMLU does not contain many logical elements compared to other benchmarks. Therefore, as we continue to train with the goal of recalling the pre-training corpus, this kind of improvement in a narrow field is expected, while avoiding a significant decline in general logical reasoning is not easy.
I can also share a few conclusions that can be quickly verified here: 1. This kind of recall is precision-sensitive and does not survive the introduction of quantization in training or inference. 2. Even if you train directly on the answers, you cannot get similar results, which we have recently verified.
We are reviewing the potentially contaminating elements contained in the dataset, but I need to inform you that this is no smaller an undertaking than synthesizing the dataset in the first place. Eventually we may publish a conclusion, but we cannot exclude this part of the training - ālea iacta est. We expect any exclusions to result in downgrades elsewhere.
@sam-paech Thanks for doing the work. Your reasoning is sound and seems conclusive to me. Contamination can creep in from anywhere.
Let me clarify again that this improvement should be similar when you use completely new data in the style of MMLU. Moreover, I believe there should be no performance degradation in the published model on MMLU variants that "change the order or expression of options".
I don't think MMLU is IQ or any other broad ability; it only reflects the model's ability to compress large-scale knowledge.
I believe we are still far from OpenAI GPT-4, and any victories on narrow subdomains are not very meaningful. As such, I'm hesitant to tout this high MMLU score, as I don't see any tangible benefits it brings; I would rather present evidence of non-subjective contamination to avoid unfounded accusations and debates.
I'm reopening this discussion for any lingering questions about this model's MMLU performance or its performance on MMLU-style tasks. If your concern is about the discrepancy between the MMLU improvements and those observed on other benchmarks, I'd appreciate it if you could first elaborate on why you expect them to be consistent.
If you're unfamiliar with MMLU, please take some time to learn about it before engaging in this discussion.
If your concern is about the discrepancy between the MMLU improvements and those observed on other benchmarks, I'd appreciate it if you could first elaborate on why you expect them to be consistent.
I've benchmarked hundreds of models. It's overwhelmingly the case that improvements on EQ-Bench predict corresponding improvements on MAGI / MMLU. EQ-Bench (the original v1 at least) has an r=0.97 correlation with MMLU. I think that's reason enough to expect that a performance increase on MAGI / MMLU will be reflected to some degree in EQ-Bench scores.
It's not impossible for there to be counter-examples. However the magnitude of the disparity, and also the outsize performance increase on the hard/most discriminative questions in MMLU are what makes this result an extreme outlier.
Outliers want to be explained. People are curious.
@sam-paech
You need to know that correlation does not imply causation. For a similar example, see refrigerator mother theory.
Also, you can read the content of these two benchmarks. I don't think they are particularly related to each other.
It's like many models perform differently on GSM8K than their base model. I don't know what makes you think MMLU is a different case.
According to your statement, the same argument would hold if you replaced MMLU with GSM8K, because good models are usually well-rounded.
@JosephusCheung Yes, GSM8K is also a major issue because of the MetaMathQA fine-tuning dataset (natural-language solutions to simple mathematical problems).
Personally, I don't think it should be used, even though it's not technically contamination, because the boost in real-world mathematical abilities is small relative to the boost in GSM8K scores. It basically just minimizes stupid mathematical mistakes and doesn't improve the solving of complex mathematical problems.
However, there are 2 important differences.
(1) Math is a single domain, while MMLU covers very diverse domains (law, science, math, technology, philosophy...), so raising GSM8K can be done with a tiny fraction of the total data needed to raise MMLU scores by the same amount.
(2) Increasing logic, reasoning, coding and other cognitive abilities can raise GSM8K scores because many of the wrong answers aren't due to a lack of mathematical knowledge, but rather to cognitive mistakes or stupid errors during multi-step mathematical problem solving. In contrast, the MMLU is basically a bunch of single-step, simple multiple-choice knowledge questions. Relatively few questions are missed for want of understanding; the LLM simply failed to recall the requisite knowledge.
@sam-paech
Since you have put forward unproven hypotheses, let me put another here: I have not seen any base model on the EQ-Bench Leaderboard; all are SFT/chat models. I expected from the start that base models would not perform well here. Does this mean that EQ-Bench is coupled to the SFT process?
Since the MMLU scores we're discussing are based on the open_llm_leaderboard results and do not use the chat template, I think this is a case that is decoupled from the SFT process.
My conclusion here: EQ-Bench is a task that requires targeted SFT to achieve better performance, similar to MT-Bench and AlpacaEval but narrower, more like a subtopic of them.
the EQ-Bench Leaderboard; all are SFT/chat models. I expected from the start that base models would not perform well here. Does this mean that EQ-Bench is coupled to the SFT process?
It just means that EQ-Bench is a generative benchmark with prompts that require a certain degree of instruction following. Base models aren't generally able to follow the instructions well enough to produce parseable responses.
You need to know that correlation does not imply causation. For a similar example, see refrigerator mother theory.
Also, you can read the content of these two benchmarks. I don't think they are particularly related to each other.
It's like many models perform differently on GSM8K than their base model. I don't know what makes you think MMLU is a different case.
According to your statement, the same argument would hold if you replaced MMLU with GSM8K, because good models are usually well-rounded.
I'm not implying causation, I'm saying EQ-Bench is statistically highly correlated with MMLU and here we have a strong outlier.
GSM8K has one of the lowest correlations with Arena ELO (https://twitter.com/gblazex/status/1746295870792847562/photo/1), and likely with MMLU, because GSM8K tests a narrow set of abilities. Those abilities can be targeted with fine tuning somewhat independently of the more holistic abilities tested by Arena ELO or MMLU.
EQ-Bench has very high correlations with both Arena ELO and MMLU. This is why we would expect to see such disparities between MMLU vs GSM8K, but it's less statistically probable to see them with EQ-Bench vs MMLU. That's what makes this result an outlier.
The fact that it's an outlier is statistics, not interpretation.
EQ-Bench vs other benchmarks correlations are here:
https://arxiv.org/pdf/2312.06281.pdf
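To make the "outlier" argument concrete, here is a minimal sketch in Python with placeholder cohort scores (illustrative numbers, not real leaderboard data):

```python
import numpy as np

# Placeholder cohort scores for models where both benchmarks are known.
mmlu = np.array([60.0, 64.0, 68.0, 70.0, 73.0, 77.0])
eqb  = np.array([66.0, 69.5, 72.0, 73.5, 76.0, 79.0])

slope, intercept = np.polyfit(mmlu, eqb, 1)        # fit EQ-Bench ~ MMLU
resid = eqb - (slope * mmlu + intercept)

def standardized_residual(candidate_mmlu: float, candidate_eqb: float) -> float:
    """How many residual standard deviations a candidate sits from the cohort fit."""
    return (candidate_eqb - (slope * candidate_mmlu + intercept)) / resid.std()

# With r ~ 0.97 the residual spread is tight, so a ~15-point MMLU jump with
# no matching EQ-Bench gain lands many SDs below the fitted line.
print(standardized_residual(85.0, 72.7))
```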
All benches have very high correlations with each other, cuz good models are likely to be generalists. Mine is not, in the current context; the same goes for c4ai-command-r-v01 and Qwen1.5-7B-Chat - we are "outliers".
My conclusion here: EQ-Bench is a task that requires targeted SFT to achieve better performance, similar to MT-Bench and AlpacaEval but narrower, more like a subtopic of them.
Again, if you can accept that GSM8K can be “abnormally” high due to the introduction of MetaMathQA or similar data, then you should also understand the case of MMLU.
@JosephusCheung c4ai only has an MMLU of 68, which is low for its parameter count. Plus it only has an ARC of 65.5, again low for its size, which means it's kinda dumb. And when using it, the responses are often malformed and deviate from the prompt.
c4ai simply isn't a good LLM for its size, plus its responses are unusually unreliable so even if it was only 7b I wouldn't use it. This is why it's an outlier. None of its scores on the leaderboard, or on EQ-Bench, surprise me.
Besides, the high performance on the hard subset of MMLU is the primary anomaly.
c4ai simply isn't a good LLM for its size, plus its responses are unusually unreliable so even if it was only 7b I wouldn't use it. This is why it's an outlier. None of its scores on the leaderboard, or on EQ-Bench, surprise me.
After doing some testing, I came to the opposite conclusion from yours: they may overfit on special use cases (such as the special format in RAG recall), resulting in mediocre performance on other tasks. In fact, this model is very good and better than Yi-34B. You can refer to the LMSYS Chatbot Arena Leaderboard for this. I also shared with you the results of my further-trained version, and the improvements were similar on the same data.
But to be honest, it's hard for me to just release this model in the face of a witch-hunt-like atmosphere.
We are just discussing the results. I'm sorry you feel like it's a witch hunt.
All benches have very high correlations with each other, cuz good models are likely to be generalists. Mine is not
Could you expand on how your model is not a generalist, such that it makes sense for it to have outsize performance gains on mmlu?
@JosephusCheung Don't let me lead things off topic with my opinion about c4ai.
And here's another tangent because this discussion isn't going anywhere constructive anyways.
I have ABSOLUTELY no respect for chat arenas. NONE. Starling 7b beta is incredibly weak and performed horribly on my personal test, failing miserably in response to even superficial complexity, yet is way up in the chat arena.
Clearly something is wrong with the type of prompts, the people spending time in chat arenas, or something else. I strongly suspect people are just upvoting inferior responses because they like the sound of them better. Like when doctors got beaten in patient ratings by AI despite the fact that the human doctors' responses were technically superior. Since patients know very little about medicine, and are often overly sensitive, they favored the emptier and factually incorrect AI responses because they were superficially more respectful. The vast majority of people are dumb.
Yi-34b also performed horribly in my testing. Aside from superior language skills it scored lower than Solar on everything else, including reasoning and esoteric knowledge.
Could you expand on how your model is not a generalist, such that it makes sense for it to have outsize performance gains on mmlu?
This is related to the training method I have repeatedly described here: I did not fully focus on training instruction-following tasks (such as flan or orca). On the contrary, I take the LLM as a simulation of text compression, and I am simply using a smaller model to simulate a larger model's recall of the pre-training corpus, continuing to train seen and unseen knowledge in the same way. Of course, I also trained on chat data, but similar to Yi-34B-Chat, I used small, human-edited datasets.
You can certainly do further training on flan, orca, open hermes, etc. I'm just providing a reference point: continual pre-training that introduces new knowledge does not significantly degrade other performance, even without replaying the pre-training corpus. So you would expect the model to have higher accuracy in judging the "existence" of the facts expected in pre-training.
But as I said before, I did not do any special processing for MMLU. This improvement is also reproducible on other models. I don't think this score has any special meaning, especially if you consider MMLU as just another task rather than a measure of overall performance.
I understand both of your perspectives. You see MMLU as more than just a task, holding a special status perhaps due to its widespread use and historical significance. Indeed, some mathematical benchmarks like GSM8K were once considered to be directly related to model size, training scale, and parameter count, reflecting the scaling laws. However, I believe that as our training surpasses the Chinchilla-optimal point, dataset bias on specific tasks needs to be taken into account. The model's capabilities are not uniform, which does not contradict the scaling laws.
@JosephusCheung This contamination issue never goes smoothly because EVERYONE gets defensive. You're certainly not alone. And to make matters worse people with harsh personalities and limited knowledge like myself join the fray.
However, facts are facts. This MMLU score simply isn't theoretically possible. And as soon as the leaderboard evaluation results came back, you should have hidden/deleted the model until you figured out what happened.
^ Thanks for expanding. It would be great to get more concrete details (plus the full dataset) so we can dig into your approach.
I'll leave this here to help visualise the data that I have:
Even following your analysis, we should focus on the angles to the coordinate axes rather than the clustering of scatter points.
If you continue to introduce results from base models, I believe the angle will lean further towards the y-axis. If you introduce a LIMA-like model, I believe there will be similarly fancy patterns.
This MMLU score simply isn't theoretically possible.
I don't understand your reasoning for deifying MMLU. It's just a representation of a single perspective. Good models usually don't perform poorly on MMLU, but conversely, a high MMLU score doesn't necessarily signify anything special. This is also why I never actively mention this score, as it's not particularly meaningful here.
I have expressed this with restraint:
MMLU results should not be coupled with performance on other, different types of tasks. Even though good models often score highly on both MMLU and other benchmarks, this is not directly relevant to the MMLU score in this context.
MMLU does not represent the full capabilities of a model and is only a narrow use case. There is no need to mythologize the significance of this score based on its history or widespread use.
If your discussion is not based on these two points, you are welcome to continue replying here. I will not respond to any further comments regarding these two points.
@JosephusCheung Don't let me lead things off topic with my opinion about c4ai.
And here's another tangent because this discussion isn't going anywhere constructive anyways.
I have ABSOLUTELY no respect for chat arenas. NONE. Starling 7b beta is incredibly weak and performed horribly on my personal test, failing miserably in response to even superficial complexity, yet is way up in the chat arena.
Clearly something is wrong with the type of prompts, the people spending time in chat arenas, or something else. I strongly suspect people are just upvoting inferior responses because they like the sound of them better. Like when doctors got beaten in patient ratings by AI despite the fact that the human doctors' responses were technically superior. Since patients know very little about medicine, and are often overly sensitive, they favored the emptier and factually incorrect AI responses because they were superficially more respectful. The vast majority of people are dumb.
Yi-34b also performed horribly in my testing. Aside from superior language skills it scored lower than Solar on everything else, including reasoning and esoteric knowledge.
Honestly this is the strangest take here. I have not really pitched in because I think everyone's had a pretty valid point here on both sides, and I have felt that there's been more of an issue of communication and understanding than anything done in bad faith. But this one makes the least sense.
First of all, it's okay if you don't like the chat arena; it's cool to have your own opinions. But I think this brings up an important distinction: fact vs. opinion. You cannot claim there's something seriously wrong with a test based on users' opinions just because you don't agree with the results. I personally don't think base Yi and most of its fine-tunes are very good either, and I also thought Starling beta was kinda mid in my own tests. That doesn't make the chat arena results less valid. It just means I have a different preference from the general user base of the chat arena, because they are voting based on what they like more, or maybe my use case is very different, because they are choosing their own prompts, which may be different from the kind I might use. And that's totally okay. That's just what it is and the kind of test it is. It doesn't mean the test is bad or invalid, or inferior like you call it; since most of it is based on user opinion, you would be calling everyone's thoughts and input inferior. I think you are partway to understanding that all this test really provides is an overview of what users think are good-sounding answers, whether the prompts are superficial or not. So if that doesn't hold any substance for you, that's completely fair, and it's why it might not be a good way to evaluate models for you. But again, that doesn't make it a bad test, just a misunderstood one when it's used to represent anything it isn't.
Back on the topic of discussion. I hope things can continue without anyone getting too heated, taking things the wrong way, and of course with everyone remaining respectful to each other. Perhaps a good place to start would be what we consider contamination, because I don't think CausalLM has done any intentional cheating, and there might be a bit of friction and defensiveness because of a communication barrier (or maybe because of how some may have come off over text). The term contamination itself sounds offensive, doesn't it? I can understand feeling a little attacked or disrespected over accusations, but I think there are at least a few here who are just genuinely curious about the model, how it was trained, etc., and want to understand what kind of data might have been used for some benchmark results. I personally really like the 34b beta model because it does really well in my own personal tests, so benchmarks don't really matter to me here. It would be cool to have the datasets released to see what the community thinks and hopefully find ways to tune even better models, but that is of course not up to me or anyone else other than CausalLM, since it's their own data. I hope that with a little better understanding (my own is lacking still, so I am learning a lot from this discussion), any bad blood here, if there was any, can become water under the bridge, because it would be a shame if things were ruined over butted heads for the curious others who came here in good faith.
There are now over 5,000 evaluated LLMs on the HF leaderboard, and despite widely varying types and depths of fine-tuning, the MMLU scores remain very stable. This LLM is a ~1-in-1,000 outlier in that its MMLU score jumped significantly over the foundational model's. So it doesn't matter whether it was intentional: something undeniably unique happened here.
I took parts of the MMLU. It's just a multiple-choice knowledge test across all major domains of knowledge. If I didn't know an answer, obtaining new knowledge through research always helped me get it right. MMLU doesn't test IQ, errors across multiple steps... It's just a knowledge test, so the only possible ways to score notably higher on a knowledge test spanning all major domains of study are to (1) roughly double the information contained in the LLM to raise the MMLU by ~5 points, (2) focus almost entirely on core knowledge, as Microsoft did with Phi to boost its MMLU despite it only having 2.7b parameters, at the cost of failing nearly all pop-culture questions and other pockets of knowledge outside the MMLU, or (3) contamination.
MMLU is, and will always be, scaled with parameter count. Also, Yi-34b DOES NOT have a legitimate MMLU of 77. Intentional contamination is highly unlikely. They simply did what Microsoft did with Phi and omitted pockets of information not covered by the MMLU. Sure enough, in my testing it didn't score any better than 7b models in areas like pop-culture knowledge. So when it comes to broad common knowledge, Yi-34b's true MMLU is 70 or less.
So when it comes to broad common knowledge, Yi-34b's true MMLU is 70 or less.
If you cannot substantiate this claim, then it constitutes a deliberate attack based on your subjective perception. This is tantamount to a witch hunt. I will not respond to your accusatory statements again until you calm down.
@Phil337 is a troll who has nothing better to do than whine and complain about all the hard work we do and give away for free. Don't fret about his criticism.
@ehartford Chiming in to inform others that I'm a troll, when they can determine that for themselves, lacks sincerity. You're hurt because I called you out once, to which you responded with a series of childish memes. You may be right about some things. But you know I'm not a troll.
Does this 34b LLM have an MMLU of 85.6, or anywhere near that? Of course not. That's why I'm here. Not to troll.
I genuinely appreciate your work ehartford, but I've seen how you interacted with numerous people, and you're a toxic narcissist. You didn't chime in here for any other reason but because you were butt hurt by me.
Actually my reason was to let @JosephusCheung know that he's not the only one who has put up with your nonsense.
@ehartford So, he released a Yi-34b that scored 85.6 on the MMLU, and my shining a light on that impossibly high score is "nonsense"?
You're right. I did a very similar thing with you and a couple of others out of thousands, but for good reasons. In your case, you took a Mistral that scored an absurd 77 on the leaderboard, significantly higher than Mixtral Instruct, Llama 2 70b and other far larger and more performant LLMs. It was clearly too good to be true. But instead of testing Experiment26 you rushed to spread it around by fine-tuning it. That wasn't smart.
My primary motive was to nip this in the bud before it spread like a virus, with an endless stream of high-scoring, low-performing LLMs produced by countless mergings and fine-tunings of whatever happens to be at the top of the leaderboard.
I tested it, and it was a horrible performer. It not only got simple logic problems wrong, but stubbornly refused to accept that it had made an error. The same thing happened with basic facts. For example, it said Tom Hanks starred in a movie that Bruce Willis starred in, then kept insisting I was wrong and fabricating reasons it was right (e.g. Bruce Willis was initially asked, but dropped out due to prior obligations). So then I asked the LLM existential questions, like whether it ever admitted being wrong, and it flat out said "you won't catch me admitting error". Somehow making it a stubborn ass allowed it to artificially achieve a leaderboard score of 77, despite only having performance matching Mistrals that score around 67.
Anyways, long story short, my methods and personality may be unfortunately counterproductive, but I have always acted with the betterment of the HF community in mind. That's why I called you out when you rushed to fine-tune Experiment26 (I didn't want it to spread like a weed through hundreds of mergers and fine-tunes), and why I'm calling out a Yi-34b that cannot possibly have an MMLU of 85.6.
Go build something instead of bothering people who are building something.
@ehartford So you, as a builder, are OK with this 34b LLM's 85.6 MMLU? How about a 93.5 MMLU? How about a 100 MMLU? Are scores meaningless?
I can spend hours testing a 77-scoring Mistral that performs horribly, and I shouldn't report back? I should build, or shut up? If I'm spouting pointless word vomit, then ignore me. Stop replying. It would look silly if I kept posting all by my lonesome.
You're repeatedly responding not because I'm a troll or wrong. You know I'm on to something. You know this LLM didn't score anywhere near an 85.6 on the MMLU. You know Experiment26 didn't earn an aggregate score notably higher than Mixtral Instruct and other far larger and more performant LLMs. I don't have to build shit to call people out. Testing is all you need.
If you think there's a bug with the lm-evaluation-harness, go file it there.
Just want to inject some more data into the discussion:
These are score trends on progressively harder subsets of MMLU. Evidently this model scores far above the other top, higher-parameter models like miqu/mistral-medium. The CausalLM model isn't as reactive to increasing difficulty as the other models; this difference is especially stark in comparison to yi-34b-chat.
I'll leave the interpretation of the chart to you. I will just say that the arguments that MMLU tests narrow abilities, or that this SOTA-beating result is "not particularly meaningful", hold no water for me, and are frankly puzzling.
@sam-paech I don't understand what you mean. Are you talking about the relationship between the models' parameter counts and their scores, or about the pattern of your curve? If the latter, please elaborate on your expected results, because in your example the models that score higher also decline more slowly.
These are score trends on progressively harder subsets of MMLU. Evidently this model scores far above the other top, higher-parameter models like miqu/mistral-medium.
To be fair, you'd better compare models with similar MMLU scores.
The CausalLM model isn't as reactive to increasing difficulty as the other models; this difference is especially stark in comparison to yi-34b-chat.
In your curve, models with similar MMLU scores have similar trends - or else I don't understand your intention.
Therefore, it is normal for the rate of change of the curve's slope to exhibit grouping related to MMLU scores. Or, to put it bluntly, your image does not illustrate any problem: you are simply pointing out that different models have different MMLU scores, which in turn yields different curves across the difficulty levels of your subset test.
To put it even more bluntly, the difficulty level cannot be generalized for models with different overall accuracy rates - it is determined by the way you filter the difficult subset.
see Texas sharpshooter fallacy
I sincerely urge you to avoid engaging in discussions with a presumption of guilt. When you present an image, I hope you can explain your intended message rather than leaving a potentially misleading image without comment and claiming to be data-driven.
@JosephusCheung You left up, and are shamelessly defending, a beyond absurd MMLU score of 85.6. I'm officially accusing you of deliberate contamination/cheating.
@JosephusCheung I believe I already made clear that I am not making any accusations or presumptions of guilt, and I think such talk is unhelpful. We are just discussing the results of your model. I'm not going to get in the way of you & Phil duking it out, but I hope you can separate that from the discussion you and I are having. If you can't move past the persecution narrative, we are going to have to end the discussion.
To answer your questions:
please elaborate on your expected results.
I would expect to see a curve that looks like yours in the case of:
a. An extremely strong SOTA-beating generalist model
b. A contaminated model
For a Yi-34b-chat fine-tune I would expect the curve to roughly follow Yi-34b-chat's curve, give or take a bit. All the Yi-34b fine-tunes I've tested in this way have downtrended sharply at the end, relative to higher-param models. This is a typical "correction" in the direction of param size that I see on these charts (this is also why the 7b models you circled on the leaderboard screenshot above have lower MAGI scores relative to EQ-Bench: because MAGI is brutal and corrects in the direction of param size). However, your model trends the opposite way, i.e. it trends down at a lower rate than higher-param models.
the difficulty level cannot be generalized for models with different overall accuracy rates - it is determined by the way you filter the difficult subset.
It can, though, and this analysis gives reasonable score trends for every model but this one. I've included a large selection of varied models in the analysis to determine what constitutes a "difficult" question, and taken care to control for sources of bias. I'll have a write-up on the methodology soon so you can see how it works. In any case, your result here is as much an outlier as it appears, both in terms of raw score and in the shape of its trend as it approaches the hardest subsets.
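For illustration, here is one defensible way to build such a subset, with the candidate model held out of the pool that defines "difficult" (which is also the standard guard against the Texas-sharpshooter concern raised above); the data structures are hypothetical, not the actual MAGI pipeline:

```python
import numpy as np

def hard_subset(results: dict, candidate: str, quantile: float = 0.9) -> np.ndarray:
    """Indices of the hardest questions, judged by the reference pool only.

    `results` maps model name -> boolean array of per-question correctness.
    """
    pool = np.array([v for k, v in results.items() if k != candidate], dtype=float)
    miss_rate = 1.0 - pool.mean(axis=0)     # fraction of reference models wrong
    return np.where(miss_rate >= np.quantile(miss_rate, quantile))[0]
```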
To be clear: If this is a genuine SOTA result and not a product of contamination, then all power to you, I want to see the dataset & methodology when you are ready to present them, so that everyone can benefit from your work.
@JosephusCheung You left up, and are shamelessly defending, a beyond absurd MMLU score of 85.6. I'm officially accusing you of deliberate contamination/cheating.
Whoever makes the claim provides the evidence - see the burden of proof.
@JosephusCheung You left up, and are shamelessly defending, a beyond absurd MMLU score of 85.6. I'm officially accusing you of deliberate contamination/cheating.
"Officially" lol. As if you are an official. Funny guy.
To be clear: If this is a genuine SOTA result and not a product of contamination, then all power to you, I want to see the dataset & methodology when you are ready to present them, so that everyone can benefit from your work.
This is precisely the issue I am trying to address - I cannot post anything in this atmosphere, where it feels like I am being coerced into proving my innocence. In reality, the burden of proof lies with the questioner to provide valid arguments, and that is what I am hoping to receive here.
Everything I have posted is simply part of my exploration process in investigating further commercialization technology options for open-source LLMs - including my previous work on the zero-loss conversion of heterogeneous models in model architecture (a mathematical method to remove Qwen1.0's QKV bias without performance loss), as well as my recent exploration of continuous pre-training processes.
I need to clarify that I am not trying to prove anything to anyone - nor do I have any need to seek funding through this method. I have neither the need to publish papers nor the necessity to secure investment through performance.
It is precisely out of this concern that I anticipate that, after I publish more potentially controversial content, some individuals may engage in unproductive disputes over irrational details. I want to address this issue proactively - by discussing it with those who have raised objections, in order to prevent more serious, unfounded accusations from arising later.
@JosephusCheung You hide your training data and methods, then expect others to find definitive proof before they're allowed to even accuse you of cheating? There is far less than a 0.01% chance you achieved an MMLU of 85.6 by legitimate means, hence my accusation.
You have no idea how excited I would be if you had found a legitimate way to get an MMLU of 85.6 from a 34b model. And this isn't personal. You went through a long hostile back-and-forth without turning toxic, plus you seem smart. It's just that I know, after testing hundreds of LLMs, reading dozens of papers, taking part of the MMLU test, and so on, that this LLM does not have an MMLU anywhere near 85.6. And I also know you're smart and educated enough to realize the same. Hence I'm forced to conclude you deliberately contaminated this model.
because MAGI is brutal and corrects in the direction of param size
I do not understand why you keep mentioning the absolute number of parameters. What do you expect to gain from models with more parameters, such as Falcon 180B and Grok?
In my opinion, it is you who stubbornly believes that MMLU is special and cannot achieve a parameter-decoupled improvement in the same manner as GSM8K. This is your preset position, and I do not understand its necessity.
For a Yi-34b-chat fine-tune I would expect the curve to roughly follow Yi-34b-chat's curve
I do not understand the logic here, especially considering the overall shift in the pre-training data caused by continual pre-training. I have used extensively rewritten versions of pre-training corpora to optimize performance, including the complete Wikipedia and a vast amount of WebCrawl text, rewritten using long-context models that truly reach the current SOTA level. Obviously, from the perspective of LLMs as text-compression models, the overall quality of the model has undergone an efficient shift.
@ehartford Get lost. My potentially inappropriate use of the word "officially" has absolutely no relevance to anything being discussed here. I find your refusal to even indirectly address the elephant in the room (an MMLU score of 85.6), in favor of everything else you wrote in this discussion, very telling.
I've presented the data & arguments that I have; the only intention I had was to highlight the anomaly and hopefully induce discourse in the direction of finding an explanation. Beyond what's been discussed here, I'm not sure we can determine anything further without more details.
The results I showed are suggestive not definitive; what would be definitive would be to see the dataset & methods and have them reproduced. I mean, if it were me I would be excited to prove my SOTA result. I truly don't get why you keep saying that you can't release under this kind of scrutiny. If your model has a SOTA result -- particularly by such a leap as yours -- you are going to get scrutiny! It's a good thing.
You hide your training data and methods, then expect others to find definitive proof before they're allowed to even accuse you of cheating? There is far less than a 0.01% chance you achieved an MMLU of 85.6 by legitimate means, hence my accusation.
No, I didn't. Thank you for raising your doubts.
I believe I have already publicly shared a data subset for continual pre-training. I only disclosed one topic, considering its scarcity in some widely used pre-training corpora (e.g., RedPajama). The released subset should be the largest synthetic dataset on Hugging Face that is entirely new and generated by GPT4-32K/3.5-16K - larger even than excerpted datasets like the openhermes 2.5 data.
I am happy to release more subsets and full models in the future, but I want to address these concerns first. Given that the cost of the released topic subset is around $25K, I believe this is already quite generous. I am certainly willing to continue releasing more data, under conditions of review and prevention of direct profiteering by third parties, but I need a positive atmosphere.
As for the methods you mentioned that I am hiding, they seem non-existent. The approach is simple: train on high-quality data.
I've presented the data & arguments that I have; the only intention I had was to highlight the anomaly and hopefully induce discourse in the direction of finding an explanation. Beyond what's been discussed here, I'm not sure we can determine anything further without more details.
The results I showed are suggestive not definitive; what would be definitive would be to see the dataset & methods and have them reproduced. I mean, if it were me I would be excited to prove my SOTA result. I truly don't get why you keep saying that you can't release under this kind of scrutiny. If your model has a SOTA result you are going to get scrutiny! It's a good thing.
My model is public here. The synthetic dataset I created is very costly, and I need to avoid: 1. Direct sale or profiteering; 2. Public backlash due to failure to proactively filter potential contamination. I have released a subset that is difficult to directly profit from, which you can also refer to. I have performed similar operations on a much larger amount of text.
@JosephusCheung I'm genuinely hoping that when the dust settles that you legitimately raised the MMLU score to 85.6. But with standard methods and training data, that simply isn't possible.
@JosephusCheung I'm genuinely hoping that when the dust settles that you legitimately raised the MMLU score to 85.6. But with standard methods and training data, that simply isn't possible.
I understand that you are not acting out of malice; I have understood this from the very beginning. However, I cannot guarantee whether this is another MetaMathQA-style deconstruction of GSM8K, or what some might perceive as sacrilege. This is why I refrain from boasting about this MMLU score: in my view, it only reflects the fragility of the MMLU score itself - it is not a gold standard.
@JosephusCheung Thanks. I did several searches of that dataset and there clearly isn't any contamination. All that comes back is details about various animated shows. But there also isn't anything in it that would raise the MMLU by even 0.1 points.
@JosephusCheung Thanks. I did several searches of that dataset and there clearly isn't any contamination. All that comes back is details about various animated shows. But there also isn't anything in it that would raise the MMLU by even 0.1 points.
You have misunderstood my intention. I am merely showcasing a "harmless" subset to prevent anyone from directly profiting from it inappropriately. Of course, this topic is also relatively scarce in other large corpora. For me, publicly releasing the complete data wouldn't cause much harm - the money is already spent - but I am unwilling to see someone use a method that "can improve MMLU scores by 5%~8%" (a reproducible 5% improvement on Command-R 35B) to deceive investors, manipulate community sentiment, and thereby reap exorbitant profits. My refraining from boasting about it doesn't mean others will do the same. If MMLU scores suddenly inflate like GSM8K's as a result, I believe none of us would benefit.
Or, I believe you understand the reasons why GSM8K and HumanEval are unreliable - not due to contamination, but because of their inherent vulnerability.
- Public backlash due to failure to proactively filter potential contamination.
If you truly want this resolved you can just be fully transparent so this can be determined independently. If you are worried that it's contaminated you should probably have said so from the beginning, and on the model card.
These are just suggestions. I'm not your boss.
If you truly want this resolved you can just be fully transparent so this can be determined independently. If you are worried that it's contaminated you should probably have said so from the beginning, and on the model card.
If you have ever attempted to create a synthetic dataset, you would understand the challenges involved. Verbatim repetition is obviously impossible, but rewrites caused by web text and content generated by GPT-series models during synthesis are beyond my control. Detecting such issues is no less difficult than synthesizing the data itself. You are asking a bit much from me.
Therefore, I have provided the model's own contamination detection result, which I believe is safe.
In other words, I think the level of contamination - similar to other common pre-training datasets - is acceptable.
If your model has a SOTA result -- particularly by such a leap as yours -- you are going to get scrutiny! It's a good thing.
I don't think everyone cares only about fame - if I were the author of MetaMathQA, considering the impact on GSM8K, I would never have released that data.
rewrites caused by web text and content generated by GPT-series models during synthesis are beyond my control. Detecting such issues is no less difficult than synthesizing the data itself. You are asking a bit much from me.
Did you log the content that was fed into the synthesis pipeline?
@JosephusCheung I've given a few LLMs a shot, and right now CausalLM-34B is the one that fits my needs best. App users aren't too concerned about MMLU scores; they just want to know if a model performs well for them personally. And for now, CausalLM-34B has been quite effective in my experience.
Frankly, I truly hope you keep refining this model, because as more people use it in real applications, an ecosystem will naturally form around it, which is essential.
Also, I'd advise against making training datasets easily accessible. You know, app users can fine-tune with LoRA or similar methods to accomplish tasks without needing open datasets; maybe most requests for data are just driven by competition...
I don't think anyone needs my take on this discussion, but this is the internet, so whatever I guess.
I think it's important to take all forms of skepticism and concern seriously when it comes to fine tuning LLMs, due to the enormous amount of training data involved.
And I don't think it is reasonable to respond to that criticism with aggression and toxicity.
And I think it's important to remember that all this criticism comes from the fact that we don't want people to waste their time and money on LLMs that cheat benchmarks but perform worse in real use. @JosephusCheung and @ehartford might just end up wasting money and resources without responding to these concerns, and so will many others deceived by these benchmarks, if we don't take user feedback as seriously as, if not more seriously than, benchmarks. And as a Free and Open Source enthusiast, I think releasing more training data, or the exact data-generation pipeline and the source data it takes, is the best way to respond: to allow open research and to let us present our findings, and maybe potential solutions.
PS: pinging @Phil337 to make sure he knows that JosephusCheung didn't show him the full dataset; see the comments above. That still means MMLU test questions could have ended up in the final dataset.
@nlpguy I'm with you. The best path forward is for all methods and training data to be made public.
The counterargument is very strong. It's unfair to spend a large amount of time, effort and money curating a quality set of synthetic data, and then to have others immediately use it, depriving you of the uniqueness all that cost and effort brought.
But when things are done behind closed doors contamination inevitably creeps in, because without a large number of eyes scanning the data it's all but impossible to ensure clean data, as JosephusCheung pointed out earlier.
Perhaps a delayed release of all methods and data within a month or two would be a good compromise.
And frankly, I keep making things worse because of my harsh personality, and I even had to delete a couple dozen comments after crossing the line. This is part of the reason why others like ehartford are quick to anger with me. It's not just because they object to constructive criticism and feedback.
After a discussion with a companion today, I arrived at a simple solution: we just need to find a way to lower the MMLU score to a level that no one will object to, for example by forcing all responses into CoT format instead of sometimes giving answers directly, thereby reducing the false-positive accuracy.
We are all sinners, Happy Easter.
Come on now! Why end the discussion in such bad faith? :(
You told Phil that you understand that he is not acting out of malice, and I'm not even sure you read his response.
Perhaps a delayed release of all methods and data within a month or two would be a good compromise.
Doesn't that sound reasonable, at least the idea if not the timeframe?