Navigating Korean LLM Research #1: Models

Published October 22, 2024

Large language models (LLMs) have become a major area of research globally. Unfortunately, much of this research has centered on tier-1 languages like English and Chinese, leaving a gap in the development of multilingual LLMs with broader coverage.

Accordingly, multilingual LLM research is an active field, with researchers from diverse backgrounds working to advance language models in their own languages. A key part of this work is monitoring developments across different languages, since some lessons transfer and can save time and resources. In other cases, findings need to be re-validated in a new language to check whether they generalize. Either way, reviewing all the relevant material for a new language can be drudgery.

To address this, I’m planning to create a wiki of sorts for Korean LLM research, or perhaps a lightweight survey-style blog, to provide reference material for researchers working on similar efforts in other languages. This will be a series of three posts: (1) Models, (2) Evaluation Tools, and a final post that provides an overview of the performance of the models from the first post using the evaluation tools from the second.

Models

Korean LLMs generally fall into three main categories:

  1. Korean-Centric: LLMs pretrained from scratch with Korean as a primary language in mind.

  2. Multilingual: LLMs trained on large multilingual datasets, officially supporting Korean alongside many other languages.

  3. Korean-Continual Pretrained: Multilingual LLMs that have undergone additional pretraining on Korean corpora.

Korean-Centric

LLMs pretrained from scratch with Korean as a primary language in mind

Although efforts to pretrain language models on languages other than English and Chinese are still relatively rare, some enthusiastic big-tech companies (like Naver, KT, and LG) and bold community initiatives (such as EleutherAI and KIFAI) have successfully delivered Korean-specific models, each offering its own lessons.

Challenges of Korean-only Pretraining

For a time, efforts to create Korean language models remained relatively limited in scale. Notable examples include Polyglot-Ko, Gecko-7B, and 42dot_LLM.

Polyglot-Ko was developed by EleutherAI, in collaboration with several Korean startups, as an open-source project to create Korean-specific models. It comes in four sizes (1.3B, 3.8B, 5.8B, and 12.8B), trained exclusively on Korean data. These were the first open models for Korean, and I'd say they are roughly equivalent to GPT-J. However, just as the English-speaking community learned that training to the Chinchilla-optimal budget isn't sufficient for a practically useful model, we found the same with Polyglot-Ko. Figure 1 compares the training budget of Polyglot-Ko against the Chinchilla scaling laws: the smaller variants were trained beyond the Chinchilla-optimal token count while the larger ones fell short of it, yet all of them performed poorly, limiting their practical applications. The small training budget, however, was unavoidable at the time due to limited compute and the scarcity of Korean-language corpora. Though I'm not privy to exact numbers, it's likely the team had access to fewer than 300B tokens. Ultimately, Polyglot-Ko highlighted that Korean-only pretraining is inherently difficult.


Figure 1: Comparison of the training budget for Polyglot-Ko against Chinchilla Scaling Laws
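To make the gap in Figure 1 concrete, here is a rough back-of-the-envelope sketch. It uses the common ~20 tokens-per-parameter approximation of the Chinchilla-optimal budget, and the 300B figure is simply the upper bound on available Korean data guessed above, not an official number from the Polyglot-Ko team.

```python
# Back-of-the-envelope: Chinchilla-optimal token budgets for the Polyglot-Ko sizes,
# using the common ~20 tokens-per-parameter rule of thumb (Hoffmann et al., 2022).
# AVAILABLE_TOKENS is the rough <300B upper bound assumed above, not an official figure.
TOKENS_PER_PARAM = 20
AVAILABLE_TOKENS = 300e9

model_sizes = {"1.3B": 1.3e9, "3.8B": 3.8e9, "5.8B": 5.8e9, "12.8B": 12.8e9}

for name, params in model_sizes.items():
    optimal = params * TOKENS_PER_PARAM
    print(f"Polyglot-Ko-{name}: Chinchilla-optimal ≈ {optimal / 1e9:.0f}B tokens; "
          f"a <300B corpus covers at most {AVAILABLE_TOKENS / optimal:.1f}x that budget")
```

The arithmetic lines up with the figure: with fewer than 300B tokens the smaller variants could be trained well past the Chinchilla-optimal point, while the 12.8B variant barely reaches it.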

Following this are Gecko-7B and 42dot_LLM, both trained on mixed corpora consisting of Korean, English, and code. Gecko-7B is another open-source project, built by two researchers and trained on a 200B-token dataset. 42dot_LLM, as far as I know, was the first non-community effort and the first Korean model trained on over 1 trillion tokens. Unfortunately, all three models (including Polyglot-Ko) perform only on par with non-Korean LLMs released around the same time, such as Llama-2 or Qwen-1 (see the table below), which somewhat diminishes the significance of Korean-centric pretraining efforts.

| Model Name | Release Date | KMMLU |
| --- | --- | --- |
| Random Baseline | - | 25.00 |
| **Korean LLMs** | | |
| Polyglot-Ko-12.8B | 2023.04 | 29.26 |
| 42dot_LLM 1.3B | 2023.09 | 24.01 |
| Gecko-7B | 2024.05 | 30.70 |
| **non-Korean LLMs** | | |
| Llama-2-7B | 2023.07 | 25.00 |
| Llama-2-13B | 2023.07 | 31.26 |
| Qwen-1-7B | 2023.09 | 18.52 |
| Qwen-1-14B | 2023.09 | 30.92 |

Larger Scale Korean-Centric LLMs

We now see a new generation of LLMs from big tech companies like Naver and LG that delivers significantly better performance. HyperCLOVA X, a proprietary model from Naver, and EXAONE-3-7.8B from LG AI Research (available on Hugging Face) are standout examples. Both models are trained on mixed corpora of Korean, English, and code, but at a much larger scale than earlier efforts; for instance, EXAONE-3-7.8B is trained on 8 trillion tokens, a substantial leap. These models not only match some of the leading English LLMs on English benchmarks but also outperform them on Korean tasks.


Figure 2: Performance of EXAONE-3-7.8B from "EXAONE 3.0 7.8B Instruction Tuned Language Model."
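If you want to try the open-weight model yourself, below is a minimal loading sketch with Hugging Face transformers. The repository id and the trust_remote_code flag are assumptions on my part about how the model is distributed, so double-check the model card before running it.

```python
# Minimal sketch: loading the open-weight EXAONE model with Hugging Face transformers.
# The repository id and trust_remote_code are assumptions; confirm them on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16GB of weights; fits on a single 24GB GPU
    device_map="auto",
    trust_remote_code=True,      # EXAONE may ship custom modeling code
)

messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]  # "What is the capital of Korea?"
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```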

Multilingual

LLMs officially supporting Korean alongside many other languages

Today, a variety of multilingual LLMs officially support Korean alongside many other languages. Notable examples include Gemma-2 (Google), Llama-3 (Meta), Qwen-2.5 (Alibaba Cloud), Aya-23/Command-R (Cohere), and GPT-4/4o (OpenAI). All of these models perform well on Korean benchmarks, and in my experience they also demonstrate decent conversational ability in Korean. However, a common issue among many of them, with the exception of GPT-4/4o, is code-switching or illegal generation: the model responds in another language or blends characters from different languages even when prompted in Korean, most often by mixing in Chinese characters. Interestingly, the resulting text typically still makes sense when translated into Korean.


Figure 3: Example of Code-Switching from Command-R-Plus

The figure above shows a generation from Command-R-Plus. The yellow highlight marks a Chinese character mixed into a Korean response. Surprisingly, when you translate the Chinese character back into Korean, the sentence makes perfect sense. This suggests the mis-generation isn't entirely random: words with similar meanings in different languages appear to sit close together in the model's latent space, causing occasional confusion.
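If you want to quantify how often this happens in your own experiments, a coarse character-range check is enough to flag Han characters inside an otherwise Hangul response. A minimal sketch is below; note that legitimate Korean text can also contain Hanja, so treat this only as a rough heuristic.

```python
import re

# Unicode ranges: Hangul syllables/jamo vs. CJK Unified Ideographs (Han).
HANGUL = re.compile(r"[\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F]")
HAN = re.compile(r"[\u4E00-\u9FFF\u3400-\u4DBF]")

def flags_code_switching(text: str) -> bool:
    """Flag responses that mix Han characters into Korean text."""
    return bool(HANGUL.search(text)) and bool(HAN.search(text))

# In the spirit of Figure 3: a Korean sentence with one Chinese character mixed in.
print(flags_code_switching("이 문장에는 한자 字 가 섞여 있습니다."))  # True  ("This sentence has the Hanja 字 mixed in.")
print(flags_code_switching("이 문장은 순수한 한국어 문장입니다."))     # False ("This sentence is pure Korean.")
```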

Korean-Continual Pretrained

Multilingual LLMs that have undergone additional pretraining on Korean corpora

After all this experimentation, the community realized that while training models from scratch may yield the best results, a more sustainable approach is to continually pretrain an existing multilingual model. This adds Korean cultural knowledge and helps address the code-switching issue. Two key players led the earlier Korean continual-pretraining efforts.

The first is Beomi, who has open-sourced a wide range of Korean-adapted models using techniques such as vocabulary expansion, depth-up scaling, chat vectors, and multi-modal model merging. Some of his models are trained on up to 80B additional Korean tokens. He has also released intermediate checkpoints for several of his models, facilitating further research.
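Of these techniques, the chat vector is the easiest to illustrate: the weight delta between an instruction-tuned model and its base model is added on top of a Korean continually pretrained model, transferring chat behavior without Korean instruction tuning. The sketch below uses placeholder model ids and assumes all three checkpoints share the same architecture, parameter shapes, and tokenizer.

```python
# Minimal chat-vector sketch: korean_chat = korean_cp + (instruct - base).
# The model ids are placeholders; all three checkpoints must share the same
# architecture, parameter shapes, and tokenizer for this to make sense.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base-model", torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained("org/base-model-instruct", torch_dtype=torch.bfloat16)
korean_cp = AutoModelForCausalLM.from_pretrained("org/korean-cp-model", torch_dtype=torch.bfloat16)

with torch.no_grad():
    base_params = dict(base.named_parameters())
    inst_params = dict(instruct.named_parameters())
    for name, param in korean_cp.named_parameters():
        # Add the "chat vector" (instruct minus base) on top of the Korean weights.
        param.add_(inst_params[name] - base_params[name])

korean_cp.save_pretrained("korean-chat-model")
```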

On the other hand, EEVE-Korean, developed by Yanolja, a South Korean tech startup, demonstrates significant performance improvements with just 2B tokens of continual pretraining. Their approach is a step-by-step training method in which different parts of the model are selectively frozen during each phase, allowing for more efficient and targeted training. To help those aiming to re-implement this research, I released a Korean corpus of 2B tokens, so have a look.


Figure 4: Image from "Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models"
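The full multi-stage recipe is in the paper, but the core selective-freezing idea of one such stage can be sketched roughly as follows: expand the vocabulary with Korean tokens, then train only the newly added embedding rows while everything else stays frozen. The model id and the list of new tokens below are placeholders.

```python
# Sketch of one "selective freezing" stage in EEVE-style training: after expanding
# the vocabulary, only the newly added embedding rows receive gradient updates.
# The model id and token list are placeholders; the real recipe uses several stages
# that unfreeze different parameter groups.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("org/multilingual-base")
tokenizer = AutoTokenizer.from_pretrained("org/multilingual-base")

old_vocab = len(tokenizer)
new_korean_tokens = ["한국어_토큰_1", "한국어_토큰_2"]  # placeholder; EEVE adds thousands of subwords
tokenizer.add_tokens(new_korean_tokens)
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then re-enable gradients only for the input embedding,
# zeroing gradients for the original rows so only the new rows move.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
embed.weight.requires_grad = True

def mask_old_rows(grad):
    grad = grad.clone()
    grad[:old_vocab] = 0.0  # keep the original embedding rows frozen
    return grad

embed.weight.register_hook(mask_old_rows)
# From here, train as usual (e.g., with the Trainer); only new-token embeddings update.
```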

Lastly, there have been recent efforts to continually pretrain Llama-3.1-405B. While I haven’t tried it out yet, I’m sure it was an incredibly expensive project, and I’m excited to see the results.

Conclusion

While I’ve introduced a variety of models in this post, I’ve intentionally left out details regarding their performance. I felt it would be more appropriate to first introduce the different evaluation tools and benchmarks used for Korean language models before diving into their results. Therefore, I’ll be covering these topics in the next post, where I’ll summarize the key benchmarking tools currently used to evaluate Korean LLMs.