A Deepdive into Aya Expanse: Advancing the Frontier of Multilinguality

Published October 24, 2024

This is a guest blog post by the Cohere For AI team. Cohere For AI is Cohere's research lab that seeks to solve complex machine learning problems.

With the release of the Aya Expanse family, featuring 8B and 32B parameter models, we are addressing one of the most urgent challenges in AI: the lack of highly performant multilingual models that can rival the capabilities of monolingual ones. While AI has made tremendous progress, there remains a stark gap in model performance across languages. Aya Expanse is the result of several years of dedicated research at C4AI spanning data arbitrage, multilingual preference training, safety tuning, and model merging.

These combined breakthroughs establish new state-of-the-art multilingual performance. We evaluate our models on a suite of benchmarks, including the Arena-Hard-Auto dataset translated into the 23 languages covered by the models, which we release for others to use. In pairwise comparison, Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B, a model more than 2x its size, setting a new state of the art for multilingual performance. We also release Aya Expanse 8B, which outperforms the leading open-weights models in its parameter class, such as Gemma 2 9B, Llama 3.1 8B, and the recently released Ministral 8B, with win rates ranging from 60.4% to 70.6%. We observe even larger gains on less challenging evaluations.

Figure: Aya Expanse 8B win rates; language-specific win rates vs Gemma 2 9B.

We release both models as open weights for the research community and hope they will further accelerate multilingual progress. In this blog post, we share technical details behind each of the key algorithmic components used in the training pipeline.

Figure: Aya Expanse 32B win rates.

Avoiding Model Collapse in Synthetic Data

The use of synthetic data – data generated by an expert or “teacher” model to train another model – has become increasingly central to the development of LLMs, particularly as model training has exhausted current data sources. However, for multilingual data, and especially for low-resource languages, there are few good teacher models available, creating an added challenge in leveraging synthetic data. Furthermore, recent research has suggested that an over-reliance on synthetic data leads to model collapse.

In our recent work, we demonstrate that these limitations can be addressed through “data arbitrage” – strategically sampling from a pool of teacher models. This approach has important implications, as it challenges the traditional reliance on a single teacher model for generating synthetic data. Instead, data arbitrage leverages performance variations among a pool of models. Although this technique is applicable to any domain, it is particularly suited to the multilingual setting, where the absence of a universally effective teacher that excels across all languages presents significant challenges. In the creation of high-quality synthetic multilingual datasets, multilingual arbitrage proves valuable by utilizing a diverse pool of models to strategically sample different parts of the data distribution for improved multilingual generations.

We first train a pool of models for groups of languages and employ an Arbiter to evaluate and select the optimal generation. The Arbiter here is an internal reward model (RM) that scores the model generations. In Reward-Based Routing, for each prompt in a given language, we generate completions from all models in the pool and score them with the reward model. The completion with the highest score is chosen as the final completion for that prompt. Our 8B model, even at the SFT stage trained with Multilingual Arbitrage, achieved over a 9.1% improvement in win rate against Gemma 2 9B compared to the previous Aya 23 model, demonstrating the effectiveness of this approach in leveraging diverse model strengths across languages.
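To make the routing step concrete, here is a minimal Python sketch of Reward-Based Routing. The `model_pool` and `reward_score` callables are hypothetical placeholders standing in for the teacher models and the internal reward model, not the actual Aya Expanse implementation.

```python
# Illustrative sketch of reward-based routing (not the exact Aya Expanse code).
from typing import Callable, Dict


def route_by_reward(
    prompt: str,
    model_pool: Dict[str, Callable[[str], str]],  # teacher name -> generate fn
    reward_score: Callable[[str, str], float],    # (prompt, completion) -> RM score
) -> str:
    """Generate a completion from every model in the pool and keep the
    highest-scoring one, as in reward-based routing."""
    completions = {name: generate(prompt) for name, generate in model_pool.items()}
    best_model = max(completions, key=lambda name: reward_score(prompt, completions[name]))
    return completions[best_model]


# Usage: build one synthetic SFT example per prompt from the arbitrated pool.
# synthetic_data = [
#     {"prompt": p, "completion": route_by_reward(p, model_pool, reward_score)}
#     for p in prompts_in_language
# ]
```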

Figure: Step-by-step improvements in win rates against Gemma 2 9B.

Iteratively Improving with Global Preferences

Following supervised fine-tuning, alignment to human preferences is a key step in training today’s state-of-the-art LLMs. Although widely adopted, preference training is known to be challenging even in a monolingual setting, and maximizing its gains in a multilingual setting introduces further challenges. The vast majority of existing preference datasets are exclusively in English, and the few existing multilingual preference datasets are often of low quality. Moreover, modeling many diverse languages simultaneously is a known difficult optimization problem, where naively optimizing for performance in some languages often leads to regressions in others.

In RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs, we leverage a novel synthetic data generation technique to construct high-quality multilingual preference data pairs by contrasting in-language completions from a highly performant multilingual LLM with lower-quality completions translated from English that were generated by a weaker model. This steers our model away from generating low-quality multilingual completions, which often contain undesirable artifacts such as those introduced by poor translation. We show that this method unlocks substantial gains in performance across all languages and often also results in gains for languages not included in the preference training data.
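As an illustration, a minimal sketch of this preference-pair construction is shown below; the `strong_multilingual_generate`, `weak_english_generate`, and `translate_to_language` helpers are assumed placeholders for the models involved, not the exact pipeline.

```python
# Minimal sketch of the multilingual preference-pair construction described above.
from typing import Callable, Dict


def build_preference_pair(
    prompt_in_language: str,
    prompt_in_english: str,
    strong_multilingual_generate: Callable[[str], str],  # strong multilingual LLM
    weak_english_generate: Callable[[str], str],         # weaker English-only model
    translate_to_language: Callable[[str], str],         # EN -> target-language translation
) -> Dict[str, str]:
    """Pair an in-language completion from a strong multilingual model (chosen)
    with a translated completion from a weaker English model (rejected)."""
    chosen = strong_multilingual_generate(prompt_in_language)
    rejected = translate_to_language(weak_english_generate(prompt_in_english))
    return {"prompt": prompt_in_language, "chosen": chosen, "rejected": rejected}
```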

While this work also shows that preference training with online data outperforms its offline variant, during the training of Aya Expanse we found that first preference-training with offline data and then preference-training with online data works better than either alone. In the first preference training stage, we train on data curated by taking the highest- and lowest-reward responses from the Arbitrage stage as the chosen and rejected completions, which makes the first stage of DPO training offline.
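A small sketch of how such offline pairs could be curated from the reward-scored arbitrage generations follows; the field names are illustrative, not the actual data schema.

```python
# Hedged sketch: turning reward-scored pool generations from the arbitrage stage
# into offline DPO pairs (highest-reward completion as chosen, lowest as rejected).
from typing import Dict, List


def arbitrage_to_dpo_pair(prompt: str, scored: List[Dict]) -> Dict[str, str]:
    """`scored` holds {"completion": str, "reward": float} entries produced by
    the model pool and reward model for a single prompt."""
    ranked = sorted(scored, key=lambda x: x["reward"])
    return {
        "prompt": prompt,
        "chosen": ranked[-1]["completion"],   # highest-reward completion
        "rejected": ranked[0]["completion"],  # lowest-reward completion
    }
```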

After offline preference training, we run online iterative DPO: we sample multiple online generations for each prompt from the model trained in the previous iteration, rank these generations with a reward model, and then further train on these preference pairs. For both models, we repeat this process for 3 iterations, as going beyond 3 iterations led to minimal gains at the cost of additionally re-tuning hyperparameters such as the regularization coefficient (beta), and sometimes introduced reward-hacking behavior. Overall, for Aya Expanse 8B, the combination of offline and online preference training on top of the model trained with arbitrage led to 7.1% additional gains in win rate against Gemma 2 9B.
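The loop below sketches this iterative online DPO procedure under stated assumptions: `sample_k`, `score`, and `dpo_train` are hypothetical wrappers around generation, the reward model, and a DPO trainer (for example, TRL's DPOTrainer could fill that last role), and the sample count and beta value are illustrative.

```python
# Schematic sketch of iterative online DPO, not the exact Aya Expanse training code.
from typing import Callable, List


def iterative_online_dpo(
    model,
    prompts: List[str],
    sample_k: Callable,       # (model, prompt, k) -> list of k online completions
    score: Callable,          # (prompt, completion) -> reward model score
    dpo_train: Callable,      # (model, pairs, beta) -> updated model
    num_iterations: int = 3,  # gains beyond 3 iterations were minimal
    beta: float = 0.1,        # illustrative regularization coefficient
):
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            generations = sample_k(model, prompt, 4)
            ranked = sorted(generations, key=lambda g: score(prompt, g))
            pairs.append({"prompt": prompt,
                          "chosen": ranked[-1],   # best-ranked online sample
                          "rejected": ranked[0]}) # worst-ranked online sample
        model = dpo_train(model, pairs, beta)
    return model
```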

Maximizing Performance through Model Merging

A recurring problem throughout any post-training (and pre-training) pipeline, whether it consists of a single stage such as SFT or a more complex multi-stage optimization pipeline like ours above, is choosing the right data mixtures for training. The intricacies of this process demand considerable effort in tuning hyperparameters and data combinations. Merging multiple models is an alternative approach that enables complex multi-tasking at a reduced aggregate computational cost. In Aya Expanse, we build directly on the findings of our recent research paper Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning and apply merging both in the Arbitrage phase and at each iteration of preference training.

When training multiple separate models with the goal of merging them, it is important to maximize diversity between checkpoints while ensuring that each individual model within the pool achieves high performance. We balance these objectives by training models on different language families: cross-lingual transfer often provides significant performance benefits, while linguistic differences provide sufficient differentiation between checkpoints.

Naively, one could split-train a model for each language and then merge, but this does not achieve the same benefits we observe from cross-lingual transfer. To improve robustness in merging, we include some shared languages in each cluster (here English, Spanish, and French). In the final recipe, we merged, over multiple stages, runs trained on different clusters of data as well as checkpoints within the same run.
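For illustration, the clusters might look like the sketch below. The exact language groupings used for Aya Expanse are not spelled out here, so these assignments are assumptions; the point is that every cluster shares English, Spanish, and French.

```python
# Illustrative language-family clusters for split-training before merging.
# These groupings are assumptions, not the actual Aya Expanse clusters, and
# do not exhaustively list all 23 covered languages.
SHARED = ["English", "Spanish", "French"]  # shared across every cluster for merge robustness

clusters = {
    "romance_germanic": SHARED + ["Portuguese", "Italian", "Romanian", "German", "Dutch"],
    "slavic":           SHARED + ["Russian", "Ukrainian", "Polish", "Czech"],
    "east_asian":       SHARED + ["Chinese", "Japanese", "Korean", "Vietnamese"],
    "indic_mea":        SHARED + ["Hindi", "Arabic", "Turkish", "Persian", "Hebrew"],
}
```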

In addition to weighted linear averaging, we experimented with multiple merging techniques, namely SLERP, TIES-merging, and DARE-TIES. However, we found weighted averaging to be the most consistent method, so we use it throughout the pipeline. Interestingly, we observed significantly larger gains from merging at the 35B scale than at the 8B scale – up to 3x – which is in line with recent work suggesting that merging is more effective at scale.
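Below is a minimal sketch of weighted linear averaging over PyTorch state dicts, the merging method we found most consistent. The checkpoint paths and weights are illustrative, not the values used for Aya Expanse.

```python
# Minimal sketch of weighted linear averaging of checkpoints (illustrative only).
import torch


def weighted_average_merge(state_dicts, weights):
    """Merge checkpoints by a weighted average of their parameters."""
    assert len(state_dicts) == len(weights)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so the weights sum to 1
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Usage: merge per-cluster checkpoints into a single model (paths and weights assumed).
# merged_sd = weighted_average_merge(
#     [torch.load(path, map_location="cpu") for path in checkpoint_paths],
#     weights=[0.3, 0.3, 0.2, 0.2],
# )
```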

Bringing it all Together

Figure: Post-training pipeline components.

These diagrams show our end-to-end post-training pipeline, which produced the step-by-step gains discussed earlier. It is truly special to look back and see how far the Aya model series has come, from its inception with Aya 101, accompanied by the Aya Collection, which stretched the limits of open-source collaboration, to this release, which combines steady progress on key open fundamental research questions to set a new standard for multilingual performance.

Figure: Combined post-training pipeline.

Acknowledgements

This work wouldn’t have been possible without the core Aya Expanse team: Madeline Smith, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis, Sara Hooker, John Dang, Shivalika Singh, Arash Ahmadian, Daniel D'souza, Alejandro Salamanca, Aidan Peppin, Arielle Bailey, Meor Amer, Sungjin Hong, Manoj Govindassamy, Sandra Kublik.

It also wouldn’t have been possible without the wider Cohere For AI and Cohere team. Special thanks to Acyr Locatelli, Adrien Morisot, Jon Ander Campos, Sara Elsharkawy, Eddie Kim, Julia Kreutzer, Nick Frosst, Aidan Gomez, Ivan Zhang.

A huge thanks also goes to our research community – the 220 language ambassadors from around the world who have been part of this release. Thank you to Sree Harsha Nelaturu, Bhavnick Minhas, Christopher Klamm, Isabella Bicalho Frazeto who contributed notebooks that are accessible on the model Hugging Face cards.

Special thank you to Hugging Face for helping make this come together: Omar Sanseviero, Pedro Cuenca, Vaibhav Srivastav, Lysandre Debut, Aritra Roy Gosthipaty.

References