---
language:
  - it
pipeline_tag: text-generation
tags:
  - Italia
  - iGenius
  - pytorch
license: mit
license_link: https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1/blob/main/LICENSE
extra_gated_fields:
  First Name: text
  Last Name: text
  Email: text
  Country: country
  Company: text
  Role:
    type: select
    options:
      - Employee
      - Freelancer
      - Student
  geo: ip_location
  By clicking “Submit” below I accept the terms of the license and acknowledge that the information I provide will be collected, stored, processed and shared in accordance with the iGenius Privacy Policy: checkbox
extra_gated_description: https://www.igenius.ai/legal/transparency
extra_gated_button_content: Submit
widget:
  - example_title: Greetings
    messages:
      - role: system
        content: Il tuo nome è Italia. Sei un modello di linguaggio addestrato da iGenius su Leonardo, uno dei supercomputer più potenti al mondo.
      - role: user
        content: Ciao come stai?
  - example_title: Programming
    messages:
      - role: system
        content: Il tuo nome è Italia. Sei un modello di linguaggio addestrato da iGenius su Leonardo, uno dei supercomputer più potenti al mondo.
      - role: user
        content: Scrivi una funzione python che genera numeri random.
inference:
  parameters:
    max_new_tokens: 300
    temperature: 0.3
    stop:
      - <|assistant|>
      - <|user|>
      - <|system|>
---

# Italia 9B - Instruct v0.1

![Italia 9B](https://cdn.prod.website-files.com/650a9755f11b20f057c78e52/666c3554b3b2b87c59bd744e_666156724e914be6b8bb0ad9_modello-italia-launch-cover.jpeg "Italia 9B")

## Introduction

_For more details on Italia and iGenius, please visit our [website](https://www.igenius.ai/) and read our [release blog post](https://www.igenius.ai/press-releases/introducing-modello-italia-our-first-italian-foundational-llm-open-source)._

_Subscribe to our [newsletter](https://www.igenius.ai/?utm_source=huggingface&utm_medium=modelcard&utm_campaign=italia_9B#newsletter) to receive updates on our latest AI model advancements._

Italia is a family of open-source large language models developed by iGenius, designed for companies operating in the public and private sectors. The first model in the series is Italia 9B, a foundational LLM with a 9-billion-parameter Transformer architecture, developed in collaboration with Cineca and released under the MIT license.

The Italia family of models has been designed for companies operating in **highly regulated sectors**, such as financial services and public administration. Even in its first version, it is a unique LLM: although specialized in a single language, its high parameter count combined with the quality of the training process makes it well suited to the most critical enterprise use cases, where the reliability of generated content is of paramount importance.

As the name suggests, Italia has excellent linguistic capabilities in Italian. These cover not only vocabulary and sentence structure but also the cultural and historical knowledge of the country, which is essential for applications requiring advanced proficiency in the Italian language.

Data security and information reliability have always been priorities for iGenius. We have invested in building a high-quality Italian dataset to develop a truly open, transparent, and secure language model, in compliance with European AI regulations such as the AI Act.
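The widget configuration above implies a simple turn format with `<|system|>`, `<|user|>`, and `<|assistant|>` markers. The following is a minimal, unofficial inference sketch using `transformers`; it assumes the repository's tokenizer ships a chat template matching those markers, and it reuses the sampling parameters from the widget configuration.

```python
# Minimal inference sketch (not an official snippet). Assumes the tokenizer
# provides a chat template consistent with the <|system|>/<|user|>/<|assistant|>
# turn markers used as stop sequences in the widget configuration above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "iGeniusAI/Italia-9B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": "Il tuo nome è Italia. Sei un modello di linguaggio addestrato "
                   "da iGenius su Leonardo, uno dei supercomputer più potenti al mondo.",
    },
    {"role": "user", "content": "Ciao come stai?"},
]

# Build the prompt from the repo's chat template, then generate with the same
# sampling parameters used by the hosted inference widget above.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```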
**Terms of Use**: [link](https://secure.igenius.ai/legal/italia_terms_and_conditions.pdf) \
**Authors**: The iGenius Team \
**Model release date**: 04 July 2024 \
**Status**: This is a static model trained on an offline dataset. Future versions of the fine-tuned models will be released as we improve model safety based on community feedback. Visit the [iGenius website](https://igenius.ai) for more information and updates.

## Hardware and Software

Thanks to our partnership with Cineca, we were able to train and fine-tune Italia 9B at scale, using thousands of GPUs on the Leonardo supercomputer, one of the most advanced and high-performing computing infrastructures in the world.

## Training

Italia 9B was trained from scratch in Italian on trillions of tokens, using a heterogeneous mix of data: public sources, synthetic data, and domain-specific content provided by our commercial partners. Trained natively in Italian rather than on text translated from English, Italia 9B captures Italian linguistic and cultural nuances with unprecedented precision. More than 90% of the pre-training data consists of Italian text, with the remaining portion in English; this allows Italia to remain fully proficient in English and perform well on translation tasks.

The model also underwent a post-training process that includes both supervised fine-tuning and direct preference optimization, to enhance instruction-following capabilities and ensure robust safety measures.

The pre-training data has a cutoff date of December 2023: all textual information used to train the model was collected up to that point, so the model reflects the most recent linguistic and contextual knowledge available at training time.

## Benchmarks

Existing benchmarks for evaluating language models are designed primarily for the English-speaking ecosystem: their questions reflect elements, concepts, and structures typical of American and British culture that are not represented in native Italian training sources. We are collaborating with leading institutions in Italy to **develop a benchmarking system** tailored specifically to evaluating native Italian models. Even so, Italia demonstrated nearly state-of-the-art performance among models of a similar size when assessed against benchmarks testing common sense, language understanding, and logical reasoning. The results below were generated with the lm-evaluation-harness.
| Benchmark | Italia 9B - Instruct - v0.1 |
|---|---|
| xcopa_it | 0.73 |
| lambada_openai_mt_it (perplexity) | 40.6 |
| lambada_openai_mt_it (acc) | 0.43 |
| m_mmlu_it (5-shot) | 0.42 |
| arc_it (5-shot) | 0.43 |
| belebele_ita_Latn (5-shot) | 0.46 |
| hellaswag_it (5-shot) | 0.55 |
| truthfulqa_it_mc1 (0-shot) | 0.30 |
| truthfulqa_it_mc2 (0-shot) | 0.42 |
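As a rough guide to reproducing these numbers, the sketch below uses the Python API of EleutherAI's lm-evaluation-harness (`pip install lm-eval`). It covers only the 5-shot group of tasks from the table; the task names are taken from the table above, and exact scores may vary with harness version and prompt formatting.

```python
# Hypothetical reproduction sketch with EleutherAI's lm-evaluation-harness.
# Only the 5-shot tasks from the table are shown; run the 0-shot tasks
# (e.g. truthfulqa_it_mc1, truthfulqa_it_mc2) separately with num_fewshot=0.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=iGeniusAI/Italia-9B-Instruct-v0.1,dtype=bfloat16",
    tasks=["m_mmlu_it", "arc_it", "belebele_ita_Latn", "hellaswag_it"],
    num_fewshot=5,  # matches the 5-shot setting reported in the table
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, stderr, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```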