Abstract
We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
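Since all artifacts are openly released, the checkpoints can presumably be loaded with the Hugging Face `transformers` library. A minimal sketch follows, assuming the weights are hosted under the EleutherAI organization as `EleutherAI/llemma_7b` (the exact repository ID is an assumption, not stated in the abstract):

```python
# Minimal sketch of loading a released Llemma checkpoint with
# Hugging Face transformers. The repository ID below is an assumption
# about where the released weights are hosted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/llemma_7b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/llemma_7b",
    torch_dtype=torch.bfloat16,  # the paper trains in bf16
    device_map="auto",           # requires the `accelerate` package
)

# Llemma is a base model, so prompt it completion-style.
prompt = "Problem: What is the remainder when $2^{10}$ is divided by 7?\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```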
Community
Amazing
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text (2023)
- TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models (2023)
- FIMO: A Challenge Formal Dataset for Automated Theorem Proving (2023)
- Qwen Technical Report (2023)
- Code Llama: Open Foundation Models for Code (2023)
Introduces Llemma, an LLM for mathematical reasoning: continued pretraining of Code Llama on Proof-Pile-2 (scientific papers, mathematical web data, and mathematical code); releases 7B and 34B models (the latter outperforms Google's Minerva on math problems), showing that a domain-specific language model can give better performance at a smaller size.
- Data: the custom code dataset AlgebraicStack, OpenWebMath, the arXiv subset of RedPajama, and general data sources.
- Model: standard decoder-only Llama 2 architecture (initialized from Code Llama, which was trained on code), trained with the autoregressive language modeling objective on Proof-Pile-2.
- Training: bf16 mixed precision using GPT-NeoX with tensor parallelism and ZeRO sharding; Flash Attention 2 for better throughput and lower memory usage; RoPE for long-context fine-tuning.
- Results: outperforms open models on chain-of-thought mathematical problem solving (GSM8k, OCW, SAT, etc.) and matches Minerva; better than Code Llama at tool use (GSM8k + Python). Best perplexity comes from a 2:4:1 arXiv-to-web-to-code mixture (sketched below).
- The appendix covers dataset creation (composition and processing), evaluation details, and additional results.

From EleutherAI and CMU.
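A minimal sketch of what the 2:4:1 arXiv:web:code mixture implies for data sampling, assuming each training document is drawn from one of the three Proof-Pile-2 subsets with probability proportional to its mixture weight. The subset names and the `sample_source` helper are illustrative, not the authors' actual pipeline:

```python
# Hypothetical illustration of the 2:4:1 arXiv:web:code data mixture:
# each training document is drawn from one of the three subsets with
# probability proportional to its mixture weight.
import random

MIXTURE = {"arxiv": 2, "web": 4, "code": 1}  # 2:4:1 weights from the paper

def sample_source(weights=MIXTURE, rng=random):
    """Pick a data source with probability proportional to its weight."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Empirical draw frequencies approach 2/7, 4/7, and 1/7 respectively.
counts = {s: 0 for s in MIXTURE}
for _ in range(70_000):
    counts[sample_source()] += 1
print(counts)
```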
Very nice
Where can we access this model on Hugging Face?