Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Abstract
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope MathPile can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile along with the scripts used for processing, to facilitate future developments in this field.
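To make the contamination-detection step concrete: the abstract does not spell out the matching rule, so the following is only a minimal sketch of one common approach (verbatim n-gram overlap against benchmark test sets), not the authors' actual pipeline. The n-gram size (13) and the "any overlap" flagging rule are assumptions chosen for illustration.

# Minimal sketch of n-gram-based contamination detection. This is NOT
# MathPile's actual pipeline; the n-gram size and flagging rule are
# assumptions for illustration only.
def ngrams(text, n=13):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(test_examples, n=13):
    # Union of all n-grams appearing across the benchmark test sets.
    index = set()
    for example in test_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(document, benchmark_index, n=13):
    # Flag a corpus document if any of its n-grams matches a test-set n-gram.
    return not ngrams(document, n).isdisjoint(benchmark_index)

Documents flagged this way would then be dropped (or have the overlapping spans removed) before pretraining.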
Community
Hi, our data is now open-sourced at https://huggingface.co/datasets/GAIR/MathPile.
We sincerely hope to receive your feedback and suggestions for this work, including but not limited to feedback on data quality, comments on the paper, and discussions on technical details, among other aspects. Please feel free to leave any comments below.
Why CC-BY-NC-SA?
Hi, because some documents are licensed for non-commercial use only, we have released our work under the CC BY-NC-SA 4.0 license. Are you looking for a dataset that is friendlier to commercial use?
I would recommend it. Creative Commons has great licenses, but I would strongly consider allowing commercial use as well.
Thanks for your feedback. The version for commercial use is coming soon! :)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (2023)
- Ziya2: Data-centric Learning is All LLMs Need (2023)
- Paloma: A Benchmark for Evaluating Language Model Fit (2023)
- YUAN 2.0: A Large Language Model with Localized Filtering-based Attention (2023)
- YAYI 2: Multilingual Open-Source Large Language Models (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Hi all, the commercial-use version of MathPile is coming. We are considering which license to use for it. Any recommendations?
Hi all, the commercial-use version of MathPile is out; it is available at https://huggingface.co/datasets/GAIR/MathPile_Commercial
Feel free to train your models and build (commercial) applications. Any feedback is welcome.
When I was trying to preprocess the data, I encountered a KeyError.
This code was working:
from datasets import load_dataset
import re

# The dataset-loading call was in an earlier cell; something like this
# (the exact arguments are an assumption -- see the dataset card):
dataset = load_dataset("GAIR/MathPile", split="train")

def preprocess_latex(document):
    # Strip LaTeX comments, then commands (with an optional brace
    # argument), then collapse runs of whitespace
    document = re.sub(r'%.*', '', document)
    document = re.sub(r'\\[a-zA-Z]+(\{[^}]*\})?', '', document)
    document = re.sub(r'\s+', ' ', document).strip()
    return document

# Preprocess and write to a temporary file
temp_file_path = "temp_preprocessed_dataset.txt"
with open(temp_file_path, 'w', encoding='utf-8') as f:
    for example in dataset:
        # Assuming 'text' is the key containing LaTeX content; adjust if necessary
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')
It gave me a temp_preprocessed_dataset.txt file with 17 GB of data, and then it stopped and produced this:
KeyError Traceback (most recent call last)
Cell In[11], line 16
13 with open(temp_file_path, 'w', encoding='utf-8') as f:
14 for example in dataset:
15 # Assuming 'text' is the key containing LaTeX content; adjust if necessary
---> 16 preprocessed_text = preprocess_latex(example['text'])
17 f.write(preprocessed_text + '\n')
KeyError: 'text'
I used the following to log the problematic entries to a separate error file; that file came to 1.02 GB:
with open(temp_file_path, 'w', encoding='utf-8') as f, open('errors_log.txt', 'w', encoding='utf-8') as error_log:
    for example in dataset:
        if 'text' not in example:
            # Log the problematic example for further investigation
            error_log.write(str(example) + '\n')
            continue
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')
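For anyone hitting the same error: using .get() instead of direct key access avoids the KeyError on rows whose schema lacks a 'text' field. A minimal sketch, assuming the load_dataset arguments below match the dataset card (streaming mode yields raw dicts, which is when missing keys surface as KeyError) and reusing preprocess_latex from above:

from datasets import load_dataset

# A sketch, not a verified recipe: the exact load_dataset arguments for
# MathPile (config, split, streaming support) are assumptions; check the
# dataset card before running.
dataset = load_dataset("GAIR/MathPile", split="train", streaming=True)

with open("temp_preprocessed_dataset.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        text = example.get("text")  # .get() returns None instead of raising
        if not isinstance(text, str) or not text:
            continue  # skip rows without usable text instead of crashing
        f.write(preprocess_latex(text) + "\n")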
When I go to that page, I still see references to CC BY-NC-SA 4.0, which is a non-commercial license.
" You need to agree to share your contact information to access this dataset
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
By using this data, you agree to comply with the original usage licenses of all sources contributing to MathPile. If the source data of this dataset is subject to a more restrictive license than CC BY-NC-SA 4.0, then this dataset conforms to that more stringent licensing. In all other scenarios, it is governed by the CC BY-NC-SA 4.0 license. Access to this dataset is granted automatically once you accept the license terms and complete all the required fields below.
By agreeing you accept to share your contact information (email and username) with the repository authors."
Sorry for the late reply and the confusing README. It has been fixed; please take a look.