GAIR-ProX

https://gair-nlp.github.io/ProX/

AI & ML interests

NLP Research

Organization Card

GAIR-ProX, a subsidiary of GAIR, leads the 🫐 ProX Project, which aims to make pre-training more efficient by using language models to refine corpus documents at scale. 🫐 ProX expresses its refining operations (e.g., document-level filtering and chunk-level cleaning) as scalable, executable programs, with the goal of improving pre-training data quality and, ultimately, producing more robust and efficient language models.
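To make the idea of refining as an executable program concrete, here is a minimal Python sketch of the two stages named above. It is an illustration under stated assumptions, not the ProX implementation: the operation names (`RemoveLines`, `Normalize`), the helper functions, and the hard-coded program are all hypothetical. In the project as described, a language model would produce the keep/drop decisions and the per-chunk programs.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical chunk-level operations; names are illustrative only.
@dataclass
class RemoveLines:
    start: int  # first line to drop (0-based, inclusive)
    end: int    # last line to drop (0-based, inclusive)


@dataclass
class Normalize:
    source: str  # substring to replace
    target: str  # replacement text


Op = Union[RemoveLines, Normalize]


def filter_documents(docs: List[str], keep_flags: List[bool]) -> List[str]:
    """Document-level filtering: keep only documents flagged as worth keeping."""
    return [doc for doc, keep in zip(docs, keep_flags) if keep]


def apply_chunk_program(chunk: str, program: List[Op]) -> str:
    """Chunk-level cleaning: execute a small program of edits on one chunk."""
    lines = chunk.split("\n")

    # Collect every line index marked for removal, using the original numbering.
    dropped = set()
    for op in program:
        if isinstance(op, RemoveLines):
            dropped.update(range(op.start, op.end + 1))
    text = "\n".join(line for i, line in enumerate(lines) if i not in dropped)

    # Apply string normalizations to the remaining text, in program order.
    for op in program:
        if isinstance(op, Normalize):
            text = text.replace(op.source, op.target)
    return text


# Assumed inputs: in practice these decisions would come from a refining model.
docs = ["Home | Login\nProX refines  data.\nShare this page", "click here click here"]
kept = filter_documents(docs, keep_flags=[True, False])
program = [RemoveLines(0, 0), RemoveLines(2, 2), Normalize("  ", " ")]
cleaned = [apply_chunk_program(doc, program) for doc in kept]
```

Representing each edit as data keeps the cleaning step cheap to run over a large corpus once the decisions exist, since executing a program is plain string manipulation.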

Read our technical report!

models 14

gair-prox/web-chunk-refining-lm

Text Generation • Updated Oct 10 • 73 • 4

gair-prox/math-chunk-refining-lm

Text Generation • Updated Oct 10 • 32 • 1

gair-prox/math-doc-refining-lm

Text Generation • Updated Oct 10 • 320 • 2

gair-prox/web-doc-refining-lm

Text Generation • Updated Oct 10 • 70 • 4

gair-prox/RedPJ-ProX-1.7B

Updated Oct 10 • 3 • 2

gair-prox/RedPJ-ProX-0.3B

Updated Oct 10 • 15 • 2

gair-prox/C4-ProX-1.7B

Updated Oct 10 • 4 • 1

gair-prox/CodeLlama-7B-ProXMath

Updated Oct 10 • 14 • 1

gair-prox/TinyLlama-1.1B-ProXMath

Updated Oct 10 • 29 • 2

gair-prox/Llama-2-7B-ProXMath

Text Generation • Updated Oct 10 • 36 • 1

datasets 4

gair-prox/RedPajama-pro

Viewer • Updated Sep 26 • 10.2M • 765 • 4

gair-prox/c4-pro

Viewer • Updated Sep 26 • 40.1M • 684 • 5

gair-prox/open-web-math-pro

Viewer • Updated Sep 26 • 2.58M • 938 • 9

gair-prox/FineWeb-pro

Viewer • Updated Sep 26 • 63.1M • 2.54k • 22