Create README.md

4235e3e verified about 1 month ago

4.19 kB

	---
	pipeline_tag: text-generation
	license: other
	language:
	- en
	tags:
	- math
	datasets:
	- internlm/Lean-Workbook
	- internlm/Lean-Github
	---

	# InternLM2.5-Step-Prover

	<div align="center">

	<img src="https://raw.githubusercontent.com/InternLM/InternLM/main/assets/logo.svg" width="200"/>
	<div> </div>
	<div align="center">
	<b><font size="5">InternLM-Math</font></b>
	<sup>
	<a href="https://internlm.intern-ai.org.cn/">
	<i><font size="4">HOT</font></i>
	</a>
	</sup>
	<div> </div>
	</div>

	A state-of-the-art LEAN4 step prover.

	[💻 Github](https://github.com/InternLM/InternLM-Math) [📊Dataset](https://huggingface.co/datasets/internlm/Lean-Github) [📖 Paper](https://arxiv.org/abs/2410.15700)
	</div>

	InternLM2.5-Step-Prover-Critic is a 1.8B critic model which achieves state-of-the-art performances on MiniF2F, ProofNet, and Putnam math benchmarks, showing its formal math proving ability in multiple domains.

	# Dialogue Example
	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	model = AutoModel.from_pretrained(
	"internlm/internlm2_5-step-prover-critic",
	device_map="cuda",
	torch_dtype=torch.float16,
	trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-step-prover-critic", trust_remote_code=True)

	chat_1 = [
	{"role": "user", "content": "Which state is closer to 'no goals'?"},
	{"role": "assistant", "content": "no goals"}
	]
	chat_2 = [
	{"role": "user", "content": "Which state is closer to 'no goals'?"},
	{"role": "assistant", "content": "x : ℕ\nh₀ : ↑x + 4 / 100 * ↑x = 598\n⊢ 100 * x = 100 * 575"}
	]

	score1 = model.get_score(tokenizer, chat_1)
	score2 = model.get_score(tokenizer, chat_2)
	print("score1: ", score1)
	print("score2: ", score2)
	```

	# Performance

	## MiniF2F
	\| Method \| Model size \| Pass \| miniF2F-valid \| miniF2F-test \|
	\|--------\|------------\|------\|---------------\|--------------\|
	\| Whole-Proof Generation Methods \|
	\| GPT-4-turbo 0409 \| - \| 64 \| 25.4% \| 23.0% \|
	\| DeepSeekMath-Base \| 7B \| 128 \| 25.4% \| 27.5% \|
	\| DeepSeek-Prover \| 7B \| 1 \| - \| 30.0% \|
	\| \| \| 64 \| - \| 46.3% \|
	\| \| \| 128 \| - \| 46.3% \|
	\| \| \| 8192 \| - \| 48.8% \|
	\| \| \| 65536 \| - \| 50.0% \|
	\| \| \| cumulative \| 60.2% \| 52.0% \|
	\| DeepSeek-Prover-1.5 \| 7B \| 32 \| - \| 63.5% \|
	\| TheoremLlama \| - \| cumulative \| 36.5% \| 33.6% \|
	\| Tree Search Methods \|
	\| COPRA (GPT-3.5) \| - \| 1 \| - \| 9.0% \|
	\| COPRA (GPT-4) \| - \| 1 \| - \| 26.6% \|
	\| DSP(Isabelle) \| 540B \| 100 \| 42.6% \| 38.9% \|
	\| Proof Artifact Co-Training \| 837M \| 1 \| 23.9% \| 24.6% \|
	\| \| \| 8 \| 29.3% \| 29.2% \|
	\| ReProver \| 229M \| 1 \| - \| 25.0% \|
	\| Llemma \| 7B \| 1 \| 26.2% \| 26.2% \|
	\| Llemma \| 34B \| 1 \| 27.9% \| 25.8% \|
	\| Curriculum Learning \| 837M \| 1 \| 33.6% \| 29.6% \|
	\| \| \| 8 \| 41.2% \| 34.5% \|
	\| \| \| 64 \| 47.3% \| 36.6% \|
	\| Hypertree Proof Search \| 600M \| cumulative \| 58.6% \| - \|
	\| \| \| 64 \| - \| 41.0% \|
	\| Lean-STaR \| 7B \| 64 \| - \| 46.3% \|
	\| InternLM2-Math \| 7B \| 1 \| 29.9% \| 30.3% \|
	\| InternLM2-Math-Plus \| 7B \| 1 \| - \| 43.4% \|
	\| InternLM2-Step-Prover \| 7B \| 1 \| 59.8% \| 48.8% \|
	\| InternLM2.5-Step-Prover \| 7B \| 1 \| 55.4% \| 47.3% \|
	\| InternLM2.5-Step-Prover+Critic \| 7B \| 256 \| 69.6% \| 65.9% \|


	## Proofnet & Putnam
	\| Method \| Model size \| Pass \| result \|
	\|--------\|------------\|------\|--------\|
	\| ProofNet benchmark \|
	\| ReProver \| 229M \| 1 \| 13.8% \|
	\| InternLM2-Step-Prover \| 7B \| 1 \| 18.1% \|
	\| InternLM2.5-Step-Prover \| 7B \| 256 \| 27.0% \|
	\| Putnam benchmark \|
	\| GPT-4 \| - \| 10 \| 1/640 \|
	\| COPRA (GPT-4) \| - \| 10 \| 1/640 \|
	\| DSP(Isabelle) \| 540B \| 10 \| 4/640 \|
	\| ReProver \| 229M \| 1 \| 0/640 \|
	\| InternLM2-Step-Prover \| 7B \| 1 \| 5/640 \|
	\| InternLM2.5-Step-Prover \| 7B \| 1 \| 6/640 \|


	# Citation and Tech Report
	```
	@misc{wu2024internlm25stepproveradvancingautomatedtheorem,
	title={InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems},
	author={Zijian Wu and Suozhi Huang and Zhejian Zhou and Huaiyuan Ying and Jiayu Wang and Dahua Lin and Kai Chen},
	year={2024},
	eprint={2410.15700},
	archivePrefix={arXiv},
	primaryClass={cs.AI},
	url={https://arxiv.org/abs/2410.15700},
	}
	```