Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist Paper • 2407.08733 • Published Jul 11 • 20
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation Paper • 2312.07424 • Published Dec 12, 2023 • 7
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization Paper • 2306.05087 • Published Jun 8, 2023 • 6
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts Paper • 2306.04528 • Published Jun 7, 2023 • 3