igormolybog's Collections: evals
Holistic Evaluation of Text-To-Image Models (arXiv:2311.04287)
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)
Trusted Source Alignment in Large Language Models (arXiv:2311.06697)
DiLoCo: Distributed Low-Communication Training of Language Models (arXiv:2311.08105)
Instruction-Following Evaluation for Large Language Models (arXiv:2311.07911)
GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv:2311.12022)
GAIA: a benchmark for General AI Assistants (arXiv:2311.12983)
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models (arXiv:2312.04724)
Evaluation of Large Language Models for Decision Making in Autonomous Driving (arXiv:2312.06351)
PromptBench: A Unified Library for Evaluation of Large Language Models (arXiv:2312.07910)
TrustLLM: Trustworthiness in Large Language Models (arXiv:2401.05561)
OLMo: Accelerating the Science of Language Models (arXiv:2402.00838)
Can Large Language Models Understand Context? (arXiv:2402.00858)
Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163)
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
Long-context LLMs Struggle with Long In-context Learning (arXiv:2404.02060)
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (arXiv:2410.05363)
LongGenBench: Long-context Generation Benchmark (arXiv:2410.04199)
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments (arXiv:2410.05254)