| Benchmark | Main Purpose |
| --- | --- |
| MMLU (Massive Multitask Language Understanding) | Evaluates multitask understanding across 57 academic and professional subjects |
| HELM (Holistic Evaluation of Language Models) | A comprehensive evaluation framework developed by Stanford, covering multiple tasks and fairness assessments |
| BIG-Bench (BB) | A large-scale benchmark with over 200 tasks, developed by Google |
| BBH (BIG-Bench Hard) | A harder subset of BIG-Bench, focusing on its most challenging tasks |
| GSM8K (Grade School Math 8K) | Evaluates the ability to solve grade-school math word problems |
| MATH | A set of challenging competition-level math problems for evaluating problem-solving ability |
| HumanEval | A code-generation (programming ability) benchmark developed by OpenAI |
| ARC (AI2 Reasoning Challenge) | A reasoning challenge focused on science question answering |
| C-Eval | A comprehensive evaluation benchmark designed for Chinese-language models |
| GLUE / SuperGLUE | General-purpose standards for assessing natural language understanding |
| TruthfulQA | Tests whether model answers are truthful, measuring the tendency to reproduce common falsehoods (hallucinations) |
| FLORES | A multilingual benchmark for evaluating machine translation |
| AGIEval | Tests performance on difficult, human-oriented standardized exam tasks |
| HellaSwag | Assesses commonsense reasoning through sentence-completion tasks |
| Winogrande | Tests commonsense reasoning via Winograd-style pronoun resolution |
| MT-Bench | Evaluates the ability to engage in multi-turn conversations |
| MLLM benchmarks (e.g., LLaVA-Bench) | Assess the image-understanding capabilities of large multimodal models |
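
Several of the benchmarks above (MMLU, ARC, HellaSwag, Winogrande) are multiple-choice tasks that are ultimately scored the same way: the model selects one candidate answer per item, and the benchmark reports exact-match accuracy. The sketch below is a minimal, hypothetical illustration of that scoring loop; the `Example` record and the `pick_answer` model stub are assumptions for illustration, not part of any benchmark's official evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One multiple-choice item, in the style of MMLU/ARC-type benchmarks."""
    question: str
    choices: List[str]   # candidate answers
    answer: int          # index of the correct choice


def accuracy(examples: List[Example],
             pick_answer: Callable[[str, List[str]], int]) -> float:
    """Exact-match accuracy: fraction of items where the model's chosen
    choice index equals the gold index."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples
        if pick_answer(ex.question, ex.choices) == ex.answer
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Toy data and a stand-in "model" that always picks the first choice.
    data = [
        Example("2 + 2 = ?", ["3", "4", "5"], answer=1),
        Example("Capital of France?", ["Paris", "Rome", "Berlin"], answer=0),
    ]
    print(accuracy(data, lambda q, choices: 0))  # prints 0.5
```

Generation-style benchmarks such as GSM8K, HumanEval, and MT-Bench use different metrics (answer extraction, unit-test pass rates, or judge-based scoring), so the loop above applies only to the multiple-choice family.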