| Benchmark | Description |
| --- | --- |
| MMLU (Massive Multitask Language Understanding) | Evaluates multitask understanding across 57 subjects spanning academic and professional domains |
| HELM (Holistic Evaluation of Language Models) | A comprehensive evaluation framework developed by Stanford, covering many scenarios and metrics, including accuracy, robustness, and fairness |
| BIG-Bench (BB) | A large-scale collaborative benchmark of more than 200 tasks, led by Google |
| BBH (BIG-Bench Hard) | A subset of 23 especially challenging BIG-Bench tasks on which earlier models lagged behind humans |
| GSM8K (Grade School Math 8K) | Evaluates multi-step reasoning on grade-school math word problems |
| MATH | Competition-level mathematics problems (largely from high school competitions) for evaluating advanced problem-solving ability |
| HumanEval | An OpenAI benchmark of 164 hand-written Python problems that scores the functional correctness of generated code |
| ARC (AI2 Reasoning Challenge) | Grade-school science exam questions that test scientific reasoning |
| C-Eval | A comprehensive evaluation suite designed for Chinese language models, spanning 52 disciplines |
| GLUE/SuperGLUE | General-purpose benchmark suites for natural language understanding tasks |
| TruthfulQA | Measures whether model answers are truthful and resistant to common misconceptions, rather than repeating plausible falsehoods |
| FLORES | A multilingual benchmark for evaluating machine translation quality |
| AGIEval | Human-centric standardized exam questions (e.g., college entrance and law exams) that probe near-human cognitive abilities |
| HellaSwag | Evaluates commonsense reasoning through adversarially filtered sentence-completion problems |
| Winogrande | Winograd-schema-style pronoun resolution requiring commonsense knowledge |
| MT-Bench | Evaluates multi-turn conversational ability, typically scored with a strong LLM as judge |
| MLLM benchmarks (e.g., LLaVA-Bench) | Benchmarks for assessing large multimodal models’ image understanding and visual instruction following |
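
To make concrete how such benchmarks are typically consumed, the sketch below loads the GSM8K test split and computes exact-match accuracy on the final numeric answer. It is a minimal illustration, not part of any benchmark's official tooling: the dataset name (`gsm8k`, config `main`) and its `question`/`answer` fields follow the public Hugging Face copy, and `generate_answer` is a placeholder for whatever model inference function the caller supplies.

```python
# Minimal sketch of scoring a model on GSM8K with exact match on the final number.
# Dataset name ("gsm8k", config "main") and its "question"/"answer" fields follow
# the public Hugging Face copy; generate_answer is whatever inference function
# the caller supplies -- it is not part of any benchmark API.
import re
from typing import Callable, Optional

from datasets import load_dataset


def extract_final_number(text: str) -> Optional[str]:
    """GSM8K references end with '#### <number>'; otherwise take the last number."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    if match:
        raw = match.group(1)
    else:
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        raw = numbers[-1] if numbers else None
    return raw.replace(",", "").rstrip(".") if raw else None


def evaluate_gsm8k(generate_answer: Callable[[str], str], limit: int = 100) -> float:
    """Run the model on `limit` test questions and return exact-match accuracy."""
    test_set = load_dataset("gsm8k", "main", split="test").select(range(limit))
    correct = 0
    for example in test_set:
        prediction = extract_final_number(generate_answer(example["question"]))
        reference = extract_final_number(example["answer"])
        correct += int(prediction is not None and prediction == reference)
    return correct / len(test_set)
```

In practice, harnesses such as EleutherAI's lm-evaluation-harness or HELM's own toolkit wrap this kind of loop with standardized prompts, few-shot formatting, and metric definitions, which is what keeps reported scores comparable across models and papers.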