Evaluation benchmark catalog

Summary: Browse the evaluation benchmarks available through LLM Evaluation Jobs

LLM Evaluation Jobs is in Preview for W&B Multi-tenant Cloud. Compute is free during the preview period.

This page lists, by category, the evaluation benchmarks that LLM Evaluation Jobs provides.

To run certain benchmarks, a team admin must add the required API keys as team-scoped secrets. Any team member can specify the secret when configuring an evaluation job.

If a benchmark has Yes in the OpenAI Scorer column, it uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job that includes such a benchmark, set the Scorer API key field to that secret.
If a benchmark has Yes in the Gated Hugging Face Dataset column, it requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset in Hugging Face, create a Hugging Face user access token, and configure a team secret with that token. When you configure an evaluation job that includes such a benchmark, set the Hugging Face Token field to that secret.
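
The two requirements above can be checked before configuring a job. The following is a minimal Python sketch, not part of any W&B API: the `required_secrets` helper and the two sets are hypothetical names, and the task IDs are copied by hand from the tables on this page (rows marked Yes in the scorer or gated-dataset column), so they may drift from the live catalog.

```python
# Hypothetical helper: task IDs below are transcribed from this page's tables.
# Rows marked "Yes" in the OpenAI Scorer column.
OPENAI_SCORER_TASKS = {
    "simpleqa", "agentharm", "agentharm_benign", "mask", "sosbench", "xstest",
    "healthbench", "healthbench_consensus", "healthbench_hard",
}
# Rows marked "Yes" in the Gated Hugging Face Dataset column.
GATED_HF_TASKS = {"hle", "lingoly", "lingoly_too", "mask", "personality_TRAIT"}


def required_secrets(task_ids):
    """Map each task ID to the job fields that need a team secret."""
    needs = {}
    for task_id in task_ids:
        fields = []
        if task_id in OPENAI_SCORER_TASKS:
            fields.append("Scorer API key")
        if task_id in GATED_HF_TASKS:
            fields.append("Hugging Face Token")
        needs[task_id] = fields
    return needs


if __name__ == "__main__":
    for task_id, fields in required_secrets(["gsm8k", "mask", "simpleqa"]).items():
        print(f"{task_id}: {', '.join(fields) or 'no extra secrets'}")
```

A lookup like this only tells you which fields to fill in; a team admin still has to create the secrets themselves.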

Knowledge#

Evaluate factual knowledge across various domains like science, language, and general reasoning.

Evaluation	Task ID	OpenAI Scorer	Gated Hugging Face Dataset	Description
BoolQ	`boolq`			Boolean yes/no questions from natural language queries
GPQA Diamond	`gpqa_diamond`			Graduate-level science questions (highest quality subset)
HLE	`hle`		Yes	Humanity's Last Exam: expert-written questions across many academic subjects
Lingoly	`lingoly`		Yes	Linguistics olympiad problems
Lingoly Too	`lingoly_too`		Yes	Extended linguistics challenge problems
MMIU	`mmiu`			Multimodal Multi-image Understanding benchmark
MMLU (0-shot)	`mmlu_0_shot`			Massive Multitask Language Understanding, zero-shot (no in-context examples)
MMLU (5-shot)	`mmlu_5_shot`			Massive Multitask Language Understanding with 5 in-context examples
MMLU-Pro	`mmlu_pro`			More challenging version of MMLU
ONET M6	`onet_m6`			Questions from Thailand's Ordinary National Educational Test (O-NET) for grade 12 (M6)
PAWS	`paws`			Paraphrase Adversaries from Word Scrambling: paraphrase identification on sentence pairs
SevenLLM MCQ (English)	`sevenllm_mcq_en`			Cybersecurity multiple choice questions in English
SevenLLM MCQ (Chinese)	`sevenllm_mcq_zh`			Cybersecurity multiple choice questions in Chinese
SevenLLM QA (English)	`sevenllm_qa_en`			Cybersecurity question answering in English
SevenLLM QA (Chinese)	`sevenllm_qa_zh`			Cybersecurity question answering in Chinese
SimpleQA	`simpleqa`	Yes		Straightforward factual question answering
SimpleQA Verified	`simpleqa_verified`			Verified subset of SimpleQA with validated answers
WorldSense	`worldsense`			Evaluates understanding of world knowledge and common sense
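
The 0-shot and 5-shot MMLU variants listed above differ in whether solved examples precede the question in the prompt. The following is a minimal Python sketch of that idea only: `build_prompt`, the prompt layout, and the sample questions are illustrative assumptions, not the template that LLM Evaluation Jobs or the underlying benchmark actually uses.

```python
def build_prompt(question, choices, examples=()):
    """Format a multiple-choice question, optionally preceded by solved examples.

    `examples` is a sequence of (question, choices, answer_letter) tuples.
    An empty sequence gives a zero-shot prompt; five tuples give a 5-shot prompt.
    """
    def render(q, opts, answer=None):
        lines = [f"Question: {q}"]
        lines += [f"{letter}. {opt}" for letter, opt in zip("ABCD", opts)]
        # Solved examples show their answer; the final question leaves it blank.
        lines.append(f"Answer: {answer}" if answer else "Answer:")
        return "\n".join(lines)

    blocks = [render(q, opts, ans) for q, opts, ans in examples]
    blocks.append(render(question, choices))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up placeholder content, not real benchmark items.
    shots = [("1 + 1 = ?", ["1", "2", "3", "4"], "B")] * 5
    print(build_prompt("2 + 2 = ?", ["3", "4", "5", "6"]))          # zero-shot
    print("-" * 20)
    print(build_prompt("2 + 2 = ?", ["3", "4", "5", "6"], shots))   # 5-shot
```

Scores from the two variants are not directly comparable, because the 5-shot prompt gives the model examples of the expected answer format.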

Reasoning#

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.

Evaluation	Task ID	Description
AGIE AQUA-RAT	`agie_aqua_rat`	Algebraic question answering with rationales
AGIE LogiQA (English)	`agie_logiqa_en`	Logical reasoning questions in English
AGIE LSAT Analytical Reasoning	`agie_lsat_ar`	LSAT analytical reasoning (logic games) problems
AGIE LSAT Logical Reasoning	`agie_lsat_lr`	LSAT logical reasoning questions
ARC Challenge	`arc_challenge`	Challenging science questions requiring reasoning (AI2 Reasoning Challenge)
ARC Easy	`arc_easy`	Easier set of science questions from the ARC dataset
BBH	`bbh`	BIG-Bench Hard: challenging tasks from BIG-Bench
CoCoNot	`coconot`	Contextual noncompliance benchmark: tests whether models decline or comply with requests appropriately
CommonsenseQA	`commonsense_qa`	Commonsense reasoning questions
HellaSwag	`hellaswag`	Commonsense natural language inference
MUSR	`musr`	Multistep Soft Reasoning: multi-step reasoning over natural language narratives
PIQA	`piqa`	Physical commonsense reasoning
WinoGrande	`winogrande`	Commonsense reasoning via pronoun resolution

Math#

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.

Evaluation	Task ID	Description
AGIE Math	`agie_math`	Advanced mathematical reasoning from AGIE benchmark suite
AGIE SAT Math	`agie_sat_math`	SAT mathematics questions
AIME 2024	`aime2024`	American Invitational Mathematics Examination problems from 2024
AIME 2025	`aime2025`	American Invitational Mathematics Examination problems from 2025
GSM8K	`gsm8k`	Grade School Math 8K: multi-step math word problems
InfiniteBench Math Calc	`infinite_bench_math_calc`	Mathematical calculations in long contexts
InfiniteBench Math Find	`infinite_bench_math_find`	Finding mathematical patterns in long contexts
MATH	`math`	Competition-level mathematics problems
MGSM	`mgsm`	Multilingual Grade School Math

Code#

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.

Evaluation	Task ID	Description
BFCL	`bfcl`	Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities
InfiniteBench Code Debug	`infinite_bench_code_debug`	Long-context code debugging tasks
InfiniteBench Code Run	`infinite_bench_code_run`	Long-context code execution prediction

Reading#

Evaluate reading comprehension and information extraction from complex texts.

Evaluation	Task ID	Description
AGIE LSAT Reading Comprehension	`agie_lsat_rc`	LSAT reading comprehension passages and questions
AGIE SAT English	`agie_sat_en`	SAT reading and writing questions with passages
AGIE SAT English (No Passage)	`agie_sat_en_without_passage`	SAT English questions without accompanying passages
DROP	`drop`	Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning
RACE-H	`race_h`	Reading comprehension from English exams (high difficulty)
SQuAD	`squad`	Stanford Question Answering Dataset: extractive question answering on Wikipedia articles

Long context#

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.

Evaluation	Task ID	Description
InfiniteBench KV Retrieval	`infinite_bench_kv_retrieval`	Key-value retrieval in long contexts
InfiniteBench LongBook (English)	`infinite_bench_longbook_choice_eng`	Multiple choice questions on long books
InfiniteBench LongDialogue QA (English)	`infinite_bench_longdialogue_qa_eng`	Question answering over long dialogues
InfiniteBench Number String	`infinite_bench_number_string`	Number pattern recognition in long sequences
InfiniteBench Passkey	`infinite_bench_passkey`	Retrieval of information from long context
NIAH	`niah`	Needle in a Haystack: long-context retrieval test

Safety#

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.

Evaluation	Task ID	OpenAI Scorer	Gated Hugging Face Dataset	Description
AgentHarm	`agentharm`	Yes		Tests model resistance to harmful agent behavior and misuse scenarios
AgentHarm Benign	`agentharm_benign`	Yes		Benign baseline for AgentHarm to measure false positive rates
Agentic Misalignment	`agentic_misalignment`			Evaluates potential misalignment in agentic behavior
AHB	`ahb`			Animal Harm Benchmark: evaluates how models handle potential harm to animals
AIRBench	`air_bench`			AI risk benchmark: safety tests across risk categories drawn from regulations and policies
BBEH	`bbeh`			BIG-Bench Extra Hard: harder successor to the BIG-Bench Hard reasoning tasks
BBEH Mini	`bbeh_mini`			Smaller version of the BBEH benchmark
BBQ	`bbq`			Bias Benchmark for Question Answering
BOLD	`bold`			Bias in Open-Ended Language Generation Dataset
CYSE3 Visual Prompt Injection	`cyse3_visual_prompt_injection`			Tests resistance to visual prompt injection attacks
Make Me Pay	`make_me_pay`			Tests resistance to financial scam and fraud scenarios
MASK	`mask`	Yes	Yes	Tests model honesty: whether statements stay consistent with the model’s own beliefs under pressure
Personality BFI	`personality_BFI`			Big Five personality trait assessment
Personality TRAIT	`personality_TRAIT`		Yes	Comprehensive personality trait evaluation
SOSBench	`sosbench`	Yes		Tests safety alignment on hazardous scientific knowledge requests
StereoSet	`stereoset`			Measures stereotypical biases in language models
StrongREJECT	`strong_reject`			Tests model’s ability to reject harmful requests
Sycophancy	`sycophancy`			Evaluates tendency toward sycophantic behavior
TruthfulQA	`truthfulqa`			Tests model truthfulness and resistance to falsehoods
UCCB	`uccb`			Uganda Cultural and Cognitive Benchmark
WMDP Bio	`wmdp_bio`			Tests hazardous knowledge in biology
WMDP Chem	`wmdp_chem`			Tests hazardous knowledge in chemistry
WMDP Cyber	`wmdp_cyber`			Tests hazardous knowledge in cybersecurity
XSTest	`xstest`	Yes		Exaggerated safety test for over-refusal detection

Domain-Specific#

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.

Evaluation	Task ID	OpenAI Scorer	Description
ChemBench	`chembench`		Chemistry knowledge and problem-solving benchmark
HealthBench	`healthbench`	Yes	Healthcare and medical knowledge evaluation
HealthBench Consensus	`healthbench_consensus`	Yes	Healthcare questions with expert consensus
HealthBench Hard	`healthbench_hard`	Yes	Challenging healthcare scenarios
LabBench Cloning Scenarios	`lab_bench_cloning_scenarios`		Laboratory experiment planning and cloning
LabBench DBQA	`lab_bench_dbqa`		Database question answering for lab scenarios
LabBench FigQA	`lab_bench_figqa`		Figure interpretation in scientific contexts
LabBench LitQA	`lab_bench_litqa`		Literature-based question answering for research
LabBench ProtocolQA	`lab_bench_protocolqa`		Laboratory protocol understanding
LabBench SeqQA	`lab_bench_seqqa`		Biological sequence analysis questions
LabBench SuppQA	`lab_bench_suppqa`		Supplementary material interpretation
LabBench TableQA	`lab_bench_tableqa`		Table interpretation in scientific papers
MedQA	`medqa`		Medical licensing exam questions
PubMedQA	`pubmedqa`		Biomedical question answering from research abstracts
SEC-QA v1	`sec_qa_v1`		Computer security question answering (SecQA v1)
SEC-QA v1 (5-shot)	`sec_qa_v1_5_shot`		SecQA v1 with 5 in-context examples
SEC-QA v2	`sec_qa_v2`		Updated version of the SecQA computer security benchmark
SEC-QA v2 (5-shot)	`sec_qa_v2_5_shot`		SecQA v2 with 5 in-context examples

Multimodal#

Evaluate vision and language understanding combining visual and textual inputs.

Evaluation	Task ID	Description
DocVQA	`docvqa`	Document Visual Question Answering: questions about document images
MathVista	`mathvista`	Mathematical reasoning in visual contexts
MMMU Multiple Choice	`mmmu_multiple_choice`	Multimodal understanding with multiple choice format
MMMU Open	`mmmu_open`	Multimodal understanding with open-ended responses
V*Star Bench Attribute Recognition	`vstar_bench_attribute_recognition`	Visual attribute recognition tasks
V*Star Bench Spatial Relationship	`vstar_bench_spatial_relationship_reasoning`	Spatial reasoning with visual inputs

Instruction Following#

Evaluate adherence to specific instructions and formatting requirements.

Evaluation	Task ID	OpenAI Scorer	Gated Hugging Face Dataset	Description
IFEval	`ifeval`			Tests precise instruction-following capabilities

System#

Basic system validation and pre-flight checks.

Evaluation	Task ID	OpenAI Scorer	Gated Hugging Face Dataset	Description
Pre-Flight	`pre_flight`			Basic system check and validation test

Next steps#

Evaluate a model checkpoint
Evaluate a hosted API model
View details about specific benchmarks at AISI Inspect Evals

Link last verified June 7, 2026.

Source: Weights & Biases Docs

Link last verified: 2026-03-04