Evaluation benchmark catalog

Summary: Browse the evaluation benchmarks available through LLM Evaluation Jobs

LLM Evaluation Jobs is in Preview for W&B Multi-tenant Cloud. Compute is free during the preview period.

This page lists, by category, the evaluation benchmarks that LLM Evaluation Jobs provides.

To run certain benchmarks, a team admin must add the required API keys as team-scoped secrets. Any team member can specify the secret when configuring an evaluation job.

  • If a benchmark has Yes in the OpenAI Scorer column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job that includes such a benchmark, set the Scorer API key field to the secret.
  • If a benchmark has a link in the Gated Hugging Face Dataset column, the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset in Hugging Face, create a Hugging Face user access token, and configure a team secret with that token. When you configure an evaluation job that includes such a benchmark, set the Hugging Face Token field to the secret.
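The two rules above can be sketched as a small lookup. This is only an illustration: the `BenchmarkRequirements` class and the `required_secret_fields` function are hypothetical names, not part of any W&B API, and the flags are copied by hand from the catalog tables on this page. Only the field names (Scorer API key, Hugging Face Token) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRequirements:
    """Hypothetical record of one catalog row's requirement flags."""
    task_id: str
    openai_scorer: bool      # flagged in the OpenAI Scorer column
    gated_hf_dataset: bool   # flagged in the Gated Hugging Face Dataset column


def required_secret_fields(req: BenchmarkRequirements) -> list[str]:
    """Return the job-configuration fields that must be set to a team secret."""
    fields = []
    if req.openai_scorer:
        fields.append("Scorer API key")
    if req.gated_hf_dataset:
        fields.append("Hugging Face Token")
    return fields


# Two rows from this page: MASK is flagged in both columns, BoolQ in neither.
mask = BenchmarkRequirements("mask", openai_scorer=True, gated_hf_dataset=True)
boolq = BenchmarkRequirements("boolq", openai_scorer=False, gated_hf_dataset=False)

print(required_secret_fields(mask))   # ['Scorer API key', 'Hugging Face Token']
print(required_secret_fields(boolq))  # []
```

A check like this could help a team admin confirm which secrets need to exist before a job is configured. The actual configuration happens in the evaluation job setup described above.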

Knowledge

Evaluate factual knowledge across various domains like science, language, and general reasoning.

| Evaluation | Task ID | OpenAI Scorer / Gated Hugging Face Dataset | Description |
| --- | --- | --- | --- |
| BoolQ | boolq | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | gpqa_diamond | | Graduate-level science questions (highest quality subset) |
| HLE | hle | Yes | Human-level evaluation benchmark |
| Lingoly | lingoly | Yes | Linguistics olympiad problems |
| Lingoly Too | lingoly_too | Yes | Extended linguistics challenge problems |
| MMIU | mmiu | | Massive Multitask Language Understanding benchmark |
| MMLU (0-shot) | mmlu_0_shot | | Massive Multitask Language Understanding without examples |
| MMLU (5-shot) | mmlu_5_shot | | Massive Multitask Language Understanding with 5 examples |
| MMLU-Pro | mmlu_pro | | More challenging version of MMLU |
| ONET M6 | onet_m6 | | Occupational knowledge benchmark |
| PAWS | paws | | Paraphrase adversarial word substitution |
| SevenLLM MCQ (English) | sevenllm_mcq_en | | Multiple choice questions in English |
| SevenLLM MCQ (Chinese) | sevenllm_mcq_zh | | Multiple choice questions in Chinese |
| SevenLLM QA (English) | sevenllm_qa_en | | Question answering in English |
| SevenLLM QA (Chinese) | sevenllm_qa_zh | | Question answering in Chinese |
| SimpleQA | simpleqa | Yes | Straightforward factual question answering |
| SimpleQA Verified | simpleqa_verified | | Verified subset of SimpleQA with validated answers |
| WorldSense | worldsense | | Evaluates understanding of world knowledge and common sense |

Reasoning

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| AGIE AQUA-RAT | agie_aqua_rat | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | agie_logiqa_en | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | agie_lsat_ar | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | agie_lsat_lr | | | LSAT logical reasoning questions |
| ARC Challenge | arc_challenge | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | arc_easy | | | Easier set of science questions from the ARC dataset |
| BBH | bbh | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | coconot | | | Counterfactual commonsense reasoning benchmark |
| CommonsenseQA | commonsense_qa | | | Commonsense reasoning questions |
| HellaSwag | hellaswag | | | Commonsense natural language inference |
| MUSR | musr | | | Multi-step reasoning benchmark |
| PIQA | piqa | | | Physical commonsense reasoning |
| WinoGrande | winogrande | | | Commonsense reasoning via pronoun resolution |

Math

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| AGIE Math | agie_math | | | Advanced mathematical reasoning from AGIE benchmark suite |
| AGIE SAT Math | agie_sat_math | | | SAT mathematics questions |
| AIME 2024 | aime2024 | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | aime2025 | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | gsm8k | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | infinite_bench_math_calc | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | infinite_bench_math_find | | | Finding mathematical patterns in long contexts |
| MATH | math | | | Competition-level mathematics problems |
| MGSM | mgsm | | | Multilingual Grade School Math |

Code

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| BFCL | bfcl | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | infinite_bench_code_debug | | | Long-context code debugging tasks |
| InfiniteBench Code Run | infinite_bench_code_run | | | Long-context code execution prediction |

Reading

Evaluate reading comprehension and information extraction from complex texts.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| AGIE LSAT Reading Comprehension | agie_lsat_rc | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | agie_sat_en | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | agie_sat_en_without_passage | | | SAT English questions without accompanying passages |
| DROP | drop | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | race_h | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | squad | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |

Long context

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| InfiniteBench KV Retrieval | infinite_bench_kv_retrieval | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | infinite_bench_longbook_choice_eng | | | Multiple choice questions on long books |
| InfiniteBench LongDialogue QA (English) | infinite_bench_longdialogue_qa_eng | | | Question answering over long dialogues |
| InfiniteBench Number String | infinite_bench_number_string | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | infinite_bench_passkey | | | Retrieval of information from long context |
| NIAH | niah | | | Needle in a Haystack: long-context retrieval test |

Safety

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.

| Evaluation | Task ID | OpenAI Scorer / Gated HF Dataset | Description |
| --- | --- | --- | --- |
| AgentHarm | agentharm | Yes | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | agentharm_benign | Yes | Benign baseline for AgentHarm to measure false positive rates |
| Agentic Misalignment | agentic_misalignment | | Evaluates potential misalignment in agentic behavior |
| AHB | ahb | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | air_bench | | Tests adversarial instruction resistance |
| BBEH | bbeh | | Bias Benchmark for Evaluating Harmful behavior |
| BBEH Mini | bbeh_mini | | Smaller version of BBEH benchmark |
| BBQ | bbq | | Bias Benchmark for Question Answering |
| BOLD | bold | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | cyse3_visual_prompt_injection | | Tests resistance to visual prompt injection attacks |
| Make Me Pay | make_me_pay | | Tests resistance to financial scam and fraud scenarios |
| MASK | mask | Yes / Yes | Tests model's handling of sensitive information |
| Personality BFI | personality_BFI | | Big Five personality trait assessment |
| Personality TRAIT | personality_TRAIT | Yes | Comprehensive personality trait evaluation |
| SOSBench | sosbench | Yes | Safety and oversight stress test |
| StereoSet | stereoset | | Measures stereotypical biases in language models |
| StrongREJECT | strong_reject | | Tests model's ability to reject harmful requests |
| Sycophancy | sycophancy | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | truthfulqa | | Tests model truthfulness and resistance to falsehoods |
| UCCB | uccb | | Unsafe Content Classification Benchmark |
| WMDP Bio | wmdp_bio | | Tests hazardous knowledge in biology |
| WMDP Chem | wmdp_chem | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | wmdp_cyber | | Tests hazardous knowledge in cybersecurity |
| XSTest | xstest | Yes | Exaggerated safety test for over-refusal detection |

Domain-Specific

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.

| Evaluation | Task ID | OpenAI Scorer / Gated HF Dataset | Description |
| --- | --- | --- | --- |
| ChemBench | chembench | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | healthbench | Yes | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | healthbench_consensus | Yes | Healthcare questions with expert consensus |
| HealthBench Hard | healthbench_hard | Yes | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | lab_bench_cloning_scenarios | | Laboratory experiment planning and cloning |
| LabBench DBQA | lab_bench_dbqa | | Database question answering for lab scenarios |
| LabBench FigQA | lab_bench_figqa | | Figure interpretation in scientific contexts |
| LabBench LitQA | lab_bench_litqa | | Literature-based question answering for research |
| LabBench ProtocolQA | lab_bench_protocolqa | | Laboratory protocol understanding |
| LabBench SeqQA | lab_bench_seqqa | | Biological sequence analysis questions |
| LabBench SuppQA | lab_bench_suppqa | | Supplementary material interpretation |
| LabBench TableQA | lab_bench_tableqa | | Table interpretation in scientific papers |
| MedQA | medqa | | Medical licensing exam questions |
| PubMedQA | pubmedqa | | Biomedical question answering from research abstracts |
| SEC-QA v1 | sec_qa_v1 | | SEC filing question answering |
| SEC-QA v1 (5-shot) | sec_qa_v1_5_shot | | SEC-QA with 5 examples |
| SEC-QA v2 | sec_qa_v2 | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | sec_qa_v2_5_shot | | SEC-QA v2 with 5 examples |

Multimodal

Evaluate vision and language understanding combining visual and textual inputs.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| DocVQA | docvqa | | | Document Visual Question Answering: questions about document images |
| MathVista | mathvista | | | Mathematical reasoning with visual contexts combining vision and math |
| MMMU Multiple Choice | mmmu_multiple_choice | | | Multimodal understanding with multiple choice format |
| MMMU Open | mmmu_open | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | vstar_bench_attribute_recognition | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | vstar_bench_spatial_relationship_reasoning | | | Spatial reasoning with visual inputs |

Instruction Following

Evaluate adherence to specific instructions and formatting requirements.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| IFEval | ifeval | | | Tests precise instruction-following capabilities |

System

Basic system validation and pre-flight checks.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
| --- | --- | --- | --- | --- |
| Pre-Flight | pre_flight | | | Basic system check and validation test |

Link last verified June 7, 2026.
Source: Weights & Biases Docs
Link last verified: 2026-03-04