from runloop_api_client import Runloop

# Authenticate with your Runloop API key
rl = Runloop(bearer_token="YOUR_API_KEY")

# Start the benchmark run
benchmark_run_view = rl.benchmarks.start_run(
    benchmark_id="bmd_2zmp3Mu3LhWu7yDVIfq3m",
)

# Complete the run, passing the run ID returned by start_run
# (not the benchmark ID, which is a different resource)
benchmark_run_view = rl.benchmarks.runs.complete(
    benchmark_run_view.id,
)
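Note that complete takes the ID of the run created by start_run, not the benchmark ID. If you want to check on a run before completing it, here is a minimal sketch, assuming the SDK follows its usual resource conventions with a runs.retrieve method and a state field on the returned view:

# Fetch the current view of an in-flight run
# (runs.retrieve and the state field are assumptions, not confirmed here)
run = rl.benchmarks.runs.retrieve(benchmark_run_view.id)
print(f"run {run.id} is {run.state}")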
Benchmark Your AI Coding Agents

Public Benchmarks

Validate your AI coding agents against industry-standard performance tests with on-demand access to comprehensive benchmark suites. Get standardized metrics, performance tracking, and transparent scoring across all benchmark types.
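For example, you can discover the available suites programmatically before starting a run. A minimal sketch, assuming the SDK exposes Runloop's public-benchmarks listing as list_public and that each entry carries id and name fields:

from runloop_api_client import Runloop

rl = Runloop(bearer_token="YOUR_API_KEY")

# Enumerate the public benchmark suites and pick one to run against
for benchmark in rl.benchmarks.list_public():
    print(benchmark.id, benchmark.name)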

Streamlined AI Evaluation

Everything You Need to Know

Runloop turns complex, resource-intensive AI agent evaluation into an accessible, standardized process with transparent scoring. Our platform integrates seamlessly with existing infrastructure, automatically allocating compute resources and test environments within secure, isolated containers. This reduces both time and cost while supporting iterative improvement cycles for development teams of all sizes.
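Those isolated test environments correspond to Runloop devboxes, which the same SDK exposes directly. A minimal sketch of that isolation model, assuming await_running is the SDK's readiness-polling helper and that execute_sync returns a view with a stdout field:

# Create an isolated devbox, run a command inside it, then shut it down
devbox = rl.devboxes.create()
rl.devboxes.await_running(devbox.id)  # assumed polling helper; waits until ready
result = rl.devboxes.execute_sync(devbox.id, command="echo hello from an isolated container")
print(result.stdout)
rl.devboxes.shutdown(devbox.id)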

Get in Touch

Get Started with Public Benchmarks

Ready to validate your AI coding agents? Contact our team to learn more about Runloop's Public Benchmarks platform and get started with industry-standard testing.
