// BENCHMARKS

Introducing Public Benchmarks By Runloop

Evaluate AI coding agents with precision using Runloop's Public Benchmarks. Our platform offers standardized performance metrics that help developers and researchers assess capabilities across different tasks and domains.

Use Cases

Turn your domain expertise into automated, high-margin AI verification standards across critical industry tasks.

BigCodeBench

Evaluates LLMs on practical and challenging programming tasks with diverse function calls and complex instructions across 139 Python libraries.

Scenarios

1,140

Release Dates

2024

Attributions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim

SWE-Smith

Automated pipeline for generating large-scale software engineering training data, creating synthetic bug-fixing tasks from real codebases.

Scenarios

50,000+ instances

Release Dates

2025

Attributions

John Yang, Kilian Lieret, Carlos E. Jimenez

DS-1000

Data science code generation benchmark with 1,000 problems spanning seven Python libraries, including NumPy, Pandas, and Matplotlib.

Scenarios

1,000

Release Dates

2022

Attributions

Yuhang Lai, Chengxi Li, Yiming Wang

MarsCode Agent (ByteDance)

AI-native automated bug fixing agent that achieves state-of-the-art performance on SWE-bench, demonstrating advanced software engineering capabilities.

Scenarios

Evaluated on SWE-bench (39.33% success rate)

Release Dates

2024

Attributions

Yuntong Liu, Peng Gao, Xinyu Wang

SWE-bench

Evaluates AI agents' ability to solve real-world GitHub issues by producing code edits as patch files. Uses authentic software engineering problems from popular open-source repositories.

Scenarios

2,294 (SWE-bench Full) / 500 (SWE-bench Verified)

Release Dates

2023

Attributions

Carlos E. Jimenez, John Yang, Alexander Wettig
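
As a rough illustration of how a patch-based task like this might be scored, the sketch below treats a task as resolved only when the tests that exercised the reported issue now pass and the previously passing tests still pass. The function name, argument names, and test identifiers are hypothetical and chosen for illustration; this is not the official SWE-bench harness.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True if every required test passed after the patch was applied.

    test_results maps a test name to whether it passed post-patch.
    fail_to_pass lists tests expected to flip from failing to passing.
    pass_to_pass lists tests that must keep passing (no regressions).
    All three names are assumptions made for this sketch.
    """
    required = fail_to_pass + pass_to_pass
    # A test missing from the results is treated as a failure.
    return all(test_results.get(name, False) for name in required)


# Hypothetical post-patch results for two candidate patches.
fixed = {"test_issue": True, "test_existing": True}
regressed = {"test_issue": True, "test_existing": False}

outcomes = [
    is_resolved(r, ["test_issue"], ["test_existing"]) for r in (fixed, regressed)
]
resolved_rate = sum(outcomes) / len(outcomes)
```

Here the second patch fixes the issue but breaks an existing test, so it does not count as resolved.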

SWE-bench Verified

Human-validated subset of SWE-bench with 500 carefully verified samples, providing more reliable evaluation of AI models' software engineering capabilities.

Scenarios

500

Release Dates

August 2024

Attributions

Carlos E. Jimenez, John Yang, Alexander Wettig (original SWE-bench authors), OpenAI Preparedness Team

Multi-SWE-bench

First multilingual code fix benchmark covering seven programming languages, designed to evaluate large models' self-debugging and code repair capabilities across diverse codebases.

Scenarios

1,632 total across languages

Release Dates

April 2025

Attributions

Daoguang Zan, Zhirong Huang, Wei Liu

OpenAI HumanEval

The original OpenAI benchmark for evaluating large language models trained on code, featuring carefully crafted evaluation sets that measure functional correctness.

Scenarios

164

Release Dates

2021

Attributions

Mark Chen, Jerry Tworek, Heewoo Jun
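
To make "functional correctness" concrete, here is a minimal sketch of one way to check it: execute a generated solution and run assertion-based tests against the function it defines. The helper `passes_tests`, the `check_add` test, and the sample solutions are hypothetical and are not taken from the HumanEval dataset or its harness. Real evaluations should run untrusted code in an isolated sandbox.

```python
from typing import Callable


def passes_tests(candidate_src: str, entry_point: str, check: Callable) -> bool:
    """Return True if the candidate defines entry_point and check() raises nothing.

    WARNING: exec runs arbitrary code. This sketch is only safe for trusted
    input; a real harness would isolate execution and enforce timeouts.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        check(namespace[entry_point])    # run assertion-based tests
        return True
    except Exception:                    # includes AssertionError and KeyError
        return False


def check_add(fn: Callable) -> None:
    # Hypothetical unit tests for a toy "add" problem.
    assert fn(1, 2) == 3
    assert fn(-1, 1) == 0


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [passes_tests(src, "add", check_add) for src in (good, bad)]
pass_rate = sum(results) / len(results)
```

A solution counts as correct only if every assertion holds, so the score reflects behavior rather than textual similarity to a reference answer.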