Benchmarks
July 24, 2025

HumanEval: When Machines Learned to Code

Abigail Wall

TL;DR: Released in July 2021 by OpenAI [Chen et al., 2021], HumanEval marked the beginning of systematic code generation evaluation. With 164 hand-crafted Python problems, it introduced the pass@k metric and established the foundation for measuring AI coding capabilities. What started as a simple benchmark for Codex has become the gold standard that every coding model must face—and the results tell a remarkable story of AI's rapid evolution from 0% to 96% accuracy in roughly three and a half years.

Introduction

In the summer of 2021, something extraordinary was happening in the labs of OpenAI. Mark Chen and his dozens of collaborators were putting the finishing touches on a paper that would fundamentally change how we think about artificial intelligence and programming [Chen et al., 2021]. They had trained a model called Codex—a descendant of GPT-3 fine-tuned on billions of lines of code from GitHub—and they needed a way to measure its capabilities.

The problem was deceptively simple: how do you evaluate whether an AI can actually code? Previous attempts at measuring programming ability had been ad hoc, inconsistent, or focused on narrow domains. Chen and his colleagues needed something different—a benchmark that could capture the essence of programming: taking a natural language description and producing working code.

Their solution was HumanEval, and it would become the most influential benchmark in the history of AI code generation.

The name itself was telling. "HumanEval" suggested that these were problems a human programmer could reasonably solve—not esoteric algorithmic puzzles or academic exercises, but the kind of practical programming tasks that working developers encounter daily. Each problem came with a function signature, a docstring explaining the task, and a set of hidden unit tests to validate correctness.

But here's what made HumanEval revolutionary: it wasn't just about whether the AI could produce syntactically correct code. It was about functional correctness—did the code actually work? This distinction would prove crucial as the field evolved.

Background and Methodology

The story of HumanEval begins with a recognition that would reshape AI evaluation forever. In early 2021, large language models were showing remarkable capabilities across many domains, but code generation remained largely unmeasured territory. Sure, models could complete simple programming tasks, but how well? And compared to what standard?

Chen and his team faced a fundamental challenge: creating a benchmark that was both rigorous and realistic. They needed problems that were challenging enough to differentiate between models, yet solvable enough that success was meaningful. Too easy, and every model would achieve perfect scores. Too hard, and the benchmark would become an exercise in frustration.

Their approach was methodical and human-centered. The team hand-crafted 164 Python programming problems, each designed to test a specific aspect of programming competence [Chen et al., 2021]. These weren't random coding challenges pulled from competitive programming sites. They were carefully constructed to assess language comprehension, algorithmic thinking, and practical programming skills.

Each problem followed a consistent structure that would become the template for countless future benchmarks. A function signature provided the interface—the name, parameters, and return type. A docstring explained what the function should do, often including examples of expected behavior. Hidden unit tests verified that the implementation actually worked correctly.

Consider this example from the benchmark:


from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in a given list of numbers, any two numbers are closer to each other than a given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Sort the list to bring potentially close elements next to each other.
    numbers.sort()

    # Iterate through the sorted list and check the difference between adjacent elements.
    for i in range(len(numbers) - 1):
        if numbers[i+1] - numbers[i] < threshold:
            return True

    return False

The elegance was in the simplicity. The problem was clearly stated, the expected behavior was demonstrated through examples, and the solution required both understanding the natural language description and implementing the logic correctly.

But HumanEval's true innovation lay in its evaluation methodology. The team introduced the pass@k metric, which would become the standard way to measure code generation performance [Chen et al., 2021]. Pass@1 measured the probability that a single generated solution would be correct. Pass@10 and pass@100 measured the probability that at least one solution among 10 or 100 attempts would be correct.

This approach acknowledged a crucial insight about how programmers actually work. Rarely does a human programmer get code exactly right on the first try. We iterate, debug, and refine. The pass@k metric captured this reality, recognizing that generating multiple candidate solutions and selecting the best one was a valid and valuable capability.
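
Concretely, Chen et al. (2021) estimate pass@k with an unbiased estimator rather than by literally drawing k samples: generate n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch that mirrors the numerically stable form given in the paper (the example numbers are purely illustrative):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.
    n: samples generated, c: samples that passed, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), written as a numerically stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples generated for one problem, 58 of them pass.
print(round(pass_at_k(200, 58, 1), 2))    # 0.29 -- expected pass@1 for this problem
print(round(pass_at_k(200, 58, 100), 2))  # 1.0  -- pass@100 is effectively certain here

The reported benchmark score is simply this per-problem quantity averaged over all 164 problems.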

The evaluation process itself was designed with security in mind. The OpenAI team was acutely aware that they were asking models to generate code that would then be executed. Their evaluation harness included sandboxing and safety measures, though they deliberately commented out the execution calls in the public release to ensure users understood the security implications [OpenAI HumanEval GitHub].
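
In practice, evaluation with the released harness is a two-step loop: write one or more completions per task to a JSONL file, then score that file with the provided command. The sketch below roughly follows the repository's documented workflow; generate_one_completion is a placeholder for whatever model call you use, and the exact interface may have drifted since the original release:

from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model here and return only the generated function body.
    raise NotImplementedError

problems = read_problems()  # 164 tasks keyed by task_id, each with a "prompt" field
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(1)  # one sample per task is enough for pass@1
]
write_jsonl("samples.jsonl", samples)

# Scoring then runs from the shell, after re-enabling execution in the sandboxed harness:
#   $ evaluate_functional_correctness samples.jsonl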

The benchmark's 164 problems were carefully chosen to cover a broad range of programming concepts without being exhaustive. String manipulation, list processing, mathematical calculations, algorithmic thinking—all were represented, but the focus remained on fundamental programming skills rather than specialized knowledge or complex algorithms.

What made HumanEval particularly powerful was its focus on functional correctness rather than stylistic preferences. The benchmark didn't care whether the generated code was elegant, efficient, or followed best practices. It only cared whether the code worked. This pragmatic approach reflected the reality that working code, even if imperfect, was infinitely more valuable than beautiful code that didn't function correctly.

Current State-of-the-Art Results

The numbers tell a story of breathtaking progress. When HumanEval was first released in July 2021, Codex achieved 28.8% pass@1—a score that seemed impressive at the time [Chen et al., 2021]. GPT-3, the foundation model, managed 0%. GPT-J, a contemporary alternative, reached 11.4%. These were the early days of code generation, when getting even a quarter of problems right felt like a breakthrough.

Fast forward to early 2025, and the landscape has transformed beyond recognition. OpenAI's o1-preview and o1-mini models both achieve 96.3% pass@1 on the original HumanEval benchmark [EvalPlus Leaderboard]. This represents a 234% improvement over the original Codex results—a rate of progress that would have seemed fantastical just a few years ago.

But the story becomes more nuanced when we examine the EvalPlus results, which add additional test cases to make the evaluation more rigorous. Here, the same o1 models achieve 89% pass@1 [EvalPlus Leaderboard]. The gap between base tests and EvalPlus tests reveals something crucial about the current state of AI code generation: models have become remarkably good at solving the specific problems they're trained on, but robustness—the ability to handle edge cases and variations—remains a challenge.

The current leaderboard reads like a who's who of AI research. Qwen2.5-Coder-32B-Instruct and GPT-4o are tied at 87.2% on EvalPlus tests [EvalPlus Leaderboard]. DeepSeek-V3 and GPT-4-Turbo both reach 86.6%, Claude Sonnet 3.5 achieves 81.7%, and various other models cluster in the 70-80% range.

These scores represent more than just numbers—they reflect fundamental advances in how AI systems understand and generate code. The models achieving 90%+ accuracy on base HumanEval tests are demonstrating capabilities that would have been considered science fiction when the benchmark was first released.

Yet the performance patterns reveal interesting insights about the nature of current AI capabilities. The gap between different models is often smaller than might be expected. The difference between the top performer (96.3%) and the tenth-ranked model (around 85%) is significant but not enormous. This suggests that the field may be approaching certain fundamental limits with current architectures and training approaches.

More telling is the consistent gap between base tests and EvalPlus tests across all models. Even the most advanced systems show a 7-8 percentage point drop when faced with more rigorous evaluation. This pattern suggests that current models excel at pattern matching and reproducing solutions to familiar problems but struggle when faced with variations that require deeper understanding or more robust reasoning.

The evolution of performance over time tells its own story. In 2021, achieving 30% pass@1 was noteworthy. By 2022, models were reaching 50-60%. By 2023, the 70-80% range became common among top models. And by 2024-2025, the 90%+ scores that once seemed impossible became routine among leading systems.

This trajectory raises profound questions about the nature of programming and AI capabilities. When machines can solve 96% of carefully crafted programming problems, what does that say about the complexity of programming itself? Are we approaching the limits of what can be measured by current benchmarks, or are we simply seeing the natural progression of increasingly capable AI systems?

The answer may lie in the details of how these high scores are achieved. Current top models don't just generate code—they engage in sophisticated reasoning about problem requirements, consider multiple solution approaches, and often produce code that is not just correct but well-structured and efficient. This represents a qualitative shift from the early days of code generation, when success meant producing any working solution, regardless of quality.

Notable Recent Developments

HumanEval's influence extends far beyond its original 164 problems. In the three and a half years since its release, it has spawned an entire ecosystem of related benchmarks, evaluation frameworks, and research directions that have fundamentally shaped how we think about AI code generation.

The most significant development has been the emergence of EvalPlus, which takes HumanEval's core concept and makes it more rigorous [EvalPlus Leaderboard]. Recognizing that models were achieving suspiciously high scores on the original benchmark, researchers began adding additional test cases to expose edge cases and corner conditions that the original tests missed. The result is a more challenging evaluation that better reflects real-world programming requirements.

EvalPlus reveals the gap between solving specific problems and robust programming capability. While leading models achieve 96%+ on base HumanEval tests, scores drop to the high 80s when faced with more comprehensive testing. This isn't a failure of the models—it's a more accurate reflection of the complexity of real-world programming, where edge cases and unexpected inputs are the norm rather than the exception.
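
To make that distinction concrete, consider the has_close_elements example from earlier. Extra edge-case checks in the EvalPlus spirit might look like the following (illustrative cases written for this article, not drawn from the actual EvalPlus suite):

# Hypothetical extra checks in the EvalPlus spirit -- not the real EvalPlus test data.
assert has_close_elements([], 0.5) is False                     # empty input
assert has_close_elements([1.0], 0.5) is False                  # single element
assert has_close_elements([1.0, 1.0], 0.0) is False             # strict '<' comparison at the threshold
assert has_close_elements([-2.0, -2.4], 0.5) is True            # negative values
assert has_close_elements([5.0, 1.0, 3.01, 3.0], 0.05) is True  # close pair only visible after sorting

A solution that, say, compared adjacent values with <= instead of < would still pass the original docstring examples but fail the threshold check above.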

The benchmark has also inspired numerous domain-specific variations. HumanEval-X extends the evaluation to multiple programming languages, testing whether models can generalize their coding abilities beyond Python [Zheng et al., 2023]. HumanEval+ adds more test cases to the original problems. HumanEval-V introduces visual elements, testing whether models can generate code based on visual inputs.

Perhaps most importantly, HumanEval established the methodological foundation that virtually every subsequent code generation benchmark has followed. The combination of natural language descriptions, function signatures, and automated testing has become the standard approach. The pass@k metric is now ubiquitous in code generation research.

The benchmark's influence on model development has been equally profound. The pursuit of higher HumanEval scores has driven innovations in model architecture, training techniques, and evaluation methods. Companies routinely cite HumanEval performance in their model releases, and improvements on the benchmark are seen as indicators of general progress in AI capabilities.

But this success has also created new challenges. As models have achieved near-perfect scores on the original benchmark, researchers have had to develop increasingly sophisticated evaluation methods to differentiate between systems. The risk of "teaching to the test" has become real, with models potentially optimized for HumanEval performance rather than general programming ability.

The contamination problem has emerged as a particular concern. With HumanEval problems widely available and discussed, there's a risk that models trained on large code datasets may have encountered the problems during training. This has led to the development of contamination detection methods and the creation of new benchmarks with problems that post-date model training cutoffs.

Recent research has also revealed interesting patterns in how models approach HumanEval problems. Analysis of model-generated solutions shows that top-performing systems often produce code that is not just correct but follows good programming practices. They use appropriate variable names, include helpful comments, and structure their solutions in ways that human programmers would recognize as well-crafted.

This evolution reflects a broader shift in the field. Early code generation models were primarily focused on producing any working solution. Current models demonstrate something closer to programming expertise—they understand not just what to code, but how to code well. This represents a qualitative advance that goes beyond simple performance metrics.

The benchmark has also influenced how we think about AI safety and reliability in code generation. The original HumanEval paper included extensive discussion of the security implications of executing AI-generated code [Chen et al., 2021]. This awareness has shaped subsequent research and development, leading to better sandboxing techniques and more careful consideration of the risks associated with automated code generation.

Looking at the broader research landscape, HumanEval's impact can be seen in the proliferation of code-focused AI research. The benchmark provided a clear, measurable target that allowed researchers to track progress and compare approaches. This has accelerated development in ways that would have been impossible without a standardized evaluation framework.

The benchmark has also influenced industry adoption of AI coding tools. The clear performance metrics provided by HumanEval have helped organizations understand the capabilities and limitations of different AI coding assistants. This has informed decisions about which tools to adopt and how to integrate them into development workflows.

Technical Analysis

The patterns emerging from HumanEval performance data reveal fundamental insights about the current state and future trajectory of AI code generation. When we examine not just the headline scores but the underlying patterns of success and failure, we gain a deeper understanding of what these models can and cannot do.

The most striking pattern is the consistency of the performance hierarchy across different evaluation frameworks. Models that perform well on base HumanEval tests also tend to perform well on EvalPlus tests, though with a consistent performance drop. This suggests that the underlying capabilities being measured are real and transferable, not just artifacts of overfitting to specific test cases.

The magnitude of the performance drop from base tests to EvalPlus tests is particularly revealing. The 7-8 percentage point decrease seen across top models indicates a systematic limitation in current approaches. Models excel at solving the specific problems they encounter during training but struggle with variations that require more robust reasoning or handling of edge cases.

This pattern points to a fundamental characteristic of current AI systems: they are powerful pattern matchers that can generalize within familiar domains but have difficulty with true out-of-distribution reasoning. When faced with HumanEval problems that closely match their training distribution, they perform exceptionally well. When faced with variations that require deeper understanding or novel reasoning, performance degrades.

The evolution of scores over time reveals another crucial insight. The rapid improvement from 28.8% to 96.3% over three years represents more than just incremental progress—it suggests that code generation may be particularly amenable to current AI architectures and training methods. The combination of large-scale pre-training on code repositories and instruction tuning appears to be highly effective for this domain.

However, the plateauing effect visible in recent results suggests that we may be approaching certain limits with current approaches. The gap between the top-performing models has narrowed significantly, with multiple systems achieving scores in the 85-95% range. This clustering suggests that further improvements may require fundamentally different approaches rather than incremental refinements.

The pass@k results provide additional insights into model behavior. The substantial improvement from pass@1 to pass@100 scores in the original Codex results (28.8% to 70.2%) demonstrated that models could generate correct solutions but struggled with consistency [Chen et al., 2021]. Current models show much smaller gaps between pass@1 and pass@k scores, indicating improved reliability and consistency in their outputs.

Analysis of the types of problems where models succeed and fail reveals interesting patterns. Models tend to excel at problems involving standard algorithms, common programming patterns, and well-defined mathematical operations. They struggle more with problems requiring creative problem-solving, complex logical reasoning, or handling of ambiguous specifications.

The language-specific aspects of HumanEval performance also provide insights. While the benchmark focuses on Python, the principles it tests—algorithmic thinking, logical reasoning, and code structure—are largely language-agnostic. The success of models on HumanEval suggests that they have developed genuine programming competence rather than just Python-specific pattern matching.

The relationship between model size and HumanEval performance has evolved over time. Early results showed clear scaling benefits, with larger models consistently outperforming smaller ones. Recent results show a more complex picture, with some smaller, specialized models achieving competitive performance with much larger general-purpose systems. This suggests that architectural innovations and training techniques may be as important as raw scale.

The contamination question adds another layer of complexity to the analysis. While some models may have encountered HumanEval problems during training, the consistency of performance patterns across different evaluation frameworks suggests that the measured capabilities are largely genuine. Models that achieve high HumanEval scores also tend to perform well on other code generation benchmarks, indicating transferable skills rather than memorization.

Perhaps most importantly, the HumanEval results provide a window into the broader question of AI capabilities and limitations. The benchmark measures a specific type of intelligence—the ability to understand natural language descriptions and translate them into working code. The dramatic improvements in this area suggest that current AI systems are developing genuine competence in this domain.

However, the limitations revealed by more rigorous testing remind us that current systems, despite their impressive capabilities, still fall short of human-level programming expertise. Human programmers routinely handle edge cases, ambiguous requirements, and novel problem variations that continue to challenge even the most advanced AI systems.

Challenges and Limitations

Despite its foundational importance and widespread adoption, HumanEval faces significant challenges that reflect broader issues in AI evaluation and development. These limitations have become more apparent as models have achieved increasingly high scores, revealing the gap between benchmark performance and real-world programming capability.

The most pressing challenge is the contamination problem. With HumanEval problems widely available since 2021, there's a significant risk that models trained on large code datasets have encountered these problems during training [Chen et al., 2021]. This creates a fundamental evaluation challenge: are we measuring genuine programming ability or sophisticated memorization?

The contamination issue is particularly acute because the benchmark's problems are relatively small in number and have been extensively discussed in research papers, blog posts, and online forums. Unlike benchmarks with thousands or millions of examples, HumanEval's 164 problems are few enough that a model could potentially memorize all of them without developing genuine programming skills.

Researchers have attempted to address this through contamination detection methods, but these approaches have their own limitations. Exact string matching can miss paraphrased or slightly modified versions of problems. More sophisticated detection methods may flag legitimate generalizations as contamination. The result is an ongoing uncertainty about the validity of reported scores.
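
To see why exact matching is fragile, consider a bare-bones overlap detector that compares word n-grams between a training document and a benchmark prompt, as in the sketch below (our own simplification for illustration, not any specific published detector):

def ngram_overlap(document: str, benchmark_prompt: str, n: int = 8) -> float:
    """Fraction of the prompt's word n-grams that also occur in the document."""
    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    prompt_grams = ngrams(benchmark_prompt)
    if not prompt_grams:
        return 0.0
    return len(prompt_grams & ngrams(document)) / len(prompt_grams)

# A lightly paraphrased or reformatted copy of a HumanEval prompt can score near
# zero here even though the underlying problem is identical, which is exactly the
# weakness in string-matching approaches described above.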

The benchmark's scope presents another significant limitation. HumanEval focuses on relatively simple, self-contained programming problems that can be solved in a few lines of code. Real-world programming involves much more complex challenges: understanding large codebases, debugging existing code, handling incomplete or ambiguous requirements, and integrating multiple systems.

The problems in HumanEval are also artificially clean. They come with clear specifications, well-defined inputs and outputs, and comprehensive test cases. Real programming often involves working with incomplete information, changing requirements, and legacy systems with undocumented behavior. The benchmark's controlled environment, while necessary for consistent evaluation, doesn't capture these messy realities.

The evaluation methodology itself has limitations. The pass@k metric, while useful, reduces the complex question of code quality to a binary pass/fail decision. It doesn't account for code readability, maintainability, efficiency, or security. A solution that barely passes the tests is scored the same as an elegant, efficient implementation.

This binary evaluation approach misses important aspects of programming expertise. Human programmers are evaluated not just on whether their code works, but on how well it works, how easy it is to understand and maintain, and how well it fits into larger systems. HumanEval's focus on functional correctness, while important, captures only one dimension of programming competence.

The benchmark's Python-centric focus, while understandable, limits its generalizability. Programming languages have different paradigms, idioms, and best practices. Success on Python problems doesn't necessarily translate to competence in other languages or programming paradigms. While HumanEval-X has addressed this to some extent, the core benchmark remains tied to Python's specific characteristics.

The static nature of the benchmark presents ongoing challenges. Unlike dynamic evaluation environments that can generate new problems or adapt to model capabilities, HumanEval's fixed set of problems becomes less useful as models achieve near-perfect scores. This has led to the development of EvalPlus and other enhanced versions, but these are essentially patches rather than fundamental solutions.

The benchmark also struggles with the diversity of programming styles and approaches. Many HumanEval problems have multiple valid solutions, but the evaluation framework may not capture this diversity. A model might generate a correct but unconventional solution that reveals genuine programming insight, but this creativity isn't reflected in the binary pass/fail scoring.

Security considerations add another layer of complexity. The original HumanEval implementation deliberately commented out code execution to prevent security risks [OpenAI HumanEval GitHub]. While necessary for safety, this creates a barrier to widespread adoption and may lead to inconsistent evaluation practices across different research groups.

The benchmark's influence on model development has created its own problems. The focus on HumanEval performance may be driving optimization for benchmark-specific capabilities rather than general programming competence. This "teaching to the test" phenomenon could result in models that excel at HumanEval-style problems but struggle with other programming challenges.

The rapid improvement in HumanEval scores has also created a measurement ceiling effect. When multiple models achieve 95%+ accuracy, the benchmark loses its ability to differentiate between systems. This has forced researchers to develop more challenging evaluation frameworks, but it also raises questions about whether the original benchmark has outlived its usefulness.

Perhaps most fundamentally, HumanEval measures only a narrow slice of what it means to be a competent programmer. It doesn't test debugging skills, code review capabilities, system design thinking, or the ability to work with existing codebases. These limitations become more apparent as AI systems are deployed in real-world programming contexts where these broader skills are essential.

The benchmark's treatment of edge cases and error handling is also limited. While EvalPlus has added more test cases, the fundamental approach of testing against predetermined inputs and outputs doesn't capture the full complexity of robust software development. Real programming requires anticipating and handling unexpected conditions that may not be covered by any test suite.

Future Outlook

Midway through 2025, HumanEval's legacy is secure, but its future role in AI evaluation is evolving. The benchmark that once seemed impossibly challenging now serves as a baseline competency test—a necessary but not sufficient measure of AI coding capabilities. The question is no longer whether models can solve HumanEval problems, but what comes next in the evaluation of AI programming abilities.

The trajectory toward near-perfect HumanEval scores suggests that the field is ready for more sophisticated evaluation frameworks. We're already seeing this evolution in benchmarks like SWE-bench, which tests models on real-world software engineering tasks, and BigCodeBench, which evaluates complex, multi-step programming challenges. These newer benchmarks acknowledge that programming competence extends far beyond solving isolated algorithmic problems.

The future of code generation evaluation will likely embrace dynamic, adaptive testing. Instead of fixed problem sets that models can potentially memorize, we'll see systems that generate novel problems on demand, test edge cases systematically, and adapt to model capabilities in real-time. This approach could address many of HumanEval's current limitations while preserving its core insights about measuring functional correctness.

Multi-modal evaluation represents another frontier. As AI systems become capable of processing visual inputs, understanding diagrams, and working with multimedia content, code generation benchmarks will need to evolve accordingly. HumanEval-V has begun this exploration, but we can expect much more sophisticated multi-modal programming challenges in the coming years.

The integration of real-world software development workflows into evaluation frameworks promises to make benchmarks more relevant to practical applications. Future evaluations might test models' abilities to understand existing codebases, participate in code reviews, debug complex systems, and collaborate with human developers. These capabilities are essential for AI systems to be truly useful in professional software development contexts.

The contamination problem will likely drive the development of new evaluation methodologies. We may see the emergence of private evaluation frameworks, similar to those used in competitive programming, where problems are kept secret until evaluation time. Alternatively, we might develop techniques for generating infinite variations of core programming challenges, making memorization impossible while preserving the essential skills being tested.

The evolution of programming languages and paradigms will also influence future evaluation frameworks. As new languages emerge and programming practices evolve, benchmarks will need to adapt to remain relevant. The rise of domain-specific languages, low-code platforms, and AI-assisted development tools will create new categories of programming competence that current benchmarks don't address.

Security and safety considerations will become increasingly important as AI-generated code becomes more prevalent in production systems. Future benchmarks may need to evaluate not just functional correctness but also security properties, performance characteristics, and maintainability. This holistic approach to code quality assessment will require more sophisticated evaluation frameworks than current pass/fail metrics can provide.

The democratization of AI development suggests that future evaluation frameworks will need to be more accessible and interpretable. While HumanEval requires significant technical expertise to implement and interpret, future benchmarks may need to provide clearer insights for non-technical stakeholders who need to understand AI capabilities and limitations.

The relationship between benchmark performance and real-world utility will continue to be a central concern. As models achieve superhuman performance on narrow benchmarks, the challenge will be ensuring that these capabilities translate to practical benefits. This may require the development of evaluation frameworks that more closely mirror actual software development workflows and challenges.

Looking further ahead, we may see the emergence of collaborative evaluation frameworks where AI systems are tested not just on their individual capabilities but on their ability to work effectively with human developers and other AI systems. This collaborative dimension of programming competence is largely unmeasured by current benchmarks but will be crucial for the successful integration of AI into software development teams.

The future may also bring more personalized evaluation approaches. Instead of one-size-fits-all benchmarks, we might develop evaluation frameworks that adapt to specific use cases, programming domains, or organizational needs. A model intended for web development might be evaluated differently from one designed for scientific computing or embedded systems programming.

Despite these future developments, HumanEval's core contributions will remain relevant. The benchmark established fundamental principles—the importance of functional correctness, the value of standardized evaluation, and the need for rigorous testing—that will continue to guide the development of future evaluation frameworks. Its influence on the field extends far beyond its specific problems and metrics.

The benchmark's greatest legacy may be its demonstration that systematic evaluation can accelerate progress in AI capabilities. By providing a clear, measurable target, HumanEval enabled researchers to track progress, compare approaches, and identify areas for improvement. This methodological contribution will continue to influence how we evaluate and develop AI systems across many domains.

As we look toward a future where AI systems become increasingly capable programming partners, HumanEval serves as both a milestone marking how far we've come and a foundation for the more sophisticated evaluation frameworks we'll need to guide continued progress. The journey from 0% to 96% accuracy on HumanEval problems represents just the beginning of AI's transformation of software development.

References

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

EvalPlus Leaderboard. (2024). EvalPlus evaluates AI Coders with rigorous tests. https://evalplus.github.io/leaderboard.html

OpenAI HumanEval GitHub Repository. (2021). Code for the paper "Evaluating Large Language Models Trained on Code". https://github.com/openai/human-eval

Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Wang, Z., Shen, L., Wang, A., Li, Y., Su, T., Yang, Z., & Tang, J. (2023). CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568. https://arxiv.org/abs/2303.17568
