
The Evaluation Crisis: Why AI Benchmarks Are the New Bottleneck (and How to Solve It)

September 9, 2025

At the recent Databricks Data + AI Summit in San Francisco, CEO Ali Ghodsi painted a picture of an industry in rapid transformation. Speaking to CNBC about the bottlenecks to artificial intelligence (AI) adoption that CEOs and CIOs aren’t discussing enough, Ghodsi pointed to a well-known industry challenge: evaluation. His concern cuts to the heart of the problem facing every organization rushing to deploy AI agents: how do you trust a system you can’t properly measure?

The Speed of Autonomy vs. the Need for Accountability

The shift toward agentic AI represents more than an incremental improvement – it’s a fundamental change in how technology infrastructure is built and managed. We’re witnessing the birth of truly autonomous systems that create, modify and manage themselves with minimal human intervention.

But autonomy without accountability is a recipe for disaster. “It doesn’t matter if an agent can ace a programming contest,” Ghodsi emphasized. “We want it to do a specific job at the company. But how do we know how it’s doing?”

This evaluation gap represents more than a technical challenge – it’s an existential threat to AI adoption. Organizations are caught between competitive pressure to deploy AI agents and the risk of unleashing systems they can’t properly monitor or control.

The Infrastructure-Evaluation Disconnect

During his summit presentation and CNBC interview, Ghodsi highlighted how companies like Databricks have heavily invested in building the infrastructure for agentic AI, including their recent $1 billion acquisition of Neon to enable serverless, agent-driven database creation. Yet the evaluation ecosystem has lagged dangerously behind.

The result is a growing disconnect: We can build systems that operate at machine speed, but we’re still using human-speed evaluation methods to understand what they’re doing. It’s like trying to referee a Formula 1 race on foot.

This bottleneck isn’t just theoretical. Organizations are already feeling the impact across multiple fronts. Development teams are reluctant to fully embrace AI agents because they can’t demonstrate ROI or reliability to stakeholders. At a recent Microsoft Build event, Microsoft’s AI “junior dev” agent often failed to fix build errors, even after 11 prompts from four engineers. Developers publicly flagged concerns like “Is the ROI even worth it?” and noted that current agents still need extensive human oversight, negating the promised efficiency gains.

Enterprise leaders are hesitating to approve AI initiatives when they can’t quantify success or failure. An Informatica survey found that about two-thirds of companies are still in generative AI pilots, with 97% struggling to show business value from their AI initiatives. This uncertainty breeds caution, leaving organizations stuck in proof-of-concept phases rather than moving toward production deployment.

Compliance and risk teams are blocking deployments because of the inability to audit or trust AI agent behavior. Reuters reports that as enterprises deploy autonomous AI agents, compliance teams flag issues like privacy violations, legal missteps, biased or false outputs, and accountability gaps. Without proper evaluation frameworks, these teams have no choice but to err on the side of caution, effectively stalling AI adoption across entire organizations.

A Path Forward: The Platform Approach

The solution lies in treating benchmarking not as a separate process but as an integrated part of agentic deployment. Just as cloud computing democratized access to enterprise-grade infrastructure, cloud-based benchmarking democratizes access to enterprise-grade evaluation.

The industry needs to move away from treating benchmarking as an afterthought – something bolted on at the end of development cycles – and instead, integrate it as a core deployment process from the outset. This mirrors the transformation we saw with cloud computing, which took enterprise-grade infrastructure that was once accessible only to tech giants and made it available to any developer with a laptop and credit card.

Now the same democratization is happening with AI evaluation: comprehensive benchmarking capabilities that previously required dedicated teams and months of development are becoming accessible through platforms like Runloop.

This platform-first approach tackles the practical challenges that have historically made robust coding agent evaluation prohibitively expensive and time-consuming. Rather than teams spending months building custom evaluation infrastructure, cloud-native development environments eliminate the technical barriers to testing AI coding performance. The platform standardizes metrics across different coding scenarios – from bug fixes to feature implementation – creating a common language for discussing agent performance that works whether you’re building specialized debugging tools or general-purpose coding assistants. Teams gain instant access to comprehensive test suites, including industry-standard public benchmarks like SWE-bench, along with the ability to create custom scenarios that reflect their specific production requirements.
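
As a rough sketch of what that common language could look like in code (the task loader, graders, and run_suite helper below are hypothetical illustrations, not Runloop’s actual API), a harness can treat public benchmarks and custom scenarios as the same kind of object and score them against one schema:

```python
# Hypothetical sketch only: the names below are illustrative, not any vendor's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    """One coding scenario: a repo snapshot, a task prompt, and a grader."""
    name: str
    repo_url: str
    prompt: str
    grade: Callable[[str], float]  # maps the agent's patch to a score in [0, 1]


def load_public_tasks() -> list[EvalTask]:
    """Stand-in for pulling SWE-bench-style tasks from a shared registry."""
    return [
        EvalTask(
            name="swe-bench-style/bugfix-001",
            repo_url="https://github.com/example/project",
            prompt="Fix the failing test in tests/test_parser.py.",
            grade=lambda patch: 1.0 if "def parse" in patch else 0.0,  # toy grader
        ),
    ]


def load_custom_tasks() -> list[EvalTask]:
    """Team-specific scenarios that mirror real production requirements."""
    return [
        EvalTask(
            name="internal/payment-retry",
            repo_url="https://git.example.com/payments",
            prompt="Add exponential backoff to the retry loop.",
            grade=lambda patch: 1.0 if "backoff" in patch else 0.0,  # toy grader
        ),
    ]


def run_suite(agent: Callable[[EvalTask], str]) -> dict[str, float]:
    """Run every task through the agent and report one score per task name."""
    results: dict[str, float] = {}
    for task in load_public_tasks() + load_custom_tasks():
        patch = agent(task)  # the agent returns a candidate patch
        results[task.name] = task.grade(patch)
    return results
```

Because bug fixes, feature work, and debugging scenarios all reduce to the same task-plus-grader shape, the resulting scores can be compared across agents and over time.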

Most critically, a platform approach enables the shift from episodic testing to continuous monitoring of coding agents in production environments. Instead of running evaluations only during development or when performance issues surface, teams can track their AI systems’ coding capabilities in real time: catching regressions in code quality, identifying edge cases in their problem-solving approaches, and understanding how model updates affect performance across different programming languages and complexity levels. This continuous evaluation capability becomes essential as coding agents move from experimental tools to mission-critical components of software development workflows, where maintaining consistent performance and catching issues before they reach end users can mean the difference between seamless automation and costly production failures.
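
Continuing the hypothetical sketch above (the baseline file, margin, and helper names are assumptions for illustration), continuous monitoring can start as a scheduled job that reruns the suite and diffs each task’s score against a stored baseline:

```python
# Hypothetical sketch only: pairs with the run_suite harness sketched earlier.
import json
from pathlib import Path

BASELINE_PATH = Path("eval_baseline.json")
REGRESSION_MARGIN = 0.05  # flag any task whose score drops by more than 0.05


def check_for_regressions(results: dict[str, float]) -> list[str]:
    """Compare a fresh evaluation run against the last known-good baseline."""
    baseline = (
        json.loads(BASELINE_PATH.read_text()) if BASELINE_PATH.exists() else {}
    )
    regressions = [
        f"{task}: {baseline[task]:.2f} -> {score:.2f}"
        for task, score in results.items()
        if task in baseline and score < baseline[task] - REGRESSION_MARGIN
    ]
    if not regressions:
        # Promote the clean run to become the baseline for the next check.
        BASELINE_PATH.write_text(json.dumps(results, indent=2))
    return regressions
```

In practice a job like this would run on a schedule or on every model update, and a non-empty regression list would feed straight into the team’s existing alerting or deployment gates.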
