AI Code Review at Scale: What Datadog Learned

How AI Is Powering Technology and Digital Services in the United States · By 3L3C

Datadog tested Codex on real incident PRs and found AI feedback could’ve changed ~22% of outcomes. Here’s how to apply AI code review for reliability.

Tags: AI code review · Codex · Datadog · SRE · DevOps · Software reliability · Developer experience

Code review is where reliability either gets protected—or quietly traded away.

Datadog put numbers behind that idea by testing OpenAI’s Codex against its own history of production incidents. In an incident replay harness, Codex produced feedback that Datadog engineers said would have made a difference in more than 10 cases—about 22% of the incidents examined. That’s a rare kind of metric in software quality: grounded in real failures, not theory.

This post is part of our ā€œHow AI Is Powering Technology and Digital Services in the United Statesā€ series, and Datadog’s approach is a strong U.S. case study: a major SaaS platform using AI to reduce operational risk in the backend systems customers depend on. The headline isn’t ā€œAI makes engineers faster.ā€ It’s ā€œAI helps prevent incidents by reviewing code with system-level context.ā€

Why AI code review is shifting from linting to risk

Most teams already have linting, SAST, dependency scanning, unit tests, and CI gates. Yet incidents still happen because many failures don’t look like ā€œbad codeā€ in a single diff.

System-level risk hides in interactions. A seemingly safe change can:

  • Break an implicit API contract used by another service
  • Increase load or latency in a downstream dependency
  • Remove a guardrail that only matters under specific traffic patterns
  • Reduce test coverage in the exact area where coupling is highest

Traditional tools struggle here because they’re deterministic: they flag known patterns. They don’t ā€œunderstandā€ intent and architecture. Datadog’s experience mirrors what I’ve seen across modern distributed systems: the hard bugs are cross-module and cross-service, and reviewers can’t keep all context in their heads—especially as codebases and orgs scale.

Datadog’s bet was straightforward: use an AI coding agent for code review that can reason over the broader system, not just the lines that changed.

How Datadog used Codex to bring system context into every PR

Datadog runs a widely used observability platform. When customer systems fail, Datadog is often the tool people open first. That reality shapes engineering priorities: trust and uptime beat raw development speed.

The workflow: automatic PR review in a major repo

Datadog piloted Codex (OpenAI’s coding agent) by integrating it directly into live development workflows. In one of its largest, heavily used repositories, every pull request automatically received a Codex review.

Engineers then reacted with simple signals (thumbs up/down) and shared feedback informally across teams. That detail matters. AI tooling succeeds or fails on adoption, and adoption hinges on whether the feedback feels worth reading.
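A feedback loop like this can be wired up with very little machinery. The sketch below is my own illustration of aggregating thumbs-up/down reactions into a signal-quality metric; the field names and `record_reaction` helper are hypothetical, not Datadog's actual integration:

```python
from dataclasses import dataclass

@dataclass
class ReviewStats:
    """Aggregate thumbs-up/down reactions to AI review comments."""
    up: int = 0
    down: int = 0

    @property
    def signal_ratio(self) -> float:
        # Fraction of scored reactions that were positive.
        total = self.up + self.down
        return self.up / total if total else 0.0

def record_reaction(stats: ReviewStats, reaction: str) -> None:
    # Engineers react with simple signals; anything else is ignored.
    if reaction == "+1":
        stats.up += 1
    elif reaction == "-1":
        stats.down += 1

stats = ReviewStats()
for r in ["+1", "+1", "-1", "eyes", "+1"]:
    record_reaction(stats, r)
print(f"{stats.signal_ratio:.0%} positive")
```

Tracking this ratio per repo is one cheap way to see whether the tool has crossed the "worth reading" threshold before rolling it out further.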

Datadog’s engineers reported that Codex comments were high-signal compared to prior AI review tools that behaved like ā€œadvanced lintersā€ā€”lots of noise, little insight.

What ā€œsystem-levelā€ feedback looks like in practice

According to Datadog’s internal evaluation, Codex regularly flagged issues that aren’t obvious from the diff alone, including:

  • Interactions with modules not touched in the PR (the ā€œblast radiusā€ problem)
  • Missing test coverage in areas where services couple together
  • API contract changes that create downstream risk

Datadog Engineering Manager Brad Carter described it like this:

ā€œFor me, a Codex comment feels like the smartest engineer I’ve worked with and who has infinite time to find bugs.ā€

That’s not about perfection. It’s about capacity. AI can repeatedly do the tedious context-walking that humans skip under time pressure.

The incident replay harness: a better way to measure AI review

If you want to know whether AI code review helps, measuring ā€œtime savedā€ is tempting—and incomplete.

Datadog took the harder route: validate AI review against real incidents.

What Datadog tested

Datadog built an incident replay harness:

  1. Identify historical production incidents
  2. Reconstruct the pull requests that contributed to those incidents
  3. Run Codex on those PRs as if it were reviewing them at the time
  4. Ask incident owners whether Codex’s feedback would have changed the outcome
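The steps above can be sketched as a small evaluation loop. This is a structural illustration only; the `Incident` fields and the reviewer/judge callables are assumptions, not Datadog's harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Incident:
    id: str
    pr_diff: str   # reconstructed diff that contributed to the incident
    owner: str     # incident owner who scores the feedback

def replay(incidents: list[Incident],
           review: Callable[[str], str],
           owner_judges: Callable[[Incident, str], bool]) -> float:
    """Run the AI reviewer on each reconstructed PR and ask the incident
    owner whether the feedback would have changed the outcome.
    Returns the fraction of incidents where the answer was yes."""
    changed = 0
    for inc in incidents:
        feedback = review(inc.pr_diff)       # step 3: review as-of-then
        if owner_judges(inc, feedback):      # step 4: owner's verdict
            changed += 1
    return changed / len(incidents) if incidents else 0.0
```

The key design choice is step 4: the score comes from the humans who owned the incident, not from the model grading itself.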

This setup avoids a common trap: benchmark demos built on toy bugs. Real incidents are messy, system-dependent, and often involve a chain of ā€œreasonableā€ decisions.

The result: 22% is a big deal

Codex surfaced feedback that engineers said would have made a difference in more than 10 cases—about 22% of incidents examined.

Two reasons that number stands out:

  • These PRs already passed human review. So the bar wasn’t ā€œcatch obvious mistakes.ā€ It was ā€œfind risk humans missed.ā€
  • It reframes AI as reliability tooling. Not a replacement for engineers, but a second set of eyes that scales with the codebase.

If you operate a SaaS platform in the U.S. digital economy, 22% isn’t an academic improvement. It’s fewer pages, fewer customer escalations, fewer postmortems, and less trust erosion.

What changes when AI review becomes a reliability system

Datadog didn’t stop at a pilot. After evaluation, it deployed Codex more broadly. As of the case study, more than 1,000 engineers use it regularly.

That scale matters because it suggests the tool crossed the adoption threshold: engineers didn’t just tolerate it—they engaged with it.

Consistency beats heroics

Traditional ā€œgreat code reviewā€ often depends on a few senior engineers who understand the system’s history and tradeoffs. That’s valuable—and brittle.

AI review adds consistency. It doesn’t get tired. It doesn’t context-switch between six urgent PRs. It applies the same level of scrutiny on a Tuesday afternoon as it does before a release.

The practical implication: you’re not hoping the one person who remembers a subtle dependency is available. You’re building that memory into the workflow.

Engineers spend more time on design, less on detection

Datadog’s stance is one I strongly agree with: the goal isn’t to clone the ā€œbest human reviewer.ā€ The goal is to shift human attention up the stack.

Brad Carter put it clearly:

ā€œIt’s not about replicating our best human reviewers. It’s about finding critical flaws and edge cases that humans struggle to see when reviewing changes in isolation.ā€

When AI reliably catches classes of cross-system risk, humans can focus on:

  • Architecture decisions
  • Interface clarity and long-term maintainability
  • Performance and capacity planning assumptions
  • Security threat modeling that requires product context

That’s a better use of expensive engineering time.

How to implement AI-assisted code review in your org (without creating bot noise)

If you’re leading engineering, DevOps, or platform teams, Datadog’s playbook offers a practical path. Here’s what I’d copy.

1) Start where incidents hurt most

Answer first: deploy AI review to the repos with the highest blast radius, not the ones that are easiest to wire up.

Pick one or two:

  • Core platform services
  • Shared libraries used across many teams
  • The repo that generates the most incident load or on-call pain

This is where system-level reasoning has the most upside.
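To make that choice less subjective, you could rank candidate repos on a rough blast-radius score. The weights below are arbitrary and purely illustrative; the point is to force an explicit comparison rather than picking the most convenient repo:

```python
def blast_radius(incidents_last_year: int, dependent_teams: int,
                 oncall_pages: int) -> int:
    # Weight incident history highest; dependents and pages are proxies
    # for how widely a bad change propagates. Weights are arbitrary.
    return 3 * incidents_last_year + 2 * dependent_teams + oncall_pages

# Hypothetical repos: (incidents, dependent teams, on-call pages)
repos = {"core-platform": (8, 12, 30),
         "shared-libs": (3, 20, 5),
         "docs-site": (0, 1, 0)}
ranked = sorted(repos, key=lambda r: blast_radius(*repos[r]), reverse=True)
```
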

2) Make feedback actionable and accountable

AI comments that read like generic advice get ignored. The standard should be: a reviewer can take an action immediately.

Encourage the model (and your prompting wrappers) to:

  • Point to the exact function/module and why it matters
  • Suggest a test to add (or a missing scenario)
  • Describe the potential failure mode (ā€œbreaks pagination for X clientsā€)
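One way to enforce that standard is to require AI comments to carry structured fields before they are posted. This schema is my own illustration of the idea, not any real tool's format:

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    location: str        # exact function/module, e.g. "billing/paginate()"
    rationale: str       # why this code path matters
    failure_mode: str    # concrete risk, e.g. "breaks pagination for X clients"
    suggested_test: str  # the missing scenario to cover

def is_actionable(c: ReviewComment) -> bool:
    # Drop comments missing any concrete field -- generic advice gets ignored.
    return all([c.location, c.rationale, c.failure_mode, c.suggested_test])
```

Filtering on `is_actionable` before posting is a blunt instrument, but it encodes the rule that a reviewer must be able to act on the comment immediately.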

Also decide up front: when does the team have to respond to an AI comment? My preference is a lightweight rule: if the comment alleges a correctness, reliability, or security issue, you either fix it or explicitly mark why it’s not applicable.

3) Measure quality using incident replays—not vanity metrics

Answer first: incident replay is one of the cleanest ways to evaluate AI code review because it tests against what actually went wrong.

A simple starter version:

  • Take the last 20 postmortems
  • Rebuild the key PR(s)
  • Run AI review with the same repo context
  • Have incident owners score: ā€œwould this have changed anything?ā€

Track:

  • % of incidents where AI would have flagged risk
  • False positives that waste reviewer time
  • Categories of misses (so you can refine prompts or guardrails)
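Those three metrics fall out of the replay scores directly. A minimal summary function, assuming each scored incident is recorded as a small dict (the field names are my own):

```python
from collections import Counter

def replay_metrics(results: list[dict]) -> dict:
    """Summarize incident-replay scoring.

    Each result: {"flagged": bool,          # would AI have flagged the risk?
                  "false_positives": int,   # noisy comments on this PR
                  "miss_category": str | None}  # why it was missed, if it was
    """
    n = len(results)
    return {
        "flag_rate": sum(r["flagged"] for r in results) / n if n else 0.0,
        "false_positives": sum(r["false_positives"] for r in results),
        "miss_categories": Counter(r["miss_category"] for r in results
                                   if r["miss_category"]),
    }
```

The miss-category counter is the actionable part: it tells you which prompts or guardrails to refine next.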

4) Redesign your code review culture around risk

Fast PR merges feel productive—until they create operational debt.

Datadog reframed review as risk management. That’s the right frame for U.S. tech companies building digital services at scale, especially in a market where buyers expect uptime, security, and predictable performance.

A practical policy change I’ve found effective:

  • Require a short ā€œrisk noteā€ in PR descriptions for high-impact changes: dependencies touched, rollout plan, and what to monitor.
  • Use AI to critique the risk note: ā€œWhat’s missing? What could fail?ā€
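The risk-note requirement is easy to enforce mechanically before the AI critique even runs. A sketch, assuming a section naming convention I've made up for illustration:

```python
REQUIRED_SECTIONS = ("Dependencies touched", "Rollout plan", "What to monitor")

def risk_note_gaps(pr_description: str) -> list[str]:
    """Return the risk-note sections missing from a PR description.
    Section names are illustrative, not a standard template."""
    lowered = pr_description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]
```

A CI check can block high-impact PRs while `risk_note_gaps` is non-empty, then hand the completed note to the AI with the "What's missing? What could fail?" prompt.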

People also ask: AI code review questions leaders bring up

Does AI replace human code review?

No—and Datadog’s data is a good example of why. AI catches risk humans miss, and humans catch product and design issues AI can’t fully judge. The best setup is collaborative: AI for breadth and consistency, humans for intent and tradeoffs.

Will AI code review slow teams down?

It can, if it produces noise. The goal is fewer, higher-quality comments tied to reliability outcomes. Datadog’s adoption across 1,000+ engineers suggests the signal-to-noise ratio was acceptable—likely because the feedback connected to real system behavior.

What should AI review focus on first?

Start with cross-service interactions, API contracts, test gaps, and operational risk. Style and formatting are already handled by linters and formatting tools.

What Datadog’s Codex case study says about U.S. digital services in 2026

U.S. software companies aren’t adopting AI just to write more code. The more serious trend is using AI to scale judgment—the kind of judgment that prevents incidents, protects SLAs, and keeps customer trust intact.

Datadog’s results show a practical path: embed an AI coding agent in the workflow, test it against your incident history, and treat code review as a risk function. That’s how AI becomes part of the reliability stack, not another gadget in the toolchain.

If you’re evaluating AI-assisted code review this quarter, borrow Datadog’s standard: don’t ask whether it saves minutes per PR. Ask whether it prevents the next incident you’d otherwise spend a week explaining.

What would happen to your on-call load if AI could reliably flag even 1 out of 5 incident-causing changes before they ship?