Datadog tested Codex on real incident PRs and found AI feedback could've changed ~22% of outcomes. Here's how to apply AI code review for reliability.

AI Code Review at Scale: What Datadog Learned
Code review is where reliability either gets protected or quietly traded away.
Datadog put numbers behind that idea by testing OpenAI's Codex against its own history of production incidents. In an incident replay harness, Codex produced feedback that Datadog engineers said would have made a difference in more than 10 cases, about 22% of the incidents examined. That's a rare kind of metric in software quality: grounded in real failures, not theory.
This post is part of our "How AI Is Powering Technology and Digital Services in the United States" series, and Datadog's approach is a strong U.S. case study: a major SaaS platform using AI to reduce operational risk in the backend systems customers depend on. The headline isn't "AI makes engineers faster." It's "AI helps prevent incidents by reviewing code with system-level context."
Why AI code review is shifting from linting to risk
Most teams already have linting, SAST, dependency scanning, unit tests, and CI gates. Yet incidents still happen because many failures don't look like "bad code" in a single diff.
System-level risk hides in interactions. A seemingly safe change can:
- Break an implicit API contract used by another service
- Increase load or latency in a downstream dependency
- Remove a guardrail that only matters under specific traffic patterns
- Reduce test coverage in the exact area where coupling is highest
Traditional tools struggle here because they're deterministic: they flag known patterns. They don't "understand" intent and architecture. Datadog's experience mirrors what I've seen across modern distributed systems: the hard bugs are cross-module and cross-service, and reviewers can't keep all the context in their heads, especially as codebases and orgs scale.
Datadog's bet was straightforward: use an AI coding agent for code review that can reason over the broader system, not just the lines that changed.
How Datadog used Codex to bring system context into every PR
Datadog runs a widely used observability platform. When customer systems fail, Datadog is often the tool people open first. That reality shapes engineering priorities: trust and uptime beat raw development speed.
The workflow: automatic PR review in a major repo
Datadog piloted Codex (OpenAI's coding agent) by integrating it directly into live development workflows. In one of its largest, most heavily used repositories, every pull request automatically received a Codex review.
Engineers then reacted with simple signals (thumbs up/down) and shared feedback informally across teams. That detail matters. AI tooling succeeds or fails on adoption, and adoption hinges on whether the feedback feels worth reading.
Datadog's engineers reported that Codex comments were high-signal compared to prior AI review tools that behaved like "advanced linters": lots of noise, little insight.
What "system-level" feedback looks like in practice
According to Datadog's internal evaluation, Codex regularly flagged issues that aren't obvious from the diff alone, including:
- Interactions with modules not touched in the PR (the "blast radius" problem)
- Missing test coverage in areas where services couple together
- API contract changes that create downstream risk
Datadog Engineering Manager Brad Carter described it like this:
"For me, a Codex comment feels like the smartest engineer I've worked with and who has infinite time to find bugs."
That's not about perfection. It's about capacity. AI can repeatedly do the tedious context-walking that humans skip under time pressure.
The incident replay harness: a better way to measure AI review
If you want to know whether AI code review helps, measuring "time saved" is tempting, and incomplete.
Datadog took the harder route: validate AI review against real incidents.
What Datadog tested
Datadog built an incident replay harness:
- Identify historical production incidents
- Reconstruct the pull requests that contributed to those incidents
- Run Codex on those PRs as if it were reviewing them at the time
- Ask incident owners whether Codex's feedback would have changed the outcome
This setup avoids a common trap: benchmark demos built on toy bugs. Real incidents are messy, system-dependent, and often involve a chain of "reasonable" decisions.
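The four steps above can be sketched as a small harness. This is a minimal illustration, not Datadog's actual implementation: the `Incident` fields and the `review_fn` callable are hypothetical stand-ins for your incident records and whatever calls your AI reviewer.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A historical production incident plus the reconstructed PR diffs
    that contributed to it (illustrative fields, not Datadog's schema)."""
    incident_id: str
    pr_diffs: list                # reconstructed diffs, as plain text
    owner_verdict: bool = False   # filled in later by the incident owner:
                                  # "would this feedback have changed the outcome?"


def replay_incidents(incidents, review_fn):
    """Run an AI review function over each reconstructed PR, as if it were
    reviewing at the time, and collect the comments per incident."""
    results = {}
    for incident in incidents:
        comments = []
        for diff in incident.pr_diffs:
            comments.extend(review_fn(diff))
        results[incident.incident_id] = comments
    return results


def helped_rate(incidents):
    """Fraction of incidents whose owners said the feedback would have mattered
    (the statistic behind Datadog's ~22% figure)."""
    if not incidents:
        return 0.0
    return sum(1 for i in incidents if i.owner_verdict) / len(incidents)
```

In practice `review_fn` would wrap a call to your AI reviewer with full repo context; the owner verdicts are collected manually from postmortem authors, which is what keeps the metric grounded in real failures.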
The result: 22% is a big deal
Codex surfaced feedback that engineers said would have made a difference in more than 10 cases, about 22% of the incidents examined.
Two reasons that number stands out:
- These PRs already passed human review. So the bar wasn't "catch obvious mistakes." It was "find risk humans missed."
- It reframes AI as reliability tooling. Not a replacement for engineers, but a second set of eyes that scales with the codebase.
If you operate a SaaS platform in the U.S. digital economy, 22% isn't an academic improvement. It's fewer pages, fewer customer escalations, fewer postmortems, and less trust erosion.
What changes when AI review becomes a reliability system
Datadog didn't stop at a pilot. After evaluation, it deployed Codex more broadly. As of the case study, more than 1,000 engineers use it regularly.
That scale matters because it suggests the tool crossed the adoption threshold: engineers didn't just tolerate it, they engaged with it.
Consistency beats heroics
Traditional "great code review" often depends on a few senior engineers who understand the system's history and tradeoffs. That's valuable, and brittle.
AI review adds consistency. It doesn't get tired. It doesn't context-switch between six urgent PRs. It applies the same level of scrutiny on a Tuesday afternoon as it does before a release.
The practical implication: you're not hoping the one person who remembers a subtle dependency is available. You're building that memory into the workflow.
Engineers spend more time on design, less on detection
Datadog's stance is one I strongly agree with: the goal isn't to clone the "best human reviewer." The goal is to shift human attention up the stack.
Brad Carter put it clearly:
"It's not about replicating our best human reviewers. It's about finding critical flaws and edge cases that humans struggle to see when reviewing changes in isolation."
When AI reliably catches classes of cross-system risk, humans can focus on:
- Architecture decisions
- Interface clarity and long-term maintainability
- Performance and capacity planning assumptions
- Security threat modeling that requires product context
That's a better use of expensive engineering time.
How to implement AI-assisted code review in your org (without creating bot noise)
If you're leading engineering, DevOps, or platform teams, Datadog's playbook offers a practical path. Here's what I'd copy.
1) Start where incidents hurt most
Answer first: deploy AI review to the repos that have the highest blast radius, not the easiest codebase.
Pick one or two:
- Core platform services
- Shared libraries used across many teams
- The repo that generates the most incident load or on-call pain
This is where system-level reasoning has the most upside.
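One way to make "highest blast radius, not easiest codebase" concrete is a simple weighted score over operational signals. This is a hypothetical sketch: the field names (`incidents_12mo`, `dependent_teams`, `pages_12mo`) and the weights are placeholders to replace with your own incident and on-call data.

```python
def blast_radius_score(repo, w_incidents=3.0, w_dependents=2.0, w_pages=1.0):
    """Rank a candidate repo for an AI-review pilot by operational pain.

    `repo` is a dict with illustrative keys; the default weights favor
    incident history over raw page volume and are meant to be tuned.
    """
    return (w_incidents * repo["incidents_12mo"]
            + w_dependents * repo["dependent_teams"]
            + w_pages * repo["pages_12mo"])


def pick_pilot_repos(repos, n=2):
    """Return the n highest-scoring repos: start where incidents hurt most."""
    return sorted(repos, key=blast_radius_score, reverse=True)[:n]
```

Even a rough score like this beats picking the pilot repo by convenience, because it forces the conversation about where system-level reasoning actually pays off.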
2) Make feedback actionable and accountable
AI comments that read like generic advice get ignored. The standard should be: a reviewer can take an action immediately.
Encourage the model (and your prompting wrappers) to:
- Point to the exact function/module and why it matters
- Suggest a test to add (or a missing scenario)
- Describe the potential failure mode ("breaks pagination for X clients")
Also decide up front: when does the team have to respond to an AI comment? My preference is a lightweight rule: if the comment alleges a correctness, reliability, or security issue, you either fix it or explicitly mark why it's not applicable.
3) Measure quality using incident replays, not vanity metrics
Answer first: incident replay is one of the cleanest ways to evaluate AI code review because it tests against what actually went wrong.
A simple starter version:
- Take the last 20 postmortems
- Rebuild the key PR(s)
- Run AI review with the same repo context
- Have incident owners score: "would this have changed anything?"
Track:
- % of incidents where AI would have flagged risk
- False positives that waste reviewer time
- Categories of misses (so you can refine prompts or guardrails)
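The three tracked quantities above reduce to a small summary function. This is a sketch under assumed record shapes: each replayed incident is a dict with hypothetical keys `would_have_flagged`, `false_positives`, and `miss_category`.

```python
from collections import Counter


def replay_metrics(reviews):
    """Summarize incident-replay results into the three tracked metrics.

    `reviews` holds one dict per replayed incident (illustrative keys):
      - would_have_flagged: bool, per the incident owner's judgment
      - false_positives: int, noisy comments on that PR
      - miss_category: str or None, e.g. "api-contract", "test-gap"
    """
    total = len(reviews)
    flagged = sum(1 for r in reviews if r["would_have_flagged"])
    return {
        # % of incidents where AI would have flagged risk
        "flag_rate": flagged / total if total else 0.0,
        # false positives that waste reviewer time
        "false_positives": sum(r["false_positives"] for r in reviews),
        # categories of misses, to refine prompts or guardrails
        "miss_categories": Counter(
            r["miss_category"] for r in reviews if r["miss_category"]
        ),
    }
```

Tracking miss categories as a `Counter` is the useful part: it turns "the AI missed some things" into "the AI consistently misses API-contract changes," which is something you can act on.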
4) Redesign your code review culture around risk
Fast PR merges feel productive until they create operational debt.
Datadog reframed review as risk management. That's the right frame for U.S. tech companies building digital services at scale, especially in a market where buyers expect uptime, security, and predictable performance.
A practical policy change I've found effective:
- Require a short "risk note" in PR descriptions for high-impact changes: dependencies touched, rollout plan, and what to monitor.
- Use AI to critique the risk note: "What's missing? What could fail?"
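The risk-note requirement is enforceable with a trivial CI check before the AI critique even runs. A minimal sketch; the section names mirror the policy above but should match whatever your own PR template uses:

```python
# Illustrative section names; align these with your PR description template.
REQUIRED_SECTIONS = ("Dependencies touched", "Rollout plan", "What to monitor")


def missing_risk_sections(pr_description: str):
    """Return the risk-note sections absent from a PR description.

    A CI job can fail the check (or ping the author) when this list
    is non-empty for a change labeled high-impact.
    """
    text = pr_description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in text]
```

The mechanical check guarantees the note exists; the AI critique ("What's missing? What could fail?") is what makes it honest.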
People also ask: AI code review questions leaders bring up
Does AI replace human code review?
No, and Datadog's data is a good example of why. AI catches risk humans miss, and humans catch product and design issues AI can't fully judge. The best setup is collaborative: AI for breadth and consistency, humans for intent and tradeoffs.
Will AI code review slow teams down?
It can, if it produces noise. The goal is fewer, higher-quality comments tied to reliability outcomes. Datadog's adoption across 1,000+ engineers suggests the signal-to-noise ratio was acceptable, likely because the feedback connected to real system behavior.
What should AI review focus on first?
Start with cross-service interactions, API contracts, test gaps, and operational risk. Style and formatting are already handled by linters and formatting tools.
What Datadog's Codex case study says about U.S. digital services in 2026
U.S. software companies aren't adopting AI just to write more code. The more serious trend is using AI to scale judgment: the kind of judgment that prevents incidents, protects SLAs, and keeps customer trust intact.
Datadog's results show a practical path: embed an AI coding agent in the workflow, test it against your incident history, and treat code review as a risk function. That's how AI becomes part of the reliability stack, not another gadget in the toolchain.
If you're evaluating AI-assisted code review this quarter, borrow Datadog's standard: don't ask whether it saves minutes per PR. Ask whether it prevents the next incident you'd otherwise spend a week explaining.
What would happen to your on-call load if AI could reliably flag even 1 out of 5 incident-causing changes before they ship?