Datadog tested Codex on real incident PRs and found AI feedback could've changed ~22% of outcomes. Here's how to apply AI code review for reliability.

AI Code Review at Scale: What Datadog Learned
Code review is where reliability either gets protected or quietly traded away.
Datadog put numbers behind that idea by testing OpenAI's Codex against its own history of production incidents. In an incident replay harness, Codex produced feedback that Datadog engineers said would have made a difference in more than 10 cases, about 22% of the incidents examined. That's a rare kind of metric in software quality: grounded in real failures, not theory.
This post is part of our "How AI Is Powering Technology and Digital Services in the United States" series, and Datadog's approach is a strong U.S. case study: a major SaaS platform using AI to reduce operational risk in the backend systems customers depend on. The headline isn't "AI makes engineers faster." It's "AI helps prevent incidents by reviewing code with system-level context."
Why AI code review is shifting from linting to risk
Most teams already have linting, SAST, dependency scanning, unit tests, and CI gates. Yet incidents still happen because many failures don't look like "bad code" in a single diff.
System-level risk hides in interactions. A seemingly safe change can:
- Break an implicit API contract used by another service
- Increase load or latency in a downstream dependency
- Remove a guardrail that only matters under specific traffic patterns
- Reduce test coverage in the exact area where coupling is highest
Traditional tools struggle here because they're deterministic: they flag known patterns. They don't "understand" intent and architecture. Datadog's experience mirrors what I've seen across modern distributed systems: the hard bugs are cross-module and cross-service, and reviewers can't keep all the context in their heads, especially as codebases and orgs scale.
Datadog's bet was straightforward: use an AI coding agent for code review that can reason over the broader system, not just the lines that changed.
How Datadog used Codex to bring system context into every PR
Datadog runs a widely used observability platform. When customer systems fail, Datadog is often the tool people open first. That reality shapes engineering priorities: trust and uptime beat raw development speed.
The workflow: automatic PR review in a major repo
Datadog piloted Codex (OpenAI's coding agent) by integrating it directly into live development workflows. In one of its largest, most heavily used repositories, every pull request automatically received a Codex review.
Engineers then reacted with simple signals (thumbs up/down) and shared feedback informally across teams. That detail matters. AI tooling succeeds or fails on adoption, and adoption hinges on whether the feedback feels worth reading.
Datadog's engineers reported that Codex comments were high-signal compared to prior AI review tools that behaved like "advanced linters": lots of noise, little insight.
What "system-level" feedback looks like in practice
According to Datadog's internal evaluation, Codex regularly flagged issues that aren't obvious from the diff alone, including:
- Interactions with modules not touched in the PR (the "blast radius" problem)
- Missing test coverage in areas where services couple together
- API contract changes that create downstream risk
Datadog Engineering Manager Brad Carter described it like this:
"For me, a Codex comment feels like the smartest engineer I've worked with and who has infinite time to find bugs."
That's not about perfection. It's about capacity. AI can repeatedly do the tedious context-walking that humans skip under time pressure.
The incident replay harness: a better way to measure AI review
If you want to know whether AI code review helps, measuring "time saved" is tempting, and incomplete.
Datadog took the harder route: validate AI review against real incidents.
What Datadog tested
Datadog built an incident replay harness:
- Identify historical production incidents
- Reconstruct the pull requests that contributed to those incidents
- Run Codex on those PRs as if it were reviewing them at the time
- Ask incident owners whether Codex's feedback would have changed the outcome
This setup avoids a common trap: benchmark demos built on toy bugs. Real incidents are messy, system-dependent, and often involve a chain of "reasonable" decisions.
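The four steps above can be sketched as a small harness. This is a minimal illustration, not Datadog's actual implementation: the `Incident` fields and the `review_fn` callable are hypothetical stand-ins for your incident records and whatever calls your AI reviewer.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A historical production incident plus the reconstructed PR diffs
    that contributed to it (illustrative fields, not Datadog's schema)."""
    incident_id: str
    pr_diffs: list                # reconstructed diffs, as plain text
    owner_verdict: bool = False   # filled in later by the incident owner:
                                  # "would this feedback have changed the outcome?"


def replay_incidents(incidents, review_fn):
    """Run an AI review function over each reconstructed PR, as if it were
    reviewing at the time, and collect the comments per incident."""
    results = {}
    for incident in incidents:
        comments = []
        for diff in incident.pr_diffs:
            comments.extend(review_fn(diff))
        results[incident.incident_id] = comments
    return results


def helped_rate(incidents):
    """Fraction of incidents whose owners said the feedback would have mattered
    (the statistic behind Datadog's ~22% figure)."""
    if not incidents:
        return 0.0
    return sum(1 for i in incidents if i.owner_verdict) / len(incidents)
```

In practice `review_fn` would wrap a call to your AI reviewer with full repo context; the owner verdicts are collected manually from postmortem authors, which is what keeps the metric grounded in real failures.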
The result: 22% is a big deal
Codex surfaced feedback that engineers said would have made a difference in more than 10 cases, about 22% of the incidents examined.
Two reasons that number stands out:
- These PRs already passed human review. So the bar wasn't "catch obvious mistakes." It was "find risk humans missed."
- It reframes AI as reliability tooling. Not a replacement for engineers, but a second set of eyes that scales with the codebase.
If you operate a SaaS platform in the U.S. digital economy, 22% isn't an academic improvement. It's fewer pages, fewer customer escalations, fewer postmortems, and less trust erosion.
What changes when AI review becomes a reliability system
Datadog didn't stop at a pilot. After evaluation, it deployed Codex more broadly. As of the case study, more than 1,000 engineers use it regularly.
That scale matters because it suggests the tool crossed the adoption threshold: engineers didn't just tolerate it, they engaged with it.
Consistency beats heroics
Traditional "great code review" often depends on a few senior engineers who understand the system's history and tradeoffs. That's valuable, and brittle.
AI review adds consistency. It doesn't get tired. It doesn't context-switch between six urgent PRs. It applies the same level of scrutiny on a Tuesday afternoon as it does before a release.
The practical implication: you're not hoping the one person who remembers a subtle dependency is available. You're building that memory into the workflow.
Engineers spend more time on design, less on detection
Datadog's stance is one I strongly agree with: the goal isn't to clone the "best human reviewer." The goal is to shift human attention up the stack.
Brad Carter put it clearly:
"It's not about replicating our best human reviewers. It's about finding critical flaws and edge cases that humans struggle to see when reviewing changes in isolation."
When AI reliably catches classes of cross-system risk, humans can focus on:
- Architecture decisions
- Interface clarity and long-term maintainability
- Performance and capacity planning assumptions
- Security threat modeling that requires product context
That's a better use of expensive engineering time.
How to implement AI-assisted code review in your org (without creating bot noise)
If you're leading engineering, DevOps, or platform teams, Datadog's playbook offers a practical path. Here's what I'd copy.
1) Start where incidents hurt most
Answer first: deploy AI review to the repos that have the highest blast radius, not the easiest codebase.
Pick one or two:
- Core platform services
- Shared libraries used across many teams
- The repo that generates the most incident load or on-call pain
This is where system-level reasoning has the most upside.
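One way to make "highest blast radius, not easiest codebase" concrete is a simple weighted score over operational signals. This is a hypothetical sketch: the field names (`incidents_12mo`, `dependent_teams`, `pages_12mo`) and the weights are placeholders to replace with your own incident and on-call data.

```python
def blast_radius_score(repo, w_incidents=3.0, w_dependents=2.0, w_pages=1.0):
    """Rank a candidate repo for an AI-review pilot by operational pain.

    `repo` is a dict with illustrative keys; the default weights favor
    incident history over raw page volume and are meant to be tuned.
    """
    return (w_incidents * repo["incidents_12mo"]
            + w_dependents * repo["dependent_teams"]
            + w_pages * repo["pages_12mo"])


def pick_pilot_repos(repos, n=2):
    """Return the n highest-scoring repos: start where incidents hurt most."""
    return sorted(repos, key=blast_radius_score, reverse=True)[:n]
```

Even a rough score like this beats picking the pilot repo by convenience, because it forces the conversation about where system-level reasoning actually pays off.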
2) Make feedback actionable and accountable
AI comments that read like generic advice get ignored. The standard should be: a reviewer can take an action immediately.
Encourage the model (and your prompting wrappers) to:
- Point to the exact function/module and why it matters
- Suggest a test to add (or a missing scenario)
- Describe the potential failure mode ("breaks pagination for X clients")
Also decide up front: when does the team have to respond to an AI comment? My preference is a lightweight rule: if the comment alleges a correctness, reliability, or security issue, you either fix it or explicitly mark why it's not applicable.
3) Measure quality using incident replays, not vanity metrics
Answer first: incident replay is one of the cleanest ways to evaluate AI code review because it tests against what actually went wrong.
A simple starter version:
- Take the last 20 postmortems
- Rebuild the key PR(s)
- Run AI review with the same repo context
- Have incident owners score: "would this have changed anything?"
Track:
- % of incidents where AI would have flagged risk
- False positives that waste reviewer time
- Categories of misses (so you can refine prompts or guardrails)
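The three tracked quantities above reduce to a small summary function. This is a sketch under assumed record shapes: each replayed incident is a dict with hypothetical keys `would_have_flagged`, `false_positives`, and `miss_category`.

```python
from collections import Counter


def replay_metrics(reviews):
    """Summarize incident-replay results into the three tracked metrics.

    `reviews` holds one dict per replayed incident (illustrative keys):
      - would_have_flagged: bool, per the incident owner's judgment
      - false_positives: int, noisy comments on that PR
      - miss_category: str or None, e.g. "api-contract", "test-gap"
    """
    total = len(reviews)
    flagged = sum(1 for r in reviews if r["would_have_flagged"])
    return {
        # % of incidents where AI would have flagged risk
        "flag_rate": flagged / total if total else 0.0,
        # false positives that waste reviewer time
        "false_positives": sum(r["false_positives"] for r in reviews),
        # categories of misses, to refine prompts or guardrails
        "miss_categories": Counter(
            r["miss_category"] for r in reviews if r["miss_category"]
        ),
    }
```

Tracking miss categories as a `Counter` is the useful part: it turns "the AI missed some things" into "the AI consistently misses API-contract changes," which is something you can act on.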
4) Redesign your code review culture around risk
Fast PR merges feel productive until they create operational debt.
Datadog reframed review as risk management. That's the right frame for U.S. tech companies building digital services at scale, especially in a market where buyers expect uptime, security, and predictable performance.
A practical policy change I've found effective:
- Require a short "risk note" in PR descriptions for high-impact changes: dependencies touched, rollout plan, and what to monitor.
- Use AI to critique the risk note: "What's missing? What could fail?"
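The risk-note requirement is enforceable with a trivial CI check before the AI critique even runs. A minimal sketch; the section names mirror the policy above but should match whatever your own PR template uses:

```python
# Illustrative section names; align these with your PR description template.
REQUIRED_SECTIONS = ("Dependencies touched", "Rollout plan", "What to monitor")


def missing_risk_sections(pr_description: str):
    """Return the risk-note sections absent from a PR description.

    A CI job can fail the check (or ping the author) when this list
    is non-empty for a change labeled high-impact.
    """
    text = pr_description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in text]
```

The mechanical check guarantees the note exists; the AI critique ("What's missing? What could fail?") is what makes it honest.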
People also ask: AI code review questions leaders bring up
Does AI replace human code review?
No, and Datadog's data is a good example of why. AI catches risk humans miss, and humans catch product and design issues AI can't fully judge. The best setup is collaborative: AI for breadth and consistency, humans for intent and tradeoffs.
Will AI code review slow teams down?
It can, if it produces noise. The goal is fewer, higher-quality comments tied to reliability outcomes. Datadog's adoption across 1,000+ engineers suggests the signal-to-noise ratio was acceptable, likely because the feedback connected to real system behavior.
What should AI review focus on first?
Start with cross-service interactions, API contracts, test gaps, and operational risk. Style and formatting are already handled by linters and formatting tools.
What Datadog's Codex case study says about U.S. digital services in 2026
U.S. software companies aren't adopting AI just to write more code. The more serious trend is using AI to scale judgment: the kind of judgment that prevents incidents, protects SLAs, and keeps customer trust intact.
Datadog's results show a practical path: embed an AI coding agent in the workflow, test it against your incident history, and treat code review as a risk function. That's how AI becomes part of the reliability stack, not another gadget in the toolchain.
If you're evaluating AI-assisted code review this quarter, borrow Datadog's standard: don't ask whether it saves minutes per PR. Ask whether it prevents the next incident you'd otherwise spend a week explaining.
What would happen to your on-call load if AI could reliably flag even 1 out of 5 incident-causing changes before they ship?