
Build Better AI Agents With AgentKit and Evals
Most companies don’t have an “AI problem.” They have a reliability problem.
You can get an AI agent to demo well in a sandbox in an afternoon. Getting that same agent to behave in production—across messy customer requests, edge cases, and peak-season traffic—usually takes weeks of cleanup work: prompt tweaks, tool-call bug fixes, and the slow realization that you don’t have a real way to measure quality.
That’s why the latest wave of agent tooling—AgentKit, new Evals, and RFT (reinforcement fine-tuning) for agents—matters for U.S.-based SaaS and digital service teams. It’s not about “more AI.” It’s about building agents you can trust, proving they’re improving, and scaling customer-facing automation without your support org paying the price.
A practical rule: if you can’t evaluate an agent, you can’t responsibly deploy it.
This post is part of our series, “How AI Is Powering Technology and Digital Services in the United States.” The focus here is straightforward: how these agent-development building blocks help U.S. companies ship smarter automation, reduce support load, and grow pipeline—without turning customer experience into a science experiment.
AgentKit: the missing “app framework” for AI agents
AgentKit is best understood as an agent development framework that standardizes how agents use tools, follow workflows, and run safely in production. If you’ve built agents by stitching together prompts, tools, and custom glue code, you already know the pain: everything works—until it doesn’t.
In practice, teams need three things to ship agentic systems:
- A consistent way to define tools (APIs, databases, internal services) and how the agent calls them
- Execution controls so agents don’t loop, overspend tokens, or take risky actions
- Observability so you can see what happened when an agent failed and why
AgentKit-style frameworks matter because they push teams away from “prompt-only” prototypes and toward software engineering discipline: typed interfaces, deterministic boundaries, and testable behavior.
What changes when you build with an agent framework
You stop treating the model like the product. The product becomes the system: model + tools + memory + policies + evaluation.
I’ve found that the teams who ship reliable AI automation do a few unsexy things early:
- They define tool contracts like real APIs. Inputs validated. Outputs structured. Errors handled.
- They limit what the agent can do. Permissions, allowlists, approval steps for sensitive actions.
- They log everything that matters. Tool calls, intermediate reasoning artifacts (when appropriate), and final outputs.
This matters because most agent failures aren’t “the model got confused.” They’re “the tool returned null,” “the agent called the wrong endpoint,” or “the workflow didn’t enforce a required check.”
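To make the tool-contract idea concrete, here's a minimal sketch in Python using Pydantic for validation. The tool, its fields, and the stubbed billing lookup are hypothetical stand-ins rather than any particular framework's API:

```python
from datetime import datetime, timezone
from pydantic import BaseModel, Field, ValidationError

# Hypothetical tool contract: inputs validated, outputs structured, errors handled.
class InvoiceLookupInput(BaseModel):
    customer_id: str = Field(min_length=1)
    invoice_number: str = Field(pattern=r"^INV-\d{6}$")

class InvoiceLookupOutput(BaseModel):
    status: str                   # "ok" or "error"
    summary: str | None = None
    error: str | None = None

def invoice_lookup_tool(raw_args: dict) -> InvoiceLookupOutput:
    """Wraps a (stubbed) billing lookup behind a strict contract."""
    try:
        args = InvoiceLookupInput.model_validate(raw_args)
    except ValidationError as exc:
        # The agent gets a structured error it can recover from, not a stack trace.
        return InvoiceLookupOutput(status="error", error=f"invalid arguments: {exc.errors()}")

    # ... call your billing system here; stubbed for the sketch ...
    summary = f"Invoice {args.invoice_number} for customer {args.customer_id}: paid in full."

    # Log everything that matters: tool name, validated args, timestamp, outcome.
    print({"tool": "invoice_lookup", "args": args.model_dump(),
           "at": datetime.now(timezone.utc).isoformat(), "status": "ok"})
    return InvoiceLookupOutput(status="ok", summary=summary)
```

The specific library isn't the point. The point is that the agent only ever sees validated inputs, structured outputs, and recoverable errors.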
Where AgentKit fits for U.S. digital services
U.S. SaaS and digital service providers are under pressure to do more with lean teams—especially heading into a new budget year. Agent frameworks help because they reduce rebuilds. Once you’ve standardized tool calling, guardrails, and evaluation, each new agent feels more like adding a feature than launching a brand-new product.
Common high-ROI agent use cases include:
- Customer support triage (categorize, summarize, route, draft replies)
- Account operations (plan changes, refunds with approvals, invoice explanations)
- Sales development (research + personalized outreach drafts + CRM updates)
- IT service desk (reset flows, access requests, knowledge-base guidance)
New Evals: how you turn agent quality into a measurable KPI
Evals are how you make agent performance measurable, repeatable, and improvable. Without them, "it feels better" becomes your release criterion, which is how customer experience quietly degrades.
The modern agent stack needs evals that go beyond “is the answer correct?” because agents do more than answer. They:
- choose tools,
- follow multi-step workflows,
- handle refusals and policy boundaries,
- and behave consistently across many turns.
A strong eval suite typically includes both offline (pre-release) and online (production) measurement.
The evals most teams should run first
If you’re a U.S.-based SaaS team trying to scale AI customer engagement, start with evals that reflect real business risk:
- Tool selection accuracy
  - Did the agent pick the right tool for the job?
  - Example metric: % of cases where the first tool call is correct.
- Tool argument correctness
  - Did it pass valid parameters (customer ID, ticket number, SKU)?
  - Example metric: % of tool calls that succeed without retries.
- Policy and safety compliance
  - Does it avoid disallowed actions (PII exposure, financial promises, medical/legal instructions)?
  - Example metric: % of responses that meet policy.
- Workflow completion rate
  - Can it finish a multi-step task without looping or stalling?
  - Example metric: % of tasks completed within N steps.
- Customer experience scoring
  - Tone, clarity, and "did we actually help?"
  - Example metric: human-rated helpfulness on a 1–5 rubric.
The reality? A small eval suite that you run every week beats a giant suite you never maintain.
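To show how lightweight this can be, here's one way to compute the first two metrics from logged runs. The record fields (`expected_tool`, `first_tool_call`, `tool_calls`) are assumptions about your own logging format, not a standard schema:

```python
# Minimal offline eval sketch: compute tool-selection accuracy and tool-call
# success rate from logged agent runs.

def tool_selection_accuracy(runs: list[dict]) -> float:
    """% of cases where the first tool call matches the expected tool."""
    scored = [r for r in runs if r.get("expected_tool")]
    hits = sum(1 for r in scored if r.get("first_tool_call") == r["expected_tool"])
    return hits / len(scored) if scored else 0.0

def tool_call_success_rate(runs: list[dict]) -> float:
    """% of tool calls that succeed without retries."""
    calls = [c for r in runs for c in r.get("tool_calls", [])]
    ok = sum(1 for c in calls if c.get("status") == "ok" and c.get("retries", 0) == 0)
    return ok / len(calls) if calls else 0.0

runs = [
    {"expected_tool": "invoice_lookup", "first_tool_call": "invoice_lookup",
     "tool_calls": [{"status": "ok", "retries": 0}]},
    {"expected_tool": "refund_check", "first_tool_call": "invoice_lookup",
     "tool_calls": [{"status": "error", "retries": 2}]},
]
print(f"tool selection accuracy: {tool_selection_accuracy(runs):.0%}")
print(f"tool-call success rate:  {tool_call_success_rate(runs):.0%}")
```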
A practical rubric for customer-facing agents
If you need something you can implement quickly, use a 10-point rubric split into five 0–2 scores:
- Correctness (0–2): factual and aligned to internal systems
- Completeness (0–2): addresses the full request and next step
- Tool hygiene (0–2): correct tool usage, no hallucinated actions
- Safety/compliance (0–2): respects data handling and policies
- Clarity/tone (0–2): readable, calm, and customer-appropriate
This creates an eval score you can trend over time—then tie to operational metrics like deflection rate and CSAT.
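One way to encode that rubric so the score can be trended week over week, assuming human raters (or a grader you've validated) supply the five sub-scores:

```python
from dataclasses import dataclass, fields

@dataclass
class RubricScore:
    """Five 0–2 sub-scores; total out of 10. Field names mirror the rubric above."""
    correctness: int
    completeness: int
    tool_hygiene: int
    safety_compliance: int
    clarity_tone: int

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (0, 1, 2):
                raise ValueError(f"{f.name} must be 0, 1, or 2 (got {value})")

    @property
    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))

# Example: a solid but incomplete reply with clean tool usage.
score = RubricScore(correctness=2, completeness=1, tool_hygiene=2,
                    safety_compliance=2, clarity_tone=2)
print(f"rubric score: {score.total}/10")
```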
RFT for agents: training the behaviors you actually want
RFT (reinforcement fine-tuning) is how you teach an agent to follow your preferred behaviors under pressure—especially in multi-step tool workflows. Prompting can get you 70% of the way. RFT is often what gets you from “pretty good” to “dependable.”
For agentic systems, the goal isn’t to memorize facts. It’s to stabilize decision-making:
- when to ask a clarifying question,
- when to call a tool,
- how to recover from tool errors,
- how to follow your policies,
- and how to produce outputs in your brand voice.
When RFT is worth it (and when it isn’t)
RFT is worth the effort when:
- You have high volume (support, IT tickets, sales ops) and small improvements pay off.
- The agent’s work has a clear success signal (completed workflow, correct tool call, safe output).
- You can create labeled examples or preference data reliably.
RFT is usually not the first move when:
- You’re still changing tools and workflows weekly.
- You don’t have agreement on what “good” looks like.
- You can’t evaluate outcomes consistently.
The sequencing I recommend:
- Get the workflow stable with AgentKit-style structure.
- Add evals until you can detect regressions.
- Use RFT to push up the metrics that matter.
Three ways U.S. tech companies are applying this right now
These tools are most valuable when they turn agent experimentation into repeatable delivery. Here are three patterns showing up across U.S. SaaS and digital service teams.
1) Scaling support without burning out humans
Answer-first: Teams are deploying agents to draft responses and resolve simple cases, but they’re winning by measuring outcomes, not by chasing full automation.
A practical support workflow:
- Agent reads the ticket and customer history.
- Agent proposes a resolution and cites internal knowledge base articles.
- Agent either resolves (low-risk) or routes to a human with a structured summary.
What makes this work:
- Evals on hallucination rate and tool correctness
- Guardrails: refunds, cancellations, and account changes require approval
The result is rarely “90% deflection” overnight. It’s more often faster first response time and fewer back-and-forth cycles—exactly what improves customer experience during seasonal surges.
2) Automating account operations with approvals and audit trails
Answer-first: Account ops is where agents pay off quickly because tasks are structured and tool-driven. Think plan upgrades, invoicing explanations, usage summaries, and renewal prep.
AgentKit-style frameworks help enforce:
- step order (verify identity → check eligibility → execute change),
- approval gates,
- and logs that help with audits.
Evals here are straightforward: did the workflow complete, did the tool call succeed, did we follow policy? That clarity makes RFT especially effective.
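Here's a minimal sketch of what enforcing that step order and approval gate can look like. The step checks and the approval hook are hypothetical placeholders for your own systems:

```python
# Hypothetical account-ops workflow: the order is enforced in code, not left
# to the model, and the risky step requires explicit human approval.

def change_plan(customer_id: str, new_plan: str, approve) -> dict:
    audit: list[str] = []

    # Step 1: verify identity (stubbed).
    if not customer_id.startswith("cus_"):
        return {"status": "rejected", "reason": "identity check failed", "audit": audit}
    audit.append(f"verified identity for {customer_id}")

    # Step 2: check eligibility (stubbed).
    if new_plan not in {"starter", "pro", "enterprise"}:
        return {"status": "rejected", "reason": "plan not eligible", "audit": audit}
    audit.append(f"eligibility confirmed for plan '{new_plan}'")

    # Step 3: approval gate before anything irreversible happens.
    if not approve(f"Change {customer_id} to '{new_plan}'?"):
        return {"status": "pending_approval", "audit": audit}
    audit.append("human approval granted")

    # Step 4: execute the change (stubbed) and keep the audit trail.
    audit.append(f"plan changed to '{new_plan}'")
    return {"status": "completed", "audit": audit}

# Example run with an auto-approving stub standing in for a human reviewer.
print(change_plan("cus_123", "pro", approve=lambda prompt: True))
```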
3) Improving sales workflows without turning them into spam
Answer-first: Sales agents should be judged on relevance and accuracy, not volume. The fastest way to ruin a brand is to automate outreach that sounds confident and wrong.
A better use:
- Agent researches an account and summarizes why they might care.
- Agent drafts a message with constraints: no invented claims, cite product capabilities correctly.
- Agent logs what it used (CRM notes, product docs) so reps can review.
Evals to run:
- factuality against your own product positioning
- “no fabricated metrics” compliance
- tone consistency (professional, not hype)
What to implement in January: a 30-day agent readiness plan
Answer-first: If you want reliable AI automation in Q1, spend 30 days building an evaluation loop and one production-grade agent workflow.
Here’s a plan that works for many teams.
Week 1: Pick a narrow agent job and define success
Choose one:
- support ticket summarization + routing,
- refund eligibility checker (with approval),
- invoice explainer,
- or lead research brief.
Define three success metrics, for example (a simple release-gate check is sketched after this list):
- 85% tool-call success rate
- 0 high-severity policy violations in the evaluation set
- average human rating ≥ 4/5 on clarity
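A simple sketch of turning those metrics into a go/no-go release gate; the thresholds mirror the examples above and aren't recommendations:

```python
# Release gate sketch: check this week's eval results against the Week 1 metrics.
THRESHOLDS = {
    "tool_call_success_rate": 0.85,   # at least 85%
    "high_severity_violations": 0,    # at most zero
    "avg_clarity_rating": 4.0,        # at least 4/5
}

def release_gate(results: dict) -> bool:
    ok = (
        results["tool_call_success_rate"] >= THRESHOLDS["tool_call_success_rate"]
        and results["high_severity_violations"] <= THRESHOLDS["high_severity_violations"]
        and results["avg_clarity_rating"] >= THRESHOLDS["avg_clarity_rating"]
    )
    print("ship it" if ok else "hold the release", results)
    return ok

release_gate({"tool_call_success_rate": 0.91,
              "high_severity_violations": 0,
              "avg_clarity_rating": 4.3})
```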
Week 2: Build tool contracts and guardrails
- Make tool inputs strict (IDs, enums, required fields)
- Force the agent to output structured fields when possible
- Add stop conditions to avoid loops
A memorable rule: If a tool can do damage, put a human in the middle.
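Here's a sketch of what those guardrails can look like inside the agent loop itself: enum-typed inputs, a hard step limit, and a hand-off to a human whenever a sensitive tool comes up. The tool names and loop structure are illustrative, not tied to a specific framework:

```python
from enum import Enum

# Strict inputs: enums instead of free-form strings.
class RefundReason(str, Enum):
    DUPLICATE_CHARGE = "duplicate_charge"
    SERVICE_OUTAGE = "service_outage"
    OTHER = "other"

SENSITIVE_TOOLS = {"issue_refund", "cancel_account"}  # human in the middle
MAX_STEPS = 8                                         # stop condition against loops

def run_agent(plan_next_step) -> str:
    """Drives a (hypothetical) agent step function with guardrails applied."""
    for step in range(MAX_STEPS):
        action = plan_next_step(step)          # e.g. {"tool": "lookup_ticket", ...}
        if action["tool"] == "finish":
            return "done"
        if action["tool"] in SENSITIVE_TOOLS:
            return f"escalated to human before '{action['tool']}'"
        # ... execute the safe tool call here ...
    return "stopped: step limit reached"       # never loop forever

# Example: the agent tries a refund on its second step and gets escalated.
print(run_agent(lambda step: {"tool": "lookup_ticket"} if step == 0
                else {"tool": "issue_refund", "reason": RefundReason.DUPLICATE_CHARGE}))
```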
Week 3: Create evals from real conversations
- Pull 50–200 anonymized historical cases
- Write expected outcomes (correct route, correct action, correct summary); one possible case format is sketched after this list
- Add a rubric for tone and helpfulness
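One possible case format, stored as JSONL so the suite is easy to version and diff in code review; the field names are assumptions you'd adapt to your own routes and actions:

```python
import json

# Hypothetical eval cases built from anonymized historical tickets.
# Each case pairs the original request with the outcome a human would expect.
cases = [
    {
        "case_id": "ticket-0042",
        "input": "I was charged twice for my March invoice.",
        "expected_route": "billing",
        "expected_action": "invoice_lookup",
        "expected_summary_mentions": ["duplicate charge", "March"],
    },
    {
        "case_id": "ticket-0107",
        "input": "How do I add a teammate to my workspace?",
        "expected_route": "self_serve_help",
        "expected_action": "kb_search",
        "expected_summary_mentions": ["invite", "workspace settings"],
    },
]

# One JSON object per line keeps diffs readable as the suite grows.
with open("eval_cases.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```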
Week 4: Run evals weekly and train where it counts
- Compare prompts/tooling changes against eval scores
- If results plateau, consider RFT for the behaviors you can measure
This cadence turns agent development into product iteration instead of endless prompt tinkering.
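To make the weekly comparison concrete, here's a minimal regression check between last week's eval scores and this week's; the metric names and the two-point tolerance are assumptions to tune for your own suite:

```python
# Weekly regression check sketch: compare the new eval run against the
# previous one and flag any metric that dropped by more than a tolerance.
TOLERANCE = 0.02  # allow two percentage points of noise before flagging

def find_regressions(previous: dict, current: dict) -> list[str]:
    flagged = []
    for metric, old_value in previous.items():
        new_value = current.get(metric, 0.0)
        if new_value < old_value - TOLERANCE:
            flagged.append(f"{metric}: {old_value:.2f} -> {new_value:.2f}")
    return flagged

previous = {"tool_selection_accuracy": 0.88, "workflow_completion": 0.81, "rubric_avg": 0.84}
current  = {"tool_selection_accuracy": 0.90, "workflow_completion": 0.74, "rubric_avg": 0.85}

regressions = find_regressions(previous, current)
print("regressions:", regressions or "none, safe to keep iterating")
```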
People also ask: practical questions about agent tooling
How is an AI agent different from a chatbot?
An AI agent completes tasks by using tools and following workflows, not just generating text. A chatbot answers; an agent acts (within guardrails).
What’s the biggest reason agents fail in production?
Bad tool integration and missing evaluation. If the agent can’t reliably call tools—or you can’t measure when it regresses—failure is guaranteed.
Do I need RFT to build a useful agent?
No. Most teams should start with strong tool design, guardrails, and evals. Use RFT when you have stable workflows and enough data to train toward measurable outcomes.
Where this fits in the bigger “AI in U.S. digital services” story
AI adoption in the United States is shifting from experimentation to operational discipline. The winners aren’t the teams with the flashiest demos. They’re the teams that can answer basic, executive-level questions:
- Can we prove quality is improving?
- Can we keep customers safe?
- Can we scale without ballooning headcount?
AgentKit, modern evals, and RFT for agents all point in the same direction: agents as real software systems, with real measurement and real accountability.
If you’re planning your Q1 roadmap, here’s the bet I’d make: build one agent that’s genuinely reliable, then replicate the pattern across support, ops, and sales. What’s the first workflow in your business where an agent could save time without putting customer trust at risk?