How We Built a Call Quality AI Agent

An AI call quality agent can monitor 100% of sales calls in real time — scoring each call against quality criteria, flagging issues immediately, and generating coaching summaries — replacing a QA function that would require dedicated headcount to operate at scale. We built one for a client running a large outbound sales operation. The operation scaled from 50 to 80+ agents in 3 months with the AI agent handling quality monitoring that would have otherwise required 6–8 additional QA hires.

This post covers the full build — the problem we were solving, the architecture we chose, the results, what it cost, and what we would do differently. First-person, from the team that built it.

What was the problem?

The client operated an outbound sales floor with 50 agents making hundreds of calls daily. Quality monitoring was manual — a 4-person QA team listened to a random 5% sample of calls, scored them against a 12-point quality checklist, and compiled weekly reports. The checklist covered compliance disclosures, call opening protocols, objection handling quality, prohibited language, closing procedures, and customer sentiment indicators.

The problems were predictable. At 5% sampling, most quality issues went undetected. When issues were caught, they were already days old — the agent had made dozens more calls with the same problem. The weekly reporting cycle meant coaching was reactive, not preventive. And the client wanted to scale to 80+ agents, which would have dropped the sampling rate to under 3% unless they doubled the QA team.
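As a rough check on that arithmetic, here is a minimal sketch. It assumes call volume grows in proportion to agent count and that the QA team's review capacity stays fixed. Both are simplifying assumptions on our part, and the function name is purely illustrative.

```python
def sampling_rate_after_growth(current_rate: float,
                               current_agents: int,
                               new_agents: int) -> float:
    """Sampling rate when review capacity is fixed and call volume
    scales linearly with the number of agents (an assumption)."""
    return current_rate * current_agents / new_agents


rate_at_80 = sampling_rate_after_growth(0.05, 50, 80)
rate_at_84 = sampling_rate_after_growth(0.05, 50, 84)
print(f"80 agents: {rate_at_80:.2%}")
print(f"84 agents: {rate_at_84:.2%}")
```

Under these assumptions the rate is a little over 3% at exactly 80 agents and drops below 3% from 84 agents upward, which is why "80+" pushes coverage under that line.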

The client's ask was specific: "We need to monitor every call without hiring more QA staff." That constraint — 100% coverage with no additional headcount — defined the project.

What does the agent architecture look like?

The agent has four layers. Each does one job.

Layer 1 — Audio ingestion. The agent connects to the client's telephony system via API. When a call starts, the audio stream is captured and fed into a speech-to-text pipeline. We used a real-time transcription service optimized for conversational speech with speaker diarization — distinguishing between the agent and the prospect. The transcription runs in near-real-time with a 3–5 second delay.

Layer 2 — Quality evaluation. The transcript feeds into an LLM-based evaluation engine. The engine scores the call against each of the 12 quality criteria. Each criterion has a specific definition, passing and failing examples, and a severity level. The LLM evaluates the transcript against each criterion independently, producing a structured score card with a score (pass/partial/fail), the evidence from the transcript, and a confidence level.

Layer 3 — Alert and routing. Violations above the severity threshold trigger immediate notifications. A compliance violation (highest severity) sends an alert to the floor manager within 60 seconds of detection. A quality issue (medium severity) queues for the QA team's daily review. A coaching opportunity (low severity) is logged for the agent's weekly one-on-one.

Layer 4 — Reporting and coaching. The agent generates per-agent performance dashboards, trend analysis (quality scores over time), and coaching summaries. Each coaching summary identifies the agent's top 2–3 improvement areas with specific call examples and suggested coaching talking points.
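The "top 2–3 improvement areas" step in Layer 4 can be sketched as a frequency count over an agent's failed criteria. This is an illustrative simplification rather than the production logic; the function name and the criterion labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_improvement_areas(failed_criteria: Iterable[str],
                          limit: int = 3) -> List[Tuple[str, int]]:
    """Return the most frequently failed criteria with their counts."""
    return Counter(failed_criteria).most_common(limit)


# One entry per failed criterion across an agent's calls this week
# (made-up labels for illustration).
failures = [
    "objection_handling", "closing_procedure", "objection_handling",
    "compliance_disclosure", "closing_procedure", "objection_handling",
]
areas = top_improvement_areas(failures, limit=2)
print(areas)
```

A real coaching summary would also attach the call excerpts behind each count, since the specific examples are what make the one-on-one conversation useful.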

What technical decisions mattered most?

Three decisions shaped the project's success.

First, we evaluated each quality criterion independently rather than asking the LLM to score the entire call at once. Early testing showed that a single-prompt approach ("score this call on all 12 criteria") produced inconsistent results — the model would anchor on the most obvious issue and under-evaluate other dimensions. Independent evaluation per criterion added processing time but dramatically improved accuracy and consistency.
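The per-criterion pattern can be sketched as follows. This is a minimal illustration, not our production code: the class names, field names, and the `keyword_judge` stand-in are all hypothetical, and a real judge would wrap an LLM call with one prompt per criterion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    name: str
    definition: str   # in this toy example, a phrase to look for
    severity: str     # "high", "medium", or "low"


@dataclass(frozen=True)
class Evaluation:
    criterion: str
    score: str         # "pass", "partial", or "fail"
    evidence: str      # transcript excerpt supporting the score
    confidence: float  # 0.0 to 1.0


# A judge sees the transcript and exactly ONE criterion at a time.
Judge = Callable[[str, Criterion], Evaluation]


def evaluate_call(transcript: str, criteria: List[Criterion],
                  judge: Judge) -> List[Evaluation]:
    """Run one independent evaluation per criterion, not one combined prompt."""
    return [judge(transcript, criterion) for criterion in criteria]


def keyword_judge(transcript: str, criterion: Criterion) -> Evaluation:
    """Stand-in for the LLM: passes if the definition phrase appears."""
    phrase = criterion.definition.lower()
    found = phrase in transcript.lower()
    return Evaluation(criterion.name, "pass" if found else "fail",
                      evidence=phrase if found else "", confidence=0.9)


criteria = [
    Criterion("recording_disclosure", "this call is recorded", "high"),
    Criterion("opening_greeting", "thanks for taking my call", "low"),
]
scorecard = evaluate_call("Hi, thanks for taking my call today.",
                          criteria, keyword_judge)
for evaluation in scorecard:
    print(evaluation.criterion, evaluation.score)
```

Because each criterion gets its own call, the evaluations can also run concurrently, which recovers some of the added processing time.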

Second, we built the severity-based routing system from day one instead of treating all violations equally. The floor manager told us: "If there's a compliance violation, I need to know immediately. If someone forgot the opening greeting, that can wait until coaching." Separating real-time alerts from batch reviews prevented alert fatigue, a common reason monitoring systems end up ignored.
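A minimal sketch of severity-based routing, using the three tiers described in the architecture section. The enum, the destination names, and the function are hypothetical labels chosen for illustration; the actual delivery mechanism (pager, queue, log) is not shown.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"      # compliance violation
    MEDIUM = "medium"  # quality issue
    LOW = "low"        # coaching opportunity


# Destination per severity tier (names are illustrative).
ROUTES = {
    Severity.HIGH: "floor_manager_alert",      # real time, within 60 seconds
    Severity.MEDIUM: "qa_daily_review_queue",  # batched for the QA team
    Severity.LOW: "weekly_one_on_one_log",     # saved for coaching
}


def route_violation(severity: Severity) -> str:
    """Pick the destination for a violation based on its severity."""
    return ROUTES[severity]


print(route_violation(Severity.HIGH))
print(route_violation(Severity.LOW))
```

Keeping the mapping in one table makes it easy to review with the floor manager and to change when a criterion's severity is reclassified.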

Third, we included confidence scores in every evaluation. When the model's confidence was below 80% on a criterion, the evaluation was flagged for human review rather than auto-scored. This caught edge cases — sarcasm, background noise affecting transcription, calls in mixed languages — that would have generated false positives. Roughly 8% of evaluations got flagged for human review in the first month, dropping to under 3% as we tuned the prompts.
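The confidence gate can be sketched like this. The 80% threshold comes from the description above; the function name, the tuple layout, and the sample values are assumptions made for illustration.

```python
from typing import List, Tuple

REVIEW_THRESHOLD = 0.80  # below this, a human reviews the evaluation


def split_by_confidence(
    evaluations: List[Tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[str], List[str]]:
    """Return (auto_scored, needs_review) lists of criterion names."""
    auto_scored = [name for name, conf in evaluations if conf >= threshold]
    needs_review = [name for name, conf in evaluations if conf < threshold]
    return auto_scored, needs_review


# (criterion name, model confidence) pairs with made-up values.
evaluations = [
    ("recording_disclosure", 0.97),
    ("prohibited_language", 0.62),  # e.g. possible sarcasm
    ("closing_procedure", 0.80),    # exactly at the threshold: auto-scored
]
auto_scored, needs_review = split_by_confidence(evaluations)
print("auto:", auto_scored)
print("review:", needs_review)
```

Tracking the share of evaluations that land in the review list over time gives a simple measure of how prompt tuning is progressing.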

What were the results?

The agent went into production monitoring 100% of calls for the existing 50-agent team. Within two weeks, it identified three agents with consistent compliance gaps that the 5% sampling had missed entirely. Those agents received immediate targeted coaching.

Over 3 months, the operation scaled from 50 to 80+ agents. The QA team stayed at 4 people. Without the AI agent, that scale-up would have required 6–8 additional QA hires at $35K–$45K each — $210K–$360K in annual salary that was avoided.

Overall quality scores improved 15% across the floor. The improvement came from two factors: issues caught same-day instead of same-week, and coaching conversations driven by specific call examples rather than general feedback.

The floor manager's feedback, paraphrased: "We used to find out about problems on Friday. Now we find out about them in real time. The agents know every call is being monitored, not just a random sample. That alone changed behavior."

What did this cost to build and run?

Build cost: approximately $60K over 10 weeks. This covered architecture design, telephony integration, transcription pipeline, LLM evaluation engine, alerting system, reporting dashboard, and deployment.

Monthly operating cost at 80 agents: approximately $3,500. This breaks down to $2,000 for LLM API calls (evaluating 12 criteria per call across hundreds of daily calls), $800 for transcription API, $400 for infrastructure (compute, storage, CDN), and $300 for monitoring and logging.

The client also retained us (Madgeek) on a $3,000/month maintenance and tuning retainer covering prompt optimization, new quality criteria, and adaptation to changes in compliance requirements.

Total first-year cost: approximately $138K ($60K build + $78K operating and maintenance). Compared to the alternative — 6–8 additional QA hires at $210K–$360K per year — the agent paid for itself within the first year with significant margin.
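The cost figures above can be reproduced with a few lines of arithmetic. All inputs come from this post; the variable names are ours, and the comparison assumes the avoided hires would have been on payroll for the full year.

```python
# Figures quoted in this post, in US dollars.
build_cost = 60_000
monthly_operating = 2_000 + 800 + 400 + 300  # LLM, transcription, infra, monitoring
monthly_retainer = 3_000

first_year_running = 12 * (monthly_operating + monthly_retainer)
first_year_total = build_cost + first_year_running

# Avoided QA hires: 6 to 8 people at $35K to $45K each.
avoided_low = 6 * 35_000
avoided_high = 8 * 45_000

print(f"Monthly operating: ${monthly_operating:,}")
print(f"First-year total:  ${first_year_total:,}")
print(f"Avoided salaries:  ${avoided_low:,} to ${avoided_high:,}")
print(f"First-year margin: ${avoided_low - first_year_total:,} "
      f"to ${avoided_high - first_year_total:,}")
```

From the second year on, the build cost drops out and only the running costs remain to weigh against the avoided salaries.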

What would we do differently?

We learned three things that would change our approach on a future build.

First, we would build the coaching summary generation earlier. We added it in week 8, but the QA team was asking for it by week 3. Automated coaching summaries turned out to be more valuable to them than the real-time alerts — because the summaries gave them specific conversation starters for one-on-ones instead of generic feedback.

Second, we would invest more time in transcription quality during the first two weeks. The early transcription errors (proper nouns, technical terms, industry jargon) cascaded into evaluation errors. We spent weeks 4–6 building a custom vocabulary layer for the transcription service. Starting with that would have saved time overall.

Third, we would build a calibration workflow from day one. The QA team needed to review and agree on how the AI scored edge cases. We built this ad hoc during the first month. A structured calibration process — where the QA team reviews a sample of AI scores and provides corrections — should be designed into the agent from the start. It improves accuracy faster and gives the QA team ownership of the quality standards.

Is this approach right for your sales operation?

An AI call quality agent makes sense when three conditions are true. You have 30+ agents making calls (below that, manual QA is sufficient and cheaper). You have defined quality criteria (if you don't have a quality checklist, you need one before you automate monitoring). And you have a telephony system with an API that exposes call audio (most modern systems do — Five9, Twilio, Genesys, RingCentral all support this).

If those three conditions are true, the build typically pays for itself in 6–12 months through avoided QA hires and improved quality outcomes. The real value, though, is scale — the agent lets you grow the operation without linearly growing the QA team.

Written by

Abhijit Das

CEO

Building AI tools for businesses from legacy to new age SaaS startups
