AI Agent Audit: 7 Questions Before Hiring

Before hiring an AI agent development company, ask these seven questions — they separate teams that have shipped production agents from teams that have only built demos.

The AI agent market in 2026 is flooded with vendors who can build a compelling proof of concept in two weeks. The hard part isn't the demo. It's what happens when the agent runs in production, handles edge cases your test data didn't cover, and needs to be trusted with real business decisions. These seven questions expose the gap.

Question 1: Can you show me an agent running in production right now?

This is the filter question. In our experience, it eliminates roughly 70% of vendors immediately.

A demo running on curated test data proves nothing about production readiness. Production means: the agent handles real user inputs, runs on real infrastructure with uptime requirements, has been operating for more than 30 days, and has a human escalation path when it encounters something unexpected.

Ask for specifics. How many transactions or decisions has the agent processed? What's the error rate? What's the average latency? A vendor who has shipped production agents can answer these without checking their notes. A vendor who has only built demos will pivot to talking about their architecture or their team's credentials.

At Madgeek, we can point to a specific production agent: an AI-powered call quality monitoring system for a contact centre operation that scaled from 50 to 80+ agents in three months. It processes real calls, scores quality in real time, and flags coaching opportunities. It's been running since 2024.

Question 2: How does the agent handle failures and edge cases?

Every AI agent fails. The question is not whether it fails — it's what happens when it does.

A production-ready agent needs three failure mechanisms. First, graceful degradation: when the AI component returns a low-confidence result, the agent routes to a human or falls back to a rule-based decision rather than proceeding with a bad answer. Second, circuit breakers with retries: when an external API (LLM endpoint, database, third-party service) goes down, the agent doesn't crash or hang. It stops calling the failing service for a cooldown period, queues the work, and retries with backoff. Third, audit trails: every decision the agent makes is logged with the inputs, the reasoning, and the output, so a human can review what happened when something goes wrong.
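To make these mechanisms concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Decision` and `AuditLog` types, the `decide` function, and the 0.8 confidence threshold are assumptions, not a description of any particular vendor's system. It shows confidence-based escalation, retry with exponential backoff (a simpler cousin of a full circuit breaker), and one audit entry per decision.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off; tune per use case


@dataclass
class Decision:
    output: str
    confidence: float


@dataclass
class AuditLog:
    entries: List[str] = field(default_factory=list)

    def record(self, inputs: str, reasoning: str, output: str) -> None:
        # One JSON line per decision so a human can review it later.
        self.entries.append(json.dumps(
            {"inputs": inputs, "reasoning": reasoning, "output": output}))


def call_with_backoff(fn: Callable[[], Decision], max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Decision:
    """Retry a flaky call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


def decide(inputs: str, model_call: Callable[[], Decision], log: AuditLog,
           sleep: Callable[[float], None] = time.sleep) -> str:
    try:
        decision = call_with_backoff(model_call, sleep=sleep)
    except ConnectionError:
        # Dependency is down: park the work instead of crashing.
        log.record(inputs, "model unavailable after retries", "queued")
        return "queued"
    if decision.confidence < CONFIDENCE_THRESHOLD:
        # Graceful degradation: hand low-confidence cases to a human.
        log.record(inputs, f"confidence {decision.confidence:.2f} below threshold",
                   "escalate_to_human")
        return "escalate_to_human"
    log.record(inputs, f"confidence {decision.confidence:.2f}", decision.output)
    return decision.output
```

A real deployment would likely put the queued work on a durable queue and add a proper circuit breaker that stops calling a failing service for a cooldown period; the sketch only marks the place where those would plug in.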

Ask the vendor to walk you through a specific failure scenario from a production deployment. What went wrong, how did the system handle it, and what changed afterward? Vendors with production experience have war stories. Vendors without them have theoretical frameworks.

Question 3: What's the architecture — and why?

The architecture question isn't about testing technical knowledge. It's about testing whether the vendor makes deliberate design decisions or defaults to whatever framework they learned last.

Ask: why did you choose this LLM versus alternatives? Why this orchestration framework versus building your own? Why this deployment model versus alternatives? Each answer should include a tradeoff — "we chose X because of Y, which means we accepted Z." If every answer is "we use the best technology," the vendor is choosing based on marketing, not engineering judgment.

Specific red flags in the architecture: single-model dependency with no fallback (if OpenAI goes down, your agent goes down), no caching layer for repeated queries (expensive and slow), synchronous processing for tasks that should be asynchronous (the agent blocks while waiting for a 30-second LLM call), and no separation between the agent logic and the LLM calls (making it impossible to switch models later).

A good architecture for most business AI agents in 2026: multi-model with routing (fast model for classification, capable model for complex reasoning), asynchronous task processing, result caching, human-in-the-loop checkpoints for high-stakes decisions, and monitoring on every LLM call.
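As a rough illustration of routing, fallback, and caching, here is a small Python sketch. The `ModelRouter` class and the "simple"/"complex" task kinds are hypothetical names chosen for this example, and the model functions stand in for real provider clients. Asynchronous processing and human-in-the-loop checkpoints are left out to keep it short.

```python
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]  # stand-in for a real LLM client call


class ModelRouter:
    """Route by task kind, fall back across providers, cache results."""

    def __init__(self, fast: List[ModelFn], capable: List[ModelFn]):
        # Each tier is an ordered list: primary first, fallbacks after.
        self.tiers: Dict[str, List[ModelFn]] = {"simple": fast, "complex": capable}
        self.cache: Dict[Tuple[str, str], str] = {}

    def run(self, task_kind: str, prompt: str) -> str:
        key = (task_kind, prompt)
        if key in self.cache:
            return self.cache[key]  # repeated query: no LLM call, no cost
        last_error = None
        for model in self.tiers[task_kind]:
            try:
                result = model(prompt)
            except ConnectionError as exc:
                last_error = exc  # provider down: try the next one
                continue
            self.cache[key] = result
            return result
        raise RuntimeError("all providers failed") from last_error
```

Because the agent logic only ever talks to `run`, swapping a provider means changing the lists passed to the constructor rather than touching the agent itself.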

Question 4: How do you monitor and evaluate agent performance?

A demo doesn't need monitoring. A production agent does. The monitoring question reveals whether the vendor thinks in production terms.

At minimum, a production AI agent needs: latency tracking per step (where is the agent spending time?), cost tracking per interaction (what does each LLM call actually cost?), accuracy evaluation against ground truth (is the agent getting better or worse over time?), and drift detection (are the inputs changing in ways the agent wasn't built or tested for?).
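Latency tracking per step is the easiest of these to picture. The sketch below is one possible shape, assuming a hypothetical `StepTimer` helper; a production stack would more likely export these timings to a metrics system than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Dict, List, Optional


class StepTimer:
    """Record how long each named step of an agent run takes."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.durations: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def step(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Recorded even if the step raises, so failures are timed too.
            self.durations[name].append(self.clock() - start)

    def slowest(self) -> Optional[str]:
        """Name of the step with the largest total time, or None if empty."""
        if not self.durations:
            return None
        totals = {name: sum(values) for name, values in self.durations.items()}
        return max(totals, key=totals.get)
```

Usage is a `with timer.step("llm_call"):` block around each stage of the agent, after which `timer.slowest()` answers the "where is the agent spending time?" question.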

Ask the vendor what dashboard they provide. If the answer is "we can set something up," they don't have a standard monitoring practice. A vendor who has shipped production agents has a monitoring stack they deploy by default because they've been burned by not having one.

The cost monitoring is particularly important. LLM API costs can spike 10x when the agent hits an edge case that triggers long reasoning chains. Without cost monitoring per interaction, you discover this on your monthly bill — after spending $15,000 on a system budgeted for $1,500.
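A per-interaction cost tracker does not need to be elaborate. The following Python sketch assumes a hypothetical price table (the model names and per-million-token prices are made up for illustration) and flags any interaction whose accumulated cost exceeds a multiple of the expected cost.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICES_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "fast-model": (0.50, 1.50),
    "capable-model": (5.00, 15.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


class CostTracker:
    def __init__(self, expected_cost_usd: float, alert_multiple: float = 10.0):
        self.expected_cost_usd = expected_cost_usd
        self.alert_multiple = alert_multiple
        self.cost_by_interaction: Dict[str, float] = defaultdict(float)

    def record(self, interaction_id: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Add one call to an interaction and return its running total."""
        self.cost_by_interaction[interaction_id] += call_cost(
            model, input_tokens, output_tokens)
        return self.cost_by_interaction[interaction_id]

    def over_budget(self) -> List[str]:
        """Interactions costing more than alert_multiple x the expected cost."""
        limit = self.expected_cost_usd * self.alert_multiple
        return sorted(i for i, cost in self.cost_by_interaction.items()
                      if cost > limit)
```

Checking `over_budget()` after each interaction, rather than at month end, is what turns a surprise invoice into an alert.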

Question 5: What happens when we need to change the agent's behaviour?

Business rules change. Agent behaviour needs to change with them. How hard is that?

A well-built agent separates business logic from AI logic. Changing "route orders over $10K to senior review" to "route orders over $25K" should be a configuration change, not a code deployment. Changing the agent's tone in customer communications should be a prompt update, not an engineering sprint.
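The order-threshold example might look like this in practice. This is a minimal Python sketch with hypothetical names (`senior_review_threshold_usd`, `route_order`); the point is only that the number lives in configuration, not in code.

```python
import json

# In a real system this would be a config file or settings service.
# The key name is an assumption made for this example.
CONFIG_JSON = '{"senior_review_threshold_usd": 10000}'


def load_rules(raw: str) -> dict:
    return json.loads(raw)


def route_order(amount_usd: float, rules: dict) -> str:
    """Apply the business rule; the threshold comes from config."""
    if amount_usd > rules["senior_review_threshold_usd"]:
        return "senior_review"
    return "standard_processing"
```

Raising the threshold to $25K means editing one value in the configuration and re-running the tests; `route_order` itself does not change.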

Ask: if our business rules change next quarter, what does it take to update the agent? If the answer is "a development sprint," the agent is hard-coded. If the answer is "we update the configuration and test," the architecture was built for change.

This matters more than most buyers realise. The first version of any AI agent is wrong. Not badly wrong — but the business will learn things from watching the agent work that change how they want the agent to behave. A system that requires engineering time for every adjustment becomes expensive to iterate. A system with configurable business rules improves weekly.

Question 6: What does the handoff to our team look like?

Some vendors build agents and hand them off. Some operate them as a managed service. Some do both. The wrong choice here creates dependency or abandonment.

If you have an engineering team that can maintain the agent post-launch, you want a clean handoff: documented codebase, runbook for common issues, access to all monitoring, and a defined support period (typically 3–6 months) where the vendor is available for escalations. Ask to see a handoff document from a previous project.

If you don't have engineering capacity, you need a managed service agreement: the vendor monitors, maintains, and updates the agent on an ongoing basis. Ask what that costs monthly, what the SLA is, and what happens if you want to switch vendors later. Can you take the code? Is there vendor lock-in?

The worst outcome is a handoff to a team that can't maintain the system, with no managed service agreement. The agent degrades over 6–12 months as models update, APIs change, and business rules evolve — until someone asks "why isn't the AI working anymore?" and the answer is "nobody's been maintaining it."

Question 7: What's the real cost — including ongoing?

The build cost is typically 40–60% of the first-year total. The rest is LLM API costs, infrastructure, monitoring, and maintenance. A vendor who quotes only the build cost is either inexperienced or deliberately lowballing.

Get a complete cost breakdown. Development: typically $40,000–$150,000 depending on complexity. LLM API costs: $500–$5,000/month depending on volume and model choice. Infrastructure: $200–$2,000/month for hosting, databases, and queues. Monitoring and maintenance: $2,000–$5,000/month if managed, less if handed off.
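Adding these ranges up gives a first-year picture. The short Python sketch below uses only the figures quoted above; the `CostRange` type and the choice to pair low estimates with low and high with high are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CostRange:
    low: float
    high: float


BUILD = CostRange(40_000, 150_000)
MONTHLY: Dict[str, CostRange] = {
    "llm_api": CostRange(500, 5_000),
    "infrastructure": CostRange(200, 2_000),
    "monitoring_maintenance": CostRange(2_000, 5_000),  # managed-service case
}


def first_year_total(build: CostRange, monthly: Dict[str, CostRange],
                     months: int = 12) -> CostRange:
    """Build cost plus recurring costs over the given number of months."""
    low = build.low + months * sum(c.low for c in monthly.values())
    high = build.high + months * sum(c.high for c in monthly.values())
    return CostRange(low, high)


def build_share(build: CostRange, total: CostRange) -> CostRange:
    """Fraction of the first-year total that is build cost."""
    return CostRange(build.low / total.low, build.high / total.high)
```

With these inputs the first-year total comes to $72,400 at the low end and $294,000 at the high end, and the build accounts for roughly 55% and 51% of each respectively.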

Ask the vendor to estimate LLM costs based on your expected volume. If they can't — if they don't know how many tokens a typical interaction consumes — they haven't built this kind of system at scale. Token economics are something production teams track obsessively because the difference between GPT-4-class and GPT-3.5-class models can be 10–30x in cost.
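The estimate itself is simple arithmetic once the token counts are known. Here is a hedged Python sketch; the volume, token counts, and prices in the usage note are hypothetical placeholders, not quotes from any provider.

```python
def monthly_llm_cost(interactions_per_month: int,
                     input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Projected monthly LLM spend in USD for a given volume and model price."""
    per_interaction = (input_tokens * input_price_per_million
                       + output_tokens * output_price_per_million) / 1_000_000
    return interactions_per_month * per_interaction
```

For example, at 50,000 interactions per month with 2,000 input and 500 output tokens each, a model priced at $10/$30 per million tokens would cost about $1,750 per month, while one priced at $0.50/$1.50 would cost about $87.50. A vendor who tracks token usage should be able to fill in these numbers for your workload.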

At Madgeek, we include cost modelling in every AI agent proposal. The client sees the projected monthly cost at their expected volume, with ranges for growth scenarios. No surprises on month-three invoices.

What do the answers tell you?

Score each question pass/fail. A vendor who passes all seven has production experience, thinks in operational terms, and will build something that works beyond the demo. A vendor who passes four or fewer is building their production skills on your project — which might be acceptable at a lower price point, but you should know that going in.

The single most important question is the first one. Production experience. Everything else follows from it. A team that has shipped and maintained a production AI agent for six months has encountered and solved problems that a demo-stage team hasn't imagined yet.

We've been building production AI systems since 2023 — call quality agents, lead scoring systems, manufacturing cost estimators. Every project taught us something that changed how we build the next one. That accumulation of production knowledge is what separates a quote from a genuine estimate.

Written by

Abhijit Das

CEO

Building AI tools for businesses from legacy to new age SaaS startups

