AI Receptionist

Voice AI Development: A Production AI Receptionist on Our Live Phone Line

Azumo designed, built, and deployed an AI receptionist that handles every inbound call to the company's production phone line. Three weeks of development. A custom voice orchestration pipeline that gives Azumo full control over every stage of a live phone conversation.

Azumo needed a receptionist that could answer calls around the clock, route callers to the right person in real time, take structured messages, book appointments, and answer company questions from a knowledge base. After evaluating two third-party voice AI platforms and finding both inadequate, Azumo built the entire voice pipeline from raw components and deployed it on its own business phone line. The system has been in production since early 2026.

Results:

1.7s

Median Response Time

P50 pipeline latency across all production conversation turns

76%

Turns Under 2 Seconds

Share of the 512 measured conversation turns that completed in under 2 seconds

0

Downtime Events

Zero downtime since production deployment. Available 24 hours a day, 7 days a week.

The Challenge

Azumo's inbound phone line was handled by a third-party answering service. Callers reached a human operator who took a message and emailed it to the team. Response times were slow. Context was lost. Callers who wanted to speak with a specific person had no way to reach them directly.

The first attempt to solve this used a third-party voice AI platform. It was a black box. There was no way to customize pronunciation, no visibility into why responses were slow, and no control over the caller experience. Latency was unpredictable. The system could not perform real-time actions like looking up a contact or checking a calendar.

The second attempt used a telecom provider's built-in conversational AI relay. Latency improved, but the turn-taking was unreliable. The system would interrupt callers mid-sentence or sit in silence after a clear question. It could not handle the ambiguity that real phone conversations require: multiple contacts with the same first name, callers who change their mind mid-sentence, requests that require checking availability before responding.

Neither platform supported the features Azumo needed: real-time contact lookup against a live directory, Slack-based call routing with accept and decline, custom pronunciation for names and industry terms, post-call AI processing, or per-turn latency instrumentation. The core problem was clear. Off-the-shelf platforms optimize for the general case. A receptionist is not a general case.

The Solution

Azumo built the entire voice pipeline from raw components: Twilio Media Streams for telephony, Deepgram for speech-to-text, Anthropic Claude for conversation intelligence, and ElevenLabs for text-to-speech. The orchestrator manages real-time audio streaming, conversation state, tool execution, interruption handling, and latency optimization across approximately 11,700 lines of proprietary code.

Seven Real-Time Tools

During a live call, the AI receptionist executes real-time actions without leaving the conversation. It looks up contacts by name against the tenant directory, using fuzzy matching to handle ambiguous names when multiple people share a first name. It sends a real-time Slack notification to the requested contact with Accept and Decline buttons. If the contact accepts, the system transfers the call. If they do not respond, the receptionist takes a structured message. It checks Google Calendar availability and books appointments during the call. It lists available appointment types and ends calls gracefully when the conversation is complete.
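The contact lookup step can be sketched in a few lines. This is a minimal illustration, not Azumo's implementation: it assumes a small in-memory directory, uses Python's standard `difflib` for fuzzy matching, and the `Contact` fields, the 0.8 threshold, and the `next_step` labels are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Contact:
    name: str
    slack_id: str  # hypothetical field, used only for illustration


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(query: str, directory: list[Contact], threshold: float = 0.8) -> list[Contact]:
    """Return contacts whose full name or first name resembles the query, best first."""
    scored = []
    for contact in directory:
        first_name = contact.name.split()[0]
        score = max(similarity(query, contact.name), similarity(query, first_name))
        if score >= threshold:
            scored.append((score, contact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [contact for _, contact in scored]


def next_step(matches: list[Contact]) -> str:
    """Map a lookup result to a conversational move (labels are assumptions)."""
    if not matches:
        return "not_found"
    if len(matches) == 1:
        return "route"
    return "disambiguate"  # several people share the name, so ask the caller which one


# Made-up directory entries, not real people.
DIRECTORY = [
    Contact("Maria Lopez", "U001"),
    Contact("Maria Chen", "U002"),
    Contact("Daniel Okafor", "U003"),
]
```

With this directory, `lookup("Maria", DIRECTORY)` returns both Marias and `next_step` asks for disambiguation, while a slightly misheard "Danial" still resolves to a single contact.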

Late-Accept Recovery

If a contact misses the initial routing notification and clicks "I'm Available" 15 to 30 seconds later, the system pivots mid-conversation from message-taking to call transfer. This addresses a common real-world scenario that neither of the third-party platforms Azumo evaluated could handle: the person was available but saw the notification late.
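One way to picture the late-accept pivot is as a small state machine. The sketch below is an assumption about how such logic could be organized, not the production code: the state names and the `RoutingSession` class are hypothetical, and a late accept is honored for as long as the call is still active.

```python
from enum import Enum


class RoutingState(Enum):
    NOTIFIED = "notified"              # notification sent, waiting for the contact
    TAKING_MESSAGE = "taking_message"  # no reply yet, receptionist takes a message
    TRANSFERRING = "transferring"      # contact accepted, call is handed over
    ENDED = "ended"                    # caller hung up or the call completed


class RoutingSession:
    """Tracks one routing attempt during a live call (illustrative only)."""

    def __init__(self) -> None:
        self.state = RoutingState.NOTIFIED

    def on_no_response(self) -> None:
        """The contact did not answer in time, so fall back to message-taking."""
        if self.state is RoutingState.NOTIFIED:
            self.state = RoutingState.TAKING_MESSAGE

    def on_accept(self) -> bool:
        """Handle an accept click; return True if the call will be transferred."""
        if self.state in (RoutingState.NOTIFIED, RoutingState.TAKING_MESSAGE):
            # An accept during message-taking is the late-accept case.
            self.state = RoutingState.TRANSFERRING
            return True
        return False  # too late: the call has already ended or been transferred

    def on_hangup(self) -> None:
        self.state = RoutingState.ENDED
```

The key design point is that message-taking is not a terminal state: an accept that arrives while the caller is still on the line is treated the same as an on-time accept.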

Pronunciation Control

The system controls pronunciation in both directions. ElevenLabs pronunciation dictionaries ensure the AI says names and terms correctly. Deepgram keyword boosting ensures speech-to-text recognizes those same names when callers say them. An IPA audition tool lets administrators preview and tune pronunciations before they go live.
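Keeping both directions in sync is easier when the text-to-speech rules and the speech-to-text keyword list are derived from one glossary. The sketch below shows that idea only. The `GlossaryEntry` fields, the output shapes, and the example term "Zyntrex" are hypothetical and are not the ElevenLabs or Deepgram request formats, which should be taken from the vendors' documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    term: str               # spelling as it appears in prompts and transcripts
    ipa: str                # how the TTS voice should pronounce it
    stt_boost: float = 1.0  # how strongly STT should favor this term


def tts_rules(glossary: list[GlossaryEntry]) -> list[dict[str, str]]:
    """Pronunciation rules for the TTS side (generic shape, not a vendor schema)."""
    return [{"text": entry.term, "ipa": entry.ipa} for entry in glossary]


def stt_keywords(glossary: list[GlossaryEntry]) -> list[str]:
    """Boosted keywords for the STT side, written as 'term:boost' strings."""
    return [f"{entry.term}:{entry.stt_boost:g}" for entry in glossary]


# A made-up product name used purely as an example.
GLOSSARY = [GlossaryEntry(term="Zyntrex", ipa="ˈzɪntrɛks", stt_boost=2.0)]
```

Adding a name to the glossary once then updates both what the receptionist says and what it is primed to hear.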

Post-Call Intelligence

After every call, three parallel AI processes run: a conversation summary, sentiment analysis, and structured message extraction (caller name, reason for calling, callback number, message body, requested contact). A separate high-quality stereo transcription is produced. Notifications dispatch via Slack, email, or SMS with all structured data attached.
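Running the three post-call processes in parallel maps naturally onto `asyncio.gather`. The following is a minimal sketch under stated assumptions: the three coroutines are stand-ins that return placeholder values rather than calling any model, and the function names and dictionary keys are hypothetical.

```python
import asyncio


async def summarize(transcript: str) -> str:
    await asyncio.sleep(0)  # stand-in for a model call
    return transcript[:60]


async def analyze_sentiment(transcript: str) -> str:
    await asyncio.sleep(0)  # stand-in for a model call
    return "neutral"


async def extract_message(transcript: str) -> dict:
    await asyncio.sleep(0)  # stand-in for a model call
    # The five fields named in the text; values are left empty in this sketch.
    return {
        "caller_name": None,
        "reason_for_calling": None,
        "callback_number": None,
        "message_body": transcript,
        "requested_contact": None,
    }


async def process_call(transcript: str) -> dict:
    """Run the three analyses concurrently and collect their results."""
    summary, sentiment, message = await asyncio.gather(
        summarize(transcript),
        analyze_sentiment(transcript),
        extract_message(transcript),
    )
    return {"summary": summary, "sentiment": sentiment, "message": message}
```

Calling `asyncio.run(process_call(transcript))` returns one dictionary that a notifier could attach to a Slack, email, or SMS message. Because the tasks run concurrently, total wait time is roughly that of the slowest task rather than the sum of all three.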

Per-Turn Latency Instrumentation

Every conversation turn is measured across the full pipeline: LLM time-to-first-token, TTS processing, and audio encoding. A dedicated latency dashboard shows per-call and per-turn breakdowns, silence distribution histograms, tool versus non-tool turn comparison, and time-series trends. This instrumentation is what allows Azumo to identify bottlenecks and optimize systematically rather than guessing.

The instrumentation also surfaces specific latency drivers. Tool-call overhead, prompt token count, time-of-day API load, and continuation turns after tool execution each contribute measurable, consistent latency. By isolating each factor, Azumo can optimize what it controls and quantify what it cannot.
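Per-turn measurement of this kind can be sketched with a context manager that times each pipeline stage. This is an illustration under assumptions, not the production instrumentation: `TurnTiming`, `timed_stage`, and the stage labels are hypothetical names, and the clock is injectable so the behavior can be checked deterministically.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class TurnTiming:
    """Latency record for one conversation turn."""
    used_tool: bool = False
    stages_ms: dict[str, float] = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        return sum(self.stages_ms.values())


@contextmanager
def timed_stage(turn: TurnTiming, stage: str, clock=time.perf_counter):
    """Add the elapsed wall-clock time of the enclosed block to the given stage."""
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        turn.stages_ms[stage] = turn.stages_ms.get(stage, 0.0) + elapsed_ms


# Usage: wrap each stage of a turn; the stage names here are assumptions.
turn = TurnTiming(used_tool=True)
with timed_stage(turn, "llm_ttft"):
    pass  # wait for the first LLM token here
with timed_stage(turn, "tts"):
    pass  # synthesize speech here
```

Recording `used_tool` alongside the stage timings is what makes a tool versus non-tool comparison possible later.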

Results

The AI receptionist has been handling all inbound calls to Azumo's production phone line since early 2026. It answers around the clock, including outside business hours, weekends, and holidays.

Production data from April through May 2026 (512 measured conversation turns, corrected for instrumentation artifacts):

The median caller wait between speaking and hearing a response is 1.7 seconds. 76% of all conversation turns complete in under 2 seconds. The P90 is 2.4 seconds. These numbers include turns where the system executes real-time tool calls (contact lookup, calendar checks, Slack routing), which account for 28% of all turns and add only 184 milliseconds of latency compared to plain conversation turns.
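For readers unfamiliar with these statistics, the sketch below shows how P50, P90, and the share of turns under a threshold can be computed from per-turn latencies. It uses the nearest-rank percentile definition, which is an assumption since the source does not say which method was used, and the ten sample values are synthetic, not production data.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def share_under(values: list[float], limit: float) -> float:
    """Fraction of samples strictly below the limit."""
    return sum(1 for v in values if v < limit) / len(values)


# Synthetic per-turn latencies in seconds, for illustration only.
SAMPLE_TURNS_S = [0.9, 1.1, 1.3, 1.5, 1.6, 1.8, 1.9, 2.2, 2.6, 3.4]

p50 = percentile(SAMPLE_TURNS_S, 50)      # 1.6
p90 = percentile(SAMPLE_TURNS_S, 90)      # 2.6
under_2s = share_under(SAMPLE_TURNS_S, 2.0)  # 0.7
```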

The pipeline breaks down as follows: LLM time-to-first-token accounts for 64% of response time at 1,150 milliseconds average. TTS processing accounts for 26% at 467 milliseconds. Audio encoding accounts for 10% at 181 milliseconds. The system achieves a 91% prompt cache hit rate at the turn level, reducing both cost and response variability.
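The percentages above follow directly from the three stage averages. A short check, using only the figures quoted in the text (the dictionary keys are just labels chosen for this example):

```python
# Average time per stage in milliseconds, as reported above.
STAGE_AVG_MS = {"llm_ttft": 1150, "tts": 467, "audio_encoding": 181}

total_ms = sum(STAGE_AVG_MS.values())  # 1798

# Each stage's share of the total, rounded to a whole percent.
shares = {stage: round(100 * ms / total_ms) for stage, ms in STAGE_AVG_MS.items()}
# {'llm_ttft': 64, 'tts': 26, 'audio_encoding': 10}
```

The stage averages sum to about 1.8 seconds, slightly above the 1.7-second median, which is what one would expect when a small number of slow turns pull the mean upward.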

Response time remains flat across call length. A turn late in a 3-minute conversation performs the same as a turn in the first 30 seconds. Tool calls add an average of 184 milliseconds compared to plain conversation turns. The remaining latency outliers (3.5% of turns exceeding 3 seconds) trace to upstream API response time variability, not pipeline issues. Azumo's instrumentation isolates the cause per turn, so the team can distinguish between problems it can fix and external factors it cannot.

The system has experienced zero downtime since deployment. Every call produces a structured summary, transcript, and sentiment analysis delivered to the team within seconds of hang-up. Call routing, message-taking, and appointment booking operate without per-call human configuration.

The receptionist runs on the same phone number that prospective clients call to discuss AI projects. It serves as both a production business tool and as primary evidence of Azumo's capability to build, deploy, and operate real-time AI agent systems.

This is how Azumo validates what it sells. The AI receptionist is not a demo or a prototype. It handles the same calls that drive the business. When Azumo tells a client it can build a production AI agent, the proof is on the other end of the phone line.
