Author: Mark Valman, Cloud Solution Architect

The Challenge

A customer approached us with a straightforward request:

They wanted users to dial a phone number and have a natural conversation with an AI agent that could recommend restaurants, answer menu and opening-hours questions, and initiate reservations or delivery orders.

The technical reality was less trivial.

Voice interactions are unforgiving. Even small delays break conversational flow, and stitching together multiple speech and AI services often introduces latency, complexity, and brittle integrations.

The key challenges were:

Natural, real-time conversations with minimal end-to-end latency

Dynamic restaurant recommendations based on user preferences mid-call

Multi-turn context handling, including follow-up questions on specific venues

Telephony integration with existing phone infrastructure

Scalability, supporting multiple concurrent inbound calls without manual intervention

This POC needed to prove that a speech-to-speech AI agent could feel fluid, responsive, and production-viable, without overengineering the stack.

What The Customer Needed

Low-latency voice AI that didn’t feel robotic or laggy

Simplified architecture without chaining multiple speech and LLM services

Fast iteration to validate conversational flows before production investment

Cloud-native scalability with minimal operational overhead

Clear production path, even if the first phase was experimental

The Solution: Azure Voice Live API with Call Center Accelerator

The solution was built using the Azure Call Center Voice Agent Accelerator, powered by the Azure Voice Live API.

Instead of orchestrating separate ASR, LLM, and TTS services, the Voice Live API provided a single, unified speech-to-speech endpoint, significantly reducing latency and architectural complexity.

The POC focused on validating conversational quality, system responsiveness, and operational feasibility under real phone-call conditions.

Architecture Overview

The solution consists of three main components:

Azure Communication Services (ACS) – Handles telephony integration and call routing

Azure Voice Live API – Provides the speech-to-speech engine combining ASR (Automatic Speech Recognition), LLM, and TTS (Text-to-Speech) in a unified interface

Backend Service – Orchestrates the conversation flow and business logic, deployed on Azure Container Apps

When a user calls the service number, ACS routes the call to our backend service, which establishes a connection with the Voice Live API. The API handles the entire speech-to-speech pipeline, eliminating the need to manually chain together separate speech recognition, language model, and synthesis services.

Key Technical Decisions

Why Voice Live API?

Traditional voice agent implementations require orchestrating multiple services: speech-to-text for transcription, a language model for understanding and generation, and text-to-speech for responses. The Voice Live API consolidates these into a single endpoint with optimized latency, which was critical for maintaining natural conversation flow.

Deployment Strategy

We used Azure Developer CLI (azd) for infrastructure provisioning and deployment. This provided:

Consistent deployment across environments

Infrastructure-as-code through Bicep templates

Simplified resource management and teardown

The backend service runs as a containerized application on Azure Container Apps, providing automatic scaling and simplified management.

Testing Approach

The accelerator provides two client interfaces:

A web-based client for rapid testing during development using browser microphone/speaker

The ACS phone client for end-to-end testing with actual phone calls

This dual approach allowed us to iterate quickly during development while ensuring production-like testing before deployment.

Implementation Details

Conversation Flow

The agent follows this interaction pattern:

Initial Greeting – Agent introduces itself and asks about food preferences

Preference Collection – User describes cuisine type or specific preferences

Restaurant Recommendations – Agent provides a curated list of options

Detail Inquiries – User can ask about specific restaurants (menu, hours, location)

Action Completion – User can request table reservation or food delivery

The LLM backing the Voice Live API handles context management across this multi-turn conversation, maintaining awareness of previously mentioned restaurants and user preferences.

Configuration and Customization

The accelerator is designed for flexibility. Key configuration points include:

System prompts to define agent personality and behavior

Speech recognition settings for language and accuracy tuning

TTS voice selection from Azure’s neural voice library

Turn-taking behavior to control when the agent stops listening

For this restaurant use case, we customized the system prompt to include domain knowledge about restaurant types, common questions, and appropriate response formats.

Event Grid Integration

To receive incoming calls, we configured an Azure Event Grid subscription:

Event Type: IncomingCall

Endpoint: https://<container-app-url>/acs/incomingcall

This webhook triggers when someone dials the ACS phone number, initiating the voice agent session.

Deployment Process

The deployment was streamlined through the accelerator template:

# Initialize the project

azd init -t Azure-Samples/call-center-voice-agent-accelerator

# Authenticate to Azure

azd auth login

# Deploy all resources

azd up

The azd up command provisions:

Azure AI Speech Services with Voice Live API access

Azure Communication Services with phone number

Azure Container Apps for hosting the service

Azure Container Registry for image storage

Necessary networking and IAM configurations

We selected swedencentral as the deployment region due to Azure AI Foundry availability requirements.

Business Outcomes

The POC delivered clear, measurable business value beyond technical validation:

Faster Time-to-Validation
The accelerator-based approach allowed the customer to move from concept to live phone-based testing in days, not weeks, significantly reducing experimentation costs.

Lower Engineering Overhead
By avoiding custom orchestration of multiple speech and AI services, the team minimized glue code, reduced failure points, and simplified long-term maintenance.

Improved Customer Experience Potential
Sub-2-second response times enabled natural, interruption-free conversations critical for user adoption in voice-driven services.

Predictable Scaling Model
Usage-based pricing and serverless container scaling provided a clear cost-to-call model, making it easier to forecast production costs tied directly to call volume.

Clear Path to Production
The architecture established a solid foundation for future expansion, including analytics, state persistence, and escalation to human agents, without requiring a redesign.

Production Considerations

To move beyond POC, we identified several next steps:

Persistent conversation state across calls

Analytics and monitoring for intent tracking and agent performance

Fallback mechanisms for misunderstood requests

Load testing for peak call volumes

Regional rollout planning based on service availability

Why It Matters

This case shows that high-quality voice AI experiences don’t require complex, fragile architectures.

Leveraging Azure’s speech-to-speech capabilities and cloud-native services proved it’s possible to deliver natural, scalable voice interactions while keeping engineering effort and operational risk under control.

The result: a validated, production-ready direction for voice-based customer engagement, grounded in real-world performance.

Complete Cloud Command

By Challenge

By Industry

Inside an Enterprise Voice: Building a Restaurant Recommendation Agent with Azure Voice Live API

Industry

Region

Cloud Vendor

Solutions Used

The Challenge

What The Customer Needed

The Solution: Azure Voice Live API with Call Center Accelerator

Architecture Overview

Key Technical Decisions

Implementation Details

Deployment Process

Business Outcomes

Production Considerations

Why It Matters

Ready to start?

You May Also Like

GMT Modernizes AWS for Security, Scalability, and Speed

PayMe Modernizes AWS Infrastructure for Security, Cost, and Scale

Inside Bria AI’s Cloud Strategy: Scaling Visual GenAI with Azure ML

Solutions

Products

Resources

Company