Overview
Voice AI agents combine traditional agent capabilities with text-to-speech (TTS) technology to create immersive, conversational experiences. These agents can process text or voice input and respond with natural-sounding speech, making them ideal for customer support, educational content, and accessibility applications.
All voice agents in this collection use advanced TTS models from OpenAI, ElevenLabs, or Google for high-quality, natural-sounding voice output.
Voice RAG Systems
Voice RAG with OpenAI SDK
A voice-enabled Retrieval-Augmented Generation system that processes PDFs and responds with both text and voice.
Architecture:
PDF Documents
↓
Document Processing
├─ RecursiveCharacterTextSplitter
├─ FastEmbed embeddings
└─ Qdrant vector storage
↓
User Question (Text/Voice)
↓
Retrieval Pipeline
├─ Query embedding
├─ Vector similarity search
└─ Context retrieval
↓
Agent Processing
├─ Processing Agent (answer generation)
└─ TTS Agent (speech optimization)
↓
Voice Generation
├─ OpenAI TTS API
├─ Voice selection
└─ MP3 output
↓
User Response (Text + Audio)
Features:
Document Processing
PDF upload and parsing
Intelligent chunking
FastEmbed embeddings
Qdrant vector storage
Multiple document tracking
Voice Features
Multiple voice options
Real-time text-to-speech
Downloadable MP3 files
Spoken-word optimization
Natural speech patterns
RAG Pipeline
Query embedding
Similarity search
Context-aware responses
Source attribution
Document references
Agent Workflow
Processing agent for answers
TTS agent for speech optimization
Real-time audio streaming
Progress tracking
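The retrieval steps above reduce to embedding the query and ranking stored chunks by cosine similarity. A minimal sketch in plain Python, using toy 3-dimensional vectors in place of real FastEmbed output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Rank (chunk, vector) pairs by similarity and return the top chunks."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Toy index; the real app gets vectors from FastEmbed and stores them
# in Qdrant, but the ranking logic is the same.
index = [
    ("Refunds are processed in 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is in Berlin.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order ID.", [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.2, 0.0], index, top_k=2)
```

Qdrant performs the same ranking server-side over an approximate index, so it scales to far more chunks than this linear scan.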
cd voice_ai_agents/voice_rag_openaisdk
pip install -r requirements.txt
streamlit run rag_voice.py
Setup Requirements:
Create Environment File
# Create .env file
OPENAI_API_KEY='your-openai-api-key'
QDRANT_URL='your-qdrant-url'
QDRANT_API_KEY='your-qdrant-api-key'
Run Application
streamlit run rag_voice.py
Upload and Query
Upload PDF documents
Ask questions about content
Select preferred voice
Listen to or download responses
Voice Selection:
OpenAI provides multiple voice personalities:
alloy - Neutral and balanced
echo - Clear and articulate
fable - Warm and expressive
onyx - Deep and authoritative
nova - Friendly and energetic
shimmer - Soft and pleasant
Implementation Pattern:
import openai

from agno import Agent, OpenAI
from agno.storage import QdrantStorage
from agno.embeddings import FastEmbed

# Create processing agent
processing_agent = Agent(
    name="Document Processor",
    model=OpenAI(id="gpt-4o"),
    instructions="Generate clear, spoken-word friendly responses",
    storage=QdrantStorage(...)
)

# Create TTS agent
tts_agent = Agent(
    name="TTS Optimizer",
    model=OpenAI(id="gpt-4o"),
    instructions="Optimize text for natural speech synthesis"
)

# Generate and synthesize
response = processing_agent.run(query)
optimized = tts_agent.run(response)
audio = openai.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=optimized.content
)
The TTS agent optimizes responses for speech by adjusting pacing, adding appropriate pauses, and ensuring natural flow, which significantly improves audio quality.
Customer Support Agents
Customer Support Voice Agent
An OpenAI SDK-powered customer support agent that delivers voice responses to questions about your knowledge base.
System Architecture:
Documentation Website
↓
Firecrawl API (Web Crawling)
↓
Content Extraction & Processing
↓
Qdrant Vector Database
├─ FastEmbed embeddings
├─ Semantic indexing
└─ Efficient retrieval
↓
User Query
↓
AI Agent Team
├─ Documentation Processor Agent
│ └─ Analyzes docs and generates answers
├─ TTS Agent
│ └─ Optimizes for natural speech
└─ Voice synthesis
↓
Text + Voice Response
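The content-extraction step between crawling and vector storage amounts to splitting each crawled page into overlapping chunks before embedding. A hypothetical helper illustrating the idea (the app uses its own splitter; sizes here are arbitrary):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for a crawled page; the real text comes from Firecrawl.
page = "Firecrawl returns page text; each chunk is embedded and upserted into Qdrant. " * 5
chunks = chunk_text(page, chunk_size=100, overlap=20)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.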
Agent Team:
Documentation Processor
Role: Answer generation from the knowledge base
Capabilities:
Analyzes documentation content
Generates clear, concise responses
Provides context from sources
Handles follow-up questions
Maintains conversation context
Tools:
Qdrant vector search
FastEmbed for embeddings
GPT-4o for reasoning
TTS Agent
Role: Speech optimization
Capabilities:
Converts text to speech-friendly format
Adds appropriate pacing
Inserts natural pauses
Emphasizes key points
Ensures clarity
Output:
Natural-sounding speech with proper intonation and rhythm
Voice Customization:
Supports 11 OpenAI TTS voices:
alloy - Versatile, neutral tone
ash - Clear and professional
ballad - Expressive storytelling
coral - Warm and welcoming
echo - Articulate and clear
fable - Engaging narrator
onyx - Authoritative presence
nova - Energetic and friendly
sage - Calm and knowledgeable
shimmer - Gentle and soothing
verse - Conversational tone
cd voice_ai_agents/customer_support_voice_agent
pip install -r requirements.txt
streamlit run ai_voice_agent_docs.py
Setup and Usage:
Configure API Keys
Enter in sidebar:
OpenAI API key
Qdrant API key and URL
Firecrawl API key
Initialize Knowledge Base
Input documentation URL
Select preferred voice
Click “Initialize System”
Wait for crawling and indexing
Ask Questions
Type questions about the documentation
Receive text response
Listen to voice response
Download audio if needed
Features in Detail:
Knowledge Base Creation
Crawls documentation websites
Preserves document structure
Stores metadata
Supports up to 5 pages (configurable)
Vector Search
FastEmbed for embeddings
Semantic similarity search
Efficient document retrieval
Context-aware results
Voice Generation
High-quality TTS
Multiple voice options
Natural speech patterns
Proper pacing and emphasis
Interactive Interface
Clean Streamlit UI
Sidebar configuration
Real-time processing
Progress indicators
Audio player with download
Scaling Configuration: The default setup crawls 5 pages. To index more documentation, adjust the max_pages parameter in the Firecrawl configuration.
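That adjustment happens where the crawl is kicked off. A hedged sketch of the call site (crawl_docs is a hypothetical wrapper, and parameter names vary between firecrawl-py versions, so check your installed SDK):

```python
def crawl_docs(app, url, max_pages=5):
    """Crawl up to max_pages pages of a documentation site.

    `app` is expected to expose a Firecrawl-style crawl_url(url, params=...)
    method; the 'limit' key caps how many pages are crawled.
    """
    return app.crawl_url(url, params={"limit": max_pages})

# Example with the real SDK (illustrative, not the app's actual source):
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")
# pages = crawl_docs(app, "https://docs.example.com", max_pages=20)
```

Raising the limit increases crawl time and indexing cost roughly linearly, so scale it gradually.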
Audio Tour Agents
Self-Guided AI Audio Tour Agent
A conversational voice agent that generates immersive, self-guided audio tours based on location, interests, and duration.
Multi-Agent Architecture:
Orchestrator Agent
Coordinates the overall tour flow, manages transitions, and assembles content from the expert agents.
History Agent
Delivers historical narratives with an authoritative voice and detailed context.
Architecture Agent
Highlights architectural details, styles, and design elements with technical descriptions.
Culture Agent
Explores local customs, traditions, and artistic heritage with an enthusiastic tone.
Culinary Agent
Describes iconic dishes and food culture with a passionate, engaging voice.
Tour Generation Workflow:
User Input
├─ Location (e.g., "Paris, France")
├─ Interests (History, Architecture, Culture, Food)
├─ Duration (15, 30, or 60 minutes)
└─ Custom preferences
↓
Orchestrator Agent Planning
├─ Web search for location information
├─ Time allocation based on interests
├─ Content distribution planning
└─ Transition coordination
↓
Expert Agents Generate Content (Parallel)
├─ History Agent → Historical narratives
├─ Architecture Agent → Building descriptions
├─ Culture Agent → Cultural insights
└─ Culinary Agent → Food experiences
↓
Orchestrator Assembles Tour
├─ Weaves narratives together
├─ Adds transitions
├─ Ensures proper pacing
└─ Balances content by interest weights
↓
Voice Synthesis (GPT-4o Mini Audio)
├─ Different voices for each agent
├─ Natural transitions
├─ Expressive delivery
└─ Appropriate tone per topic
↓
Complete Audio Tour
Features:
Location-Aware
Dynamic Content Generation:
Based on user-input location
Real-time web search integration
Up-to-date information
Relevant local details
Personalization:
Filtered by interest categories
Weighted by user preferences
Customized depth of coverage
Tour Duration
Time Options:
15 minutes (quick highlights)
30 minutes (balanced coverage)
60 minutes (comprehensive tour)
Time Allocation:
Adapts to user interest weights
Considers location relevance
Ensures well-paced narratives
Proportioned sections
Voice Quality
Expressive Speech:
GPT-4o Mini Audio model
Agent-specific voices
Authoritative (History)
Technical (Architecture)
Enthusiastic (Culture)
Passionate (Culinary)
Natural Flow:
Smooth transitions
Appropriate pacing
Emotional variation
cd voice_ai_agents/ai_audio_tour_agent
pip install -r requirements.txt
streamlit run ai_audio_tour_agent.py
Usage Flow:
Enter Location
Specify the destination for your audio tour:
City name (e.g., “Rome”)
Landmark (e.g., “Eiffel Tower”)
Region (e.g., “Tuscany”)
Select Interests
Choose areas of interest:
History and heritage
Architecture and design
Culture and arts
Culinary experiences
Adjust weights for each category
Choose Duration
Select tour length:
15 min (highlights)
30 min (standard)
60 min (comprehensive)
Generate Tour
Agents research and create content
Orchestrator assembles narrative
Voice synthesis generates audio
Download or stream result
Example Tour Structure (30 minutes):
Introduction (2 min)
└─ Orchestrator sets scene
History Section (8 min)
├─ Ancient origins
├─ Key historical events
└─ Notable figures
Architecture Section (7 min)
├─ Iconic buildings
├─ Architectural styles
└─ Design elements
Culture Section (7 min)
├─ Local traditions
├─ Arts and music
└─ Cultural practices
Culinary Section (5 min)
├─ Signature dishes
├─ Food culture
└─ Dining experiences
Conclusion (1 min)
└─ Orchestrator wraps up
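Weight-proportional splits like the one above can be sketched as a small helper. This is a hypothetical illustration of the allocation logic; in the app the orchestrator's planning is LLM-driven rather than a fixed formula:

```python
def allocate_minutes(total, weights, intro=2, outro=1):
    """Split tour time across interest sections in proportion to their weights."""
    body = total - intro - outro
    weight_sum = sum(weights.values())
    minutes = {topic: round(body * w / weight_sum) for topic, w in weights.items()}
    # Give any rounding remainder to the highest-weighted topic.
    drift = body - sum(minutes.values())
    minutes[max(weights, key=weights.get)] += drift
    return {"intro": intro, **minutes, "outro": outro}

plan = allocate_minutes(30, {"history": 3, "architecture": 2.5,
                             "culture": 2.5, "culinary": 2})
```

With these example weights the 30-minute plan comes out as 2/8/7/7/5/1 minutes, matching the structure shown above.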
Best Practices:
Be specific with the location for better results
Adjust interest weights to personalize content
Start with 30-minute tours for a balanced experience
Use headphones for an immersive experience
Implementation Patterns
Basic Voice Agent
from agno import Agent, OpenAI
import openai

# Create agent with TTS capability
agent = Agent(
    name="Voice Assistant",
    model=OpenAI(id="gpt-4o"),
    instructions="Generate clear, conversational responses"
)

# Generate response
response = agent.run(user_query)

# Convert to speech
audio = openai.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=response.content
)

# Save or stream audio
audio.stream_to_file("response.mp3")
Voice RAG Pattern
from agno import Agent, OpenAI
from agno.storage import QdrantStorage
from agno.embeddings import FastEmbed

# RAG agent with voice output
rag_agent = Agent(
    name="Voice RAG",
    model=OpenAI(id="gpt-4o"),
    storage=QdrantStorage(
        collection="docs",
        embedder=FastEmbed()
    ),
    instructions="""
    Answer questions using the knowledge base.
    Format responses for natural speech:
    - Use conversational language
    - Add appropriate pauses with commas
    - Spell out acronyms on first use
    - Keep sentences clear and concise
    """
)

# Query with voice output
response = rag_agent.run(query)
audio = synthesize_speech(response.content)
Multi-Agent Voice System
# Specialized agents with different voices
history_agent = Agent(
    name="Historian",
    voice="onyx",  # Deep, authoritative
    instructions="Deliver historical narratives"
)

culture_agent = Agent(
    name="Culture Guide",
    voice="nova",  # Friendly, energetic
    instructions="Explore cultural experiences"
)

orchestrator = Agent(
    name="Tour Guide",
    team=[history_agent, culture_agent],
    voice="alloy",  # Neutral, balanced
    instructions="Coordinate tour narrative"
)

# Generate multi-voice tour
tour = orchestrator.run(tour_request)
Streaming Voice Responses
import asyncio

async def stream_voice_response(query):
    text_buffer = ""
    # Stream text response
    async for chunk in agent.run_stream(query):
        print(chunk.content, end="", flush=True)
        text_buffer += chunk.content
        # Generate audio for each complete sentence
        if chunk.content.endswith(('.', '!', '?')):
            audio = await synthesize_async(text_buffer)
            await play_audio(audio)
            text_buffer = ""
Voice Quality Optimization
Text Optimization for TTS
import re

def optimize_for_speech(text: str) -> str:
    """
    Optimize text for natural-sounding speech.
    """
    # Spell out acronyms
    text = text.replace("AI", "A.I.")
    text = text.replace("API", "A.P.I.")

    # Add pauses for readability
    text = text.replace(". ", ". ... ")

    # Remove markdown formatting
    text = re.sub(r'\*\*|\*|_', '', text)

    # Convert ordinals to words for clarity
    text = text.replace("1st", "first")
    text = text.replace("2nd", "second")

    return text
Voice Selection Guide
Customer Support - nova, coral (friendly, helpful)
Educational Content - sage, alloy (clear, authoritative)
Storytelling - fable, ballad (expressive, engaging)
Professional - onyx, echo (authoritative, clear)
Casual Conversation - shimmer, verse (natural, conversational)
Audio Quality Settings
# High-quality audio
audio = openai.audio.speech.create(
    model="tts-1-hd",  # HD quality
    voice="nova",
    input=text,
    response_format="mp3",
    speed=1.0  # Normal speed (0.25-4.0)
)

# Optimize for streaming
audio = openai.audio.speech.create(
    model="tts-1",  # Standard quality, faster
    voice="nova",
    input=text,
    response_format="opus"  # Better for streaming
)
Best Practices
Response Formatting
Use conversational language
Break into short sentences
Add natural pauses with punctuation
Spell out acronyms
Avoid complex markdown
Voice Selection
Match voice to use case
Test multiple options
Consider audience preferences
Use consistent voices for roles
Vary voices in multi-agent systems
Audio Processing
Use HD model for quality
Stream for long content
Cache generated audio
Provide download options
Add playback controls
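Caching pays off because the same (model, voice, text) triple always maps to the same synthesis request. A minimal file-based cache sketch (cached_tts and the synthesize callable are hypothetical names, not part of any app here):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")

def cached_tts(text, voice, model, synthesize):
    """Return MP3 bytes, calling synthesize() only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{model}:{voice}:{text}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.mp3"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice, model)  # e.g. wraps the OpenAI TTS call
    path.write_bytes(audio)
    return audio
```

Hashing the full text keeps keys short and filesystem-safe even for long responses.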
Error Handling
Handle API failures gracefully
Provide text fallback
Show loading indicators
Timeout long requests
Log audio generation issues
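Graceful degradation can be as simple as wrapping the TTS call and falling back to text only. A sketch under those assumptions (respond and synthesize are hypothetical names):

```python
import logging

logger = logging.getLogger("voice_agent")

def respond(text, synthesize):
    """Return (text, audio_or_None); a TTS failure never drops the answer."""
    try:
        audio = synthesize(text)
    except Exception as exc:  # TimeoutError, API errors, rate limits, ...
        logger.warning("TTS failed, returning text only: %s", exc)
        audio = None
    return text, audio
```

In the UI, a None audio value would hide the player and show only the text response.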
Cost Considerations:
TTS API costs per character
HD models cost more than standard
Cache audio when possible
Monitor usage in production
Consider rate limits
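Since TTS is billed per character, a pre-flight estimate is straightforward. The rates below are assumptions based on OpenAI's published per-1,000-character pricing at the time of writing; verify against current pricing before budgeting with them:

```python
# Assumed USD per 1,000 input characters; check current OpenAI pricing.
RATES = {"tts-1": 0.015, "tts-1-hd": 0.030}

def estimate_cost(text, model="tts-1"):
    """Rough USD cost of synthesizing `text` with the given TTS model."""
    return len(text) * RATES[model] / 1000

# Example: budget for ~2M characters of HD audio per month.
monthly = 2_000_000 * RATES["tts-1-hd"] / 1000
```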
Next Steps
MCP Agents - Add external service integration
Multi-Agent Teams - Build coordinated agent systems
Advanced Agents - Explore sophisticated implementations
Game Playing - Try adversarial agent systems