Overview
Voice AI agents combine traditional agent capabilities with text-to-speech (TTS) technology to create immersive, conversational experiences. These agents can process text or voice input and respond with natural-sounding speech, making them ideal for customer support, educational content, and accessibility applications.
All voice agents in this collection use advanced TTS models from OpenAI, ElevenLabs, or Google for high-quality, natural-sounding voice output.
Voice RAG Systems
Voice RAG with OpenAI SDK
A voice-enabled Retrieval-Augmented Generation system that processes PDFs and responds with both text and voice.
Architecture:
PDF Documents
↓
Document Processing
├─ RecursiveCharacterTextSplitter
├─ FastEmbed embeddings
└─ Qdrant vector storage
↓
User Question (Text/Voice)
↓
Retrieval Pipeline
├─ Query embedding
├─ Vector similarity search
└─ Context retrieval
↓
Agent Processing
├─ Processing Agent (answer generation)
└─ TTS Agent (speech optimization)
↓
Voice Generation
├─ OpenAI TTS API
├─ Voice selection
└─ MP3 output
↓
User Response (Text + Audio)
Features:
Document Processing
PDF upload and parsing
Intelligent chunking
FastEmbed embeddings
Qdrant vector storage
Multiple document tracking
Voice Features
Multiple voice options
Real-time text-to-speech
Downloadable MP3 files
Spoken-word optimization
Natural speech patterns
RAG Pipeline
Query embedding
Similarity search
Context-aware responses
Source attribution
Document references
Agent Workflow
Processing agent for answers
TTS agent for speech optimization
Real-time audio streaming
Progress tracking
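The retrieval steps above reduce to embedding the query and ranking stored chunks by cosine similarity. A minimal sketch in plain Python, using toy 3-dimensional vectors in place of real FastEmbed output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Rank (chunk, vector) pairs by similarity and return the top chunks."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Toy index; the real app gets vectors from FastEmbed and stores them
# in Qdrant, but the ranking logic is the same.
index = [
    ("Refunds are processed in 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is in Berlin.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order ID.", [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.2, 0.0], index, top_k=2)
```

Qdrant performs the same ranking server-side over an approximate index, so it scales to far more chunks than this linear scan.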
cd voice_ai_agents/voice_rag_openaisdk
pip install -r requirements.txt
streamlit run rag_voice.py
Setup Requirements:
Create Environment File
# Create .env file
OPENAI_API_KEY='your-openai-api-key'
QDRANT_URL='your-qdrant-url'
QDRANT_API_KEY='your-qdrant-api-key'
Run Application
streamlit run rag_voice.py
Upload and Query
Upload PDF documents
Ask questions about content
Select preferred voice
Listen to or download responses
Voice Selection:
OpenAI provides multiple voice personalities:
alloy - Neutral and balanced
echo - Clear and articulate
fable - Warm and expressive
onyx - Deep and authoritative
nova - Friendly and energetic
shimmer - Soft and pleasant
Implementation Pattern:
import openai

from agno import Agent, OpenAI
from agno.storage import QdrantStorage
from agno.embeddings import FastEmbed

# Create processing agent
processing_agent = Agent(
    name="Document Processor",
    model=OpenAI(id="gpt-4o"),
    instructions="Generate clear, spoken-word friendly responses",
    storage=QdrantStorage(...)
)

# Create TTS agent
tts_agent = Agent(
    name="TTS Optimizer",
    model=OpenAI(id="gpt-4o"),
    instructions="Optimize text for natural speech synthesis"
)

# Generate and synthesize
response = processing_agent.run(query)
optimized = tts_agent.run(response)
audio = openai.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=optimized.content
)
The TTS agent optimizes responses for speech by adjusting pacing, adding appropriate pauses, and ensuring natural flow, which significantly improves audio quality.
Customer Support Agents
Customer Support Voice Agent
An OpenAI SDK-powered customer support agent that delivers voice responses to questions about your knowledge base.
System Architecture:
Documentation Website
↓
Firecrawl API (Web Crawling)
↓
Content Extraction & Processing
↓
Qdrant Vector Database
├─ FastEmbed embeddings
├─ Semantic indexing
└─ Efficient retrieval
↓
User Query
↓
AI Agent Team
├─ Documentation Processor Agent
│ └─ Analyzes docs and generates answers
├─ TTS Agent
│ └─ Optimizes for natural speech
└─ Voice synthesis
↓
Text + Voice Response
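The content-extraction step between crawling and vector storage amounts to splitting each crawled page into overlapping chunks before embedding. A hypothetical helper illustrating the idea (the app uses its own splitter; sizes here are arbitrary):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for a crawled page; the real text comes from Firecrawl.
page = "Firecrawl returns page text; each chunk is embedded and upserted into Qdrant. " * 5
chunks = chunk_text(page, chunk_size=100, overlap=20)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.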
Agent Team:
Documentation Processor
Role: Answer generation from the knowledge base
Capabilities:
Analyzes documentation content
Generates clear, concise responses
Provides context from sources
Handles follow-up questions
Maintains conversation context
Tools:
Qdrant vector search
FastEmbed for embeddings
GPT-4o for reasoning
TTS Agent
Role: Speech optimization
Capabilities:
Converts text to speech-friendly format
Adds appropriate pacing
Inserts natural pauses
Emphasizes key points
Ensures clarity
Output:
Natural-sounding speech with proper intonation and rhythm
Voice Customization:
Supports 11 OpenAI TTS voices:
alloy - Versatile, neutral tone
ash - Clear and professional
ballad - Expressive storytelling
coral - Warm and welcoming
echo - Articulate and clear
fable - Engaging narrator
onyx - Authoritative presence
nova - Energetic and friendly
sage - Calm and knowledgeable
shimmer - Gentle and soothing
verse - Conversational tone
cd voice_ai_agents/customer_support_voice_agent
pip install -r requirements.txt
streamlit run ai_voice_agent_docs.py
Setup and Usage:
Configure API Keys
Enter in sidebar:
OpenAI API key
Qdrant API key and URL
Firecrawl API key
Initialize Knowledge Base
Input documentation URL
Select preferred voice
Click “Initialize System”
Wait for crawling and indexing
Ask Questions
Type questions about the documentation
Receive text response
Listen to voice response
Download audio if needed
Features in Detail:
Knowledge Base Creation
Crawls documentation websites
Preserves document structure
Stores metadata
Supports up to 5 pages (configurable)
Vector Search
FastEmbed for embeddings
Semantic similarity search
Efficient document retrieval
Context-aware results
Voice Generation
High-quality TTS
Multiple voice options
Natural speech patterns
Proper pacing and emphasis
Interactive Interface
Clean Streamlit UI
Sidebar configuration
Real-time processing
Progress indicators
Audio player with download
Scaling Configuration: The default setup crawls 5 pages. To index more documentation, adjust the max_pages parameter in the Firecrawl configuration.
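That adjustment happens where the crawl is kicked off. A hedged sketch of the call site (crawl_docs is a hypothetical wrapper, and parameter names vary between firecrawl-py versions, so check your installed SDK):

```python
def crawl_docs(app, url, max_pages=5):
    """Crawl up to max_pages pages of a documentation site.

    `app` is expected to expose a Firecrawl-style crawl_url(url, params=...)
    method; the 'limit' key caps how many pages are crawled.
    """
    return app.crawl_url(url, params={"limit": max_pages})

# Example with the real SDK (illustrative, not the app's actual source):
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")
# pages = crawl_docs(app, "https://docs.example.com", max_pages=20)
```

Raising the limit increases crawl time and indexing cost roughly linearly, so scale it gradually.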
Audio Tour Agents
Self-Guided AI Audio Tour Agent
A conversational voice agent that generates immersive, self-guided audio tours based on location, interests, and duration.
Multi-Agent Architecture:
Orchestrator Agent
Coordinates the overall tour flow, manages transitions, and assembles content from the expert agents.
History Agent
Delivers historical narratives with an authoritative voice and detailed context.
Architecture Agent
Highlights architectural details, styles, and design elements with technical descriptions.
Culture Agent
Explores local customs, traditions, and artistic heritage with an enthusiastic tone.
Culinary Agent
Describes iconic dishes and food culture with a passionate, engaging voice.
Tour Generation Workflow:
User Input
├─ Location (e.g., "Paris, France")
├─ Interests (History, Architecture, Culture, Food)
├─ Duration (15, 30, or 60 minutes)
└─ Custom preferences
↓
Orchestrator Agent Planning
├─ Web search for location information
├─ Time allocation based on interests
├─ Content distribution planning
└─ Transition coordination
↓
Expert Agents Generate Content (Parallel)
├─ History Agent → Historical narratives
├─ Architecture Agent → Building descriptions
├─ Culture Agent → Cultural insights
└─ Culinary Agent → Food experiences
↓
Orchestrator Assembles Tour
├─ Weaves narratives together
├─ Adds transitions
├─ Ensures proper pacing
└─ Balances content by interest weights
↓
Voice Synthesis (GPT-4o Mini Audio)
├─ Different voices for each agent
├─ Natural transitions
├─ Expressive delivery
└─ Appropriate tone per topic
↓
Complete Audio Tour
Features:
Location-Aware
Dynamic Content Generation:
Based on user-input location
Real-time web search integration
Up-to-date information
Relevant local details
Personalization:
Filtered by interest categories
Weighted by user preferences
Customized depth of coverage
Tour Duration
Time Options:
15 minutes (quick highlights)
30 minutes (balanced coverage)
60 minutes (comprehensive tour)
Time Allocation:
Adapts to user interest weights
Considers location relevance
Ensures well-paced narratives
Proportioned sections
Voice Quality
Expressive Speech:
GPT-4o Mini Audio model
Agent-specific voices
Authoritative (History)
Technical (Architecture)
Enthusiastic (Culture)
Passionate (Culinary)
Natural Flow:
Smooth transitions
Appropriate pacing
Emotional variation
cd voice_ai_agents/ai_audio_tour_agent
pip install -r requirements.txt
streamlit run ai_audio_tour_agent.py
Usage Flow:
Enter Location
Specify the destination for your audio tour:
City name (e.g., “Rome”)
Landmark (e.g., “Eiffel Tower”)
Region (e.g., “Tuscany”)
Select Interests
Choose areas of interest:
History and heritage
Architecture and design
Culture and arts
Culinary experiences
Adjust weights for each category
Choose Duration
Select tour length:
15 min (highlights)
30 min (standard)
60 min (comprehensive)
Generate Tour
Agents research and create content
Orchestrator assembles narrative
Voice synthesis generates audio
Download or stream result
Example Tour Structure (30 minutes):
Introduction (2 min)
└─ Orchestrator sets scene
History Section (8 min)
├─ Ancient origins
├─ Key historical events
└─ Notable figures
Architecture Section (7 min)
├─ Iconic buildings
├─ Architectural styles
└─ Design elements
Culture Section (7 min)
├─ Local traditions
├─ Arts and music
└─ Cultural practices
Culinary Section (5 min)
├─ Signature dishes
├─ Food culture
└─ Dining experiences
Conclusion (1 min)
└─ Orchestrator wraps up
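Weight-proportional splits like the one above can be sketched as a small helper. This is a hypothetical illustration of the allocation logic; in the app the orchestrator's planning is LLM-driven rather than a fixed formula:

```python
def allocate_minutes(total, weights, intro=2, outro=1):
    """Split tour time across interest sections in proportion to their weights."""
    body = total - intro - outro
    weight_sum = sum(weights.values())
    minutes = {topic: round(body * w / weight_sum) for topic, w in weights.items()}
    # Give any rounding remainder to the highest-weighted topic.
    drift = body - sum(minutes.values())
    minutes[max(weights, key=weights.get)] += drift
    return {"intro": intro, **minutes, "outro": outro}

plan = allocate_minutes(30, {"history": 3, "architecture": 2.5,
                             "culture": 2.5, "culinary": 2})
```

With these example weights the 30-minute plan comes out as 2/8/7/7/5/1 minutes, matching the structure shown above.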
Best Practices:
Be specific with the location for better results
Adjust interest weights to personalize content
Start with 30-minute tours for a balanced experience
Use headphones for an immersive experience
Implementation Patterns
Basic Voice Agent
from agno import Agent, OpenAI
import openai

# Create agent with TTS capability
agent = Agent(
    name="Voice Assistant",
    model=OpenAI(id="gpt-4o"),
    instructions="Generate clear, conversational responses"
)

# Generate response
response = agent.run(user_query)

# Convert to speech
audio = openai.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=response.content
)

# Save or stream audio
audio.stream_to_file("response.mp3")
Voice RAG Pattern
from agno import Agent, OpenAI
from agno.storage import QdrantStorage
from agno.embeddings import FastEmbed

# RAG agent with voice output
rag_agent = Agent(
    name="Voice RAG",
    model=OpenAI(id="gpt-4o"),
    storage=QdrantStorage(
        collection="docs",
        embedder=FastEmbed()
    ),
    instructions="""
    Answer questions using the knowledge base.
    Format responses for natural speech:
    - Use conversational language
    - Add appropriate pauses with commas
    - Spell out acronyms on first use
    - Keep sentences clear and concise
    """
)

# Query with voice output
response = rag_agent.run(query)
audio = synthesize_speech(response.content)
Multi-Agent Voice System
# Specialized agents with different voices
history_agent = Agent(
    name="Historian",
    voice="onyx",  # Deep, authoritative
    instructions="Deliver historical narratives"
)

culture_agent = Agent(
    name="Culture Guide",
    voice="nova",  # Friendly, energetic
    instructions="Explore cultural experiences"
)

orchestrator = Agent(
    name="Tour Guide",
    team=[history_agent, culture_agent],
    voice="alloy",  # Neutral, balanced
    instructions="Coordinate tour narrative"
)

# Generate multi-voice tour
tour = orchestrator.run(tour_request)
Streaming Voice Responses
import asyncio

async def stream_voice_response(query):
    text_buffer = ""
    # Stream text response
    async for chunk in agent.run_stream(query):
        print(chunk.content, end="", flush=True)
        text_buffer += chunk.content
        # Generate audio for each complete sentence
        if chunk.content.endswith(('.', '!', '?')):
            audio = await synthesize_async(text_buffer)
            await play_audio(audio)
            text_buffer = ""
Voice Quality Optimization
Text Optimization for TTS
import re

def optimize_for_speech(text: str) -> str:
    """
    Optimize text for natural-sounding speech.
    """
    # Spell out acronyms
    text = text.replace("AI", "A.I.")
    text = text.replace("API", "A.P.I.")

    # Add pauses for readability
    text = text.replace(". ", ". ... ")

    # Remove markdown formatting
    text = re.sub(r'\*\*|\*|_', '', text)

    # Convert ordinals to words for clarity
    text = text.replace("1st", "first")
    text = text.replace("2nd", "second")

    return text
Voice Selection Guide
Customer Support - nova, coral (friendly, helpful)
Educational Content - sage, alloy (clear, authoritative)
Storytelling - fable, ballad (expressive, engaging)
Professional - onyx, echo (authoritative, clear)
Casual Conversation - shimmer, verse (natural, conversational)
Audio Quality Settings
# High-quality audio
audio = openai.audio.speech.create(
    model="tts-1-hd",  # HD quality
    voice="nova",
    input=text,
    response_format="mp3",
    speed=1.0  # Normal speed (0.25-4.0)
)

# Optimize for streaming
audio = openai.audio.speech.create(
    model="tts-1",  # Standard quality, faster
    voice="nova",
    input=text,
    response_format="opus"  # Better for streaming
)
Best Practices
Response Formatting
Use conversational language
Break into short sentences
Add natural pauses with punctuation
Spell out acronyms
Avoid complex markdown
Voice Selection
Match voice to use case
Test multiple options
Consider audience preferences
Use consistent voices for roles
Vary voices in multi-agent systems
Audio Processing
Use HD model for quality
Stream for long content
Cache generated audio
Provide download options
Add playback controls
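Caching pays off because the same (model, voice, text) triple always maps to the same synthesis request. A minimal file-based cache sketch (cached_tts and the synthesize callable are hypothetical names, not part of any app here):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")

def cached_tts(text, voice, model, synthesize):
    """Return MP3 bytes, calling synthesize() only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{model}:{voice}:{text}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.mp3"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice, model)  # e.g. wraps the OpenAI TTS call
    path.write_bytes(audio)
    return audio
```

Hashing the full text keeps keys short and filesystem-safe even for long responses.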
Error Handling
Handle API failures gracefully
Provide text fallback
Show loading indicators
Timeout long requests
Log audio generation issues
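Graceful degradation can be as simple as wrapping the TTS call and falling back to text only. A sketch under those assumptions (respond and synthesize are hypothetical names):

```python
import logging

logger = logging.getLogger("voice_agent")

def respond(text, synthesize):
    """Return (text, audio_or_None); a TTS failure never drops the answer."""
    try:
        audio = synthesize(text)
    except Exception as exc:  # TimeoutError, API errors, rate limits, ...
        logger.warning("TTS failed, returning text only: %s", exc)
        audio = None
    return text, audio
```

In the UI, a None audio value would hide the player and show only the text response.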
Cost Considerations:
TTS API costs per character
HD models cost more than standard
Cache audio when possible
Monitor usage in production
Consider rate limits
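Since TTS is billed per character, a pre-flight estimate is straightforward. The rates below are assumptions based on OpenAI's published per-1,000-character pricing at the time of writing; verify against current pricing before budgeting with them:

```python
# Assumed USD per 1,000 input characters; check current OpenAI pricing.
RATES = {"tts-1": 0.015, "tts-1-hd": 0.030}

def estimate_cost(text, model="tts-1"):
    """Rough USD cost of synthesizing `text` with the given TTS model."""
    return len(text) * RATES[model] / 1000

# Example: budget for ~2M characters of HD audio per month.
monthly = 2_000_000 * RATES["tts-1-hd"] / 1000
```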
Next Steps
MCP Agents - Add external service integration
Multi-Agent Teams - Build coordinated agent systems
Advanced Agents - Explore sophisticated implementations
Game Playing - Try adversarial agent systems