## Overview

The OpenAI Realtime API powers Agentic AI’s voice conversations with:

- Low-latency audio streaming - Bidirectional PCM audio via WebSocket
- Built-in Whisper STT - Accurate speech transcription, including proper nouns
- Natural TTS voices - Six voice options for assistant responses
- Server-side VAD - Automatic turn detection
## Architecture

### Audio Flow

1. User speaks → Twilio captures audio
2. μ-law → PCM → Audio bridge converts to 24 kHz PCM
3. Send to OpenAI → Streamed via WebSocket
4. Whisper transcription → Accurate text with proper nouns
5. AI responds → GPT-4o generates speech
6. Receive audio → 24 kHz PCM from OpenAI
7. PCM → μ-law → Converted back for Twilio
8. User hears → AI response on the phone
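Steps 2 and 7 above are format conversions between Twilio’s 8 kHz μ-law audio and the 16-bit PCM the Realtime API expects. A minimal, dependency-free sketch of the decode/upsample direction — the project’s real converter lives in `src/agenticai/audio/converter.py` and likely uses proper resampling rather than the naive sample repetition shown here:

```python
import struct

def ulaw_byte_to_pcm16(b: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    b = ~b & 0xFF                      # mu-law bytes are stored inverted
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def mulaw8k_to_pcm24k(mulaw: bytes) -> bytes:
    """Decode mu-law and naively upsample 8 kHz -> 24 kHz by tripling samples."""
    samples = [ulaw_byte_to_pcm16(byte) for byte in mulaw]
    tripled = [s for s in samples for _ in range(3)]   # 3x repetition = 24 kHz
    return struct.pack("<%dh" % len(tripled), *tripled)
```

The reverse direction (step 7) is the mirror image: average or decimate 24 kHz samples down to 8 kHz, then μ-law-encode each one.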
## Getting an API Key

### Create an OpenAI account

Sign up at platform.openai.com.

### Add a payment method

The Realtime API is not available on the free tier:

- Go to Settings → Billing
- Add a payment method
- Set a usage limit (e.g., $10/month)

### Generate an API key

Navigate to platform.openai.com/api-keys:

- Click Create new secret key
- Name it (e.g., “Agentic AI”)
- Copy the key (starts with `sk-proj-...`)
- Store it securely - it won’t be shown again
## Configuration

Add your OpenAI API key to `.env`, then reference it in `config.yaml`.
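The original snippets are missing here; a minimal sketch of what the two files typically contain — the `config.yaml` key names are assumptions, so check the project’s sample config for the exact names:

```bash
# .env
OPENAI_API_KEY=sk-proj-your-key-here
```

```yaml
# config.yaml (hypothetical key names)
openai:
  model: gpt-4o-realtime-preview
  voice: alloy
```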
## Voice Options

The Realtime API offers six voice presets:

| Voice | Description | Best For |
|---|---|---|
| alloy | Neutral, balanced tone | General purpose, professional |
| echo | Warm, conversational | Friendly customer service |
| fable | Expressive, storytelling | Engaging narratives |
| onyx | Deep, authoritative | Formal announcements |
| nova | Friendly, upbeat | Enthusiastic interactions |
| shimmer | Soft, gentle | Calm, soothing conversations |
### Changing Voice

Update `config.yaml` and restart:
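The original snippet is missing; a sketch assuming a hypothetical `openai.voice` key (check the project’s sample config for the real name):

```yaml
openai:
  voice: nova  # was: alloy
```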
## System Instructions

Customize the AI’s behavior with system instructions:
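The original snippet is missing; a hedged sketch, assuming instructions live under a hypothetical `openai.instructions` key:

```yaml
openai:
  instructions: |
    You are a helpful phone assistant. Keep answers short and
    conversational, and confirm before taking any action.
```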
## API Reference

### OpenAIRealtimeHandler

Location: `src/agenticai/openai/realtime_handler.py:17`
### Usage Example
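The original example is missing. As a stand-in, here is a self-contained sketch of the usage shape this page describes — the class name matches the handler above, but the constructor arguments and method names are assumptions, so treat it as illustrative only:

```python
import asyncio

class OpenAIRealtimeHandler:
    """Illustrative stub -- the real handler lives in
    src/agenticai/openai/realtime_handler.py and streams audio over a
    WebSocket. Method names here are assumptions, not the project's API."""

    def __init__(self, api_key: str, voice: str = "alloy"):
        self.api_key = api_key
        self.voice = voice
        self.sent_bytes = 0

    async def send_audio(self, pcm_chunk: bytes) -> None:
        # The real handler would forward this chunk to OpenAI.
        self.sent_bytes += len(pcm_chunk)

    async def close(self) -> None:
        # The real handler would close the WebSocket session.
        pass

async def demo() -> int:
    handler = OpenAIRealtimeHandler(api_key="sk-proj-...", voice="nova")
    # One 20 ms frame of 24 kHz 16-bit mono PCM = 24000 * 0.02 * 2 = 960 bytes.
    await handler.send_audio(b"\x00" * 960)
    await handler.close()
    return handler.sent_bytes

sent = asyncio.run(demo())
```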
### Event Types

The handler surfaces several Realtime API event types (the names below follow the OpenAI Realtime API event schema):

#### Audio Events

- `response.audio.delta` - a chunk of synthesized audio from the model
- `response.audio.done` - the model has finished speaking

#### Transcript Events

- `conversation.item.input_audio_transcription.completed` - Whisper transcript of the user’s speech
- `response.audio_transcript.delta` / `response.audio_transcript.done` - incremental transcript of the assistant’s audio

#### Turn Events

- `input_audio_buffer.speech_started` - server VAD detected the user speaking
- `input_audio_buffer.speech_stopped` - server VAD detected the end of speech
### Session Configuration

The handler automatically configures sessions with optimal settings.

Location: `src/agenticai/openai/realtime_handler.py:108`
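A sketch of what such a `session.update` message looks like on the wire, using the Realtime API’s session fields and the VAD values documented on this page (the exact settings the handler sends may differ):

```python
import json

# Shape of a Realtime API session.update message. The turn_detection values
# match the VAD settings documented on this page.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,
            "prefix_padding_ms": 200,
            "silence_duration_ms": 300,
        },
    },
}

# This is what would be sent over the WebSocket:
payload = json.dumps(session_update)
```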
## VAD (Voice Activity Detection)

Server-side VAD automatically detects when the user starts and stops speaking.

### Configuration

- `threshold`: `0.6` (higher = less sensitive)
- `prefix_padding_ms`: `200` (captures 200 ms of audio before speech starts)
- `silence_duration_ms`: `300` (300 ms of silence triggers turn end)

### Events

When server VAD fires, the API emits `input_audio_buffer.speech_started` and `input_audio_buffer.speech_stopped` events.
## Whisper Transcription

The Realtime API includes Whisper STT, which is noticeably more accurate than Gemini’s native transcription:

### Advantages

- Proper nouns - Correctly transcribes names, places, and brands
- Low latency - Transcription arrives alongside the audio chunks
- No extra cost - Included in Realtime API pricing
- No setup - Enabled automatically
### Example Accuracy

| User Says | Gemini Native | Whisper (Realtime) |
|---|---|---|
| "Call John Smith" | "Call john smith" | "Call John Smith" |
| "Open Spotify" | "Open spotify" | "Open Spotify" |
| "Email to [email protected]" | "Email to is tickle all" | "Email to [email protected]" |
### Implementation

Location: `src/agenticai/openai/realtime_handler.py:228`
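A minimal sketch of consuming the Whisper transcript events — the event name is the Realtime API’s; the handler’s actual implementation at the location above may differ:

```python
def collect_transcript(event: dict, transcripts: list) -> None:
    """Append completed Whisper transcripts from Realtime API events."""
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        transcripts.append(event.get("transcript", ""))

transcripts: list = []
collect_transcript(
    {
        "type": "conversation.item.input_audio_transcription.completed",
        "transcript": "Call John Smith",
    },
    transcripts,
)
collect_transcript({"type": "response.audio.delta"}, transcripts)  # ignored
```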
## Cost and Pricing

### Realtime API Pricing

- Audio input: $0.06 / minute
- Audio output: $0.24 / minute
- Text input: $5.00 / 1M tokens
- Text output: $20.00 / 1M tokens

### Example Costs

The table assumes a worst case in which the full call duration is billed as both input and output:

| Call Duration | Input Cost | Output Cost | Total |
|---|---|---|---|
| 1 minute | $0.06 | $0.24 | $0.30 |
| 5 minutes | $0.30 | $1.20 | $1.50 |
| 10 minutes | $0.60 | $2.40 | $3.00 |
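The arithmetic behind the table, as a tiny helper (worst case: every minute billed in both directions):

```python
AUDIO_IN_PER_MIN = 0.06   # $/minute of audio input
AUDIO_OUT_PER_MIN = 0.24  # $/minute of audio output

def worst_case_call_cost(minutes: float) -> float:
    """Upper-bound cost of a call, billing the full duration both ways."""
    return round(minutes * (AUDIO_IN_PER_MIN + AUDIO_OUT_PER_MIN), 2)
```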
## Cost Optimization

### Use shorter system instructions

System instructions are sent with every request, so every extra token in them is billed repeatedly.

### Batch audio chunks

The handler automatically batches up to 10 audio chunks per API call.

Location: `src/agenticai/openai/realtime_handler.py:169`

### Monitor usage

Check usage at platform.openai.com/usage, and set up billing alerts:

- Go to Settings → Billing → Limits
- Set a soft limit (e.g., $10)
- Set a hard limit (e.g., $20)
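The chunk batching described above can be sketched as follows (illustrative only — the real logic at `realtime_handler.py:169` may batch differently):

```python
def drain_batch(pending: list, max_chunks: int = 10) -> bytes:
    """Join up to max_chunks queued audio chunks into one payload."""
    batch = pending[:max_chunks]
    del pending[:max_chunks]
    return b"".join(batch)

queue = [b"\x00" * 100 for _ in range(12)]
first = drain_batch(queue)   # drains 10 chunks
second = drain_batch(queue)  # drains the remaining 2
```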
## Troubleshooting

### Connection fails

#### Check API key

Confirm the key in `.env` is set and matches the key shown at platform.openai.com/api-keys.

#### Verify Realtime access

The free tier does not include the Realtime API. Check:

- Go to Settings → Billing
- Verify a payment method is added
- Check that usage limits are set
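A quick sanity check for the key-format half of this (the `sk-` prefixes are the ones current OpenAI keys use; the environment-variable name is an assumption about this project):

```python
import os

def api_key_looks_valid(key) -> bool:
    """True if the key is present and has an OpenAI-style prefix."""
    return bool(key) and key.startswith(("sk-", "sk-proj-"))

# OPENAI_API_KEY as the variable name is an assumption, not confirmed by this doc.
ok = api_key_looks_valid(os.environ.get("OPENAI_API_KEY"))
```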
### No audio from AI

#### Check WebSocket connection

Confirm the WebSocket to OpenAI connected successfully and is still open.

#### Verify audio format

OpenAI expects 16-bit PCM. Check the conversion:

Location: `src/agenticai/audio/converter.py:1`

### Transcription inaccurate
#### Use Whisper transcripts

Ensure you’re using the Whisper transcripts, not Gemini’s, and check the logs to confirm which source is active.
### High latency

#### Check audio batching

The handler batches chunks for efficiency; monitor the logs to confirm that batching is happening.

#### Adjust VAD timing

Reduce the silence-detection duration for faster responses.

Location: `src/agenticai/openai/realtime_handler.py:122`

## Next Steps
- Conversation Brain - Understand intent analysis
- Gemini Integration - Alternative AI for intent detection
- Audio Pipeline - Learn the audio processing flow
- Voice Options - Explore all voice configurations