Overview

The OpenAI Realtime API powers Agentic AI’s voice conversations with:
  • Low-latency audio streaming - Bidirectional PCM audio via WebSocket
  • Built-in Whisper STT - Accurate speech transcription including proper nouns
  • Natural TTS voices - Six voice options for assistant responses
  • Server-side VAD - Automatic turn detection
This integration replaces the Gemini Live API for better transcription accuracy, especially with proper nouns such as names and places.

Architecture

┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Twilio    │────▶│   Audio Bridge   │────▶│  OpenAI Realtime│
│  (μ-law)    │     │  (PCM converter) │     │  (gpt-4o)       │
└─────────────┘     └──────────────────┘     └─────────────────┘
                           │                          │
                           ▼                          ▼
                    Conversation Brain        Whisper Transcripts
                    (Intent Analysis)        (Accurate proper nouns)

Audio Flow

  1. User speaks → Twilio captures audio
  2. μ-law → PCM → Audio bridge converts to PCM 24kHz
  3. Send to OpenAI → Streamed via WebSocket
  4. Whisper transcription → Accurate text with proper nouns
  5. AI responds → GPT-4o generates speech
  6. Receive audio → PCM 24kHz from OpenAI
  7. PCM → μ-law → Converted back for Twilio
  8. User hears → AI response on phone
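
As an illustration of steps 2 and 7, μ-law decoding and naive upsampling can be sketched in pure Python. This is a simplified sketch, not the project's actual converter (which lives in src/agenticai/audio/converter.py); real code would use a proper resampling filter rather than sample repetition:

```python
import struct

def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)

def mulaw8k_to_pcm24k(mulaw: bytes) -> bytes:
    """Twilio mu-law 8 kHz -> PCM 16-bit 24 kHz (illustrative sketch)."""
    samples = [ulaw_byte_to_pcm16(b) for b in mulaw]
    # Naive 3x upsampling by sample repetition; a real converter would interpolate/filter
    out = []
    for s in samples:
        out.extend([s, s, s])
    return struct.pack(f"<{len(out)}h", *out)
```

The reverse direction (step 7) is the mirror image: downsample 24 kHz to 8 kHz, then apply mu-law encoding.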

Getting an API Key

Step 1: Create OpenAI account

Step 2: Add payment method

The Realtime API is not available on the free tier:
  1. Go to Settings → Billing
  2. Add a payment method
  3. Set a usage limit (e.g., $10/month)

Step 3: Generate API key

Navigate to platform.openai.com/api-keys:
  1. Click Create new secret key
  2. Name it (e.g., “Agentic AI”)
  3. Copy the key (starts with sk-proj-...)
  4. Store it securely - it won’t be shown again

Step 4: Check access

Verify you have Realtime API access:
curl https://api.openai.com/v1/models/gpt-4o-realtime-preview-2024-12-17 \
  -H "Authorization: Bearer YOUR_API_KEY"
Should return model details (not an error).

Configuration

Add your OpenAI API key to .env:
.env
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Configure the Realtime API in config.yaml:
config.yaml
openai_realtime:
  enabled: true
  api_key: ${OPENAI_API_KEY}
  model: "gpt-4o-realtime-preview-2024-12-17"
  voice: "alloy"  # Options: alloy, echo, fable, onyx, nova, shimmer

# Disable standalone Whisper (Realtime includes it)
whisper:
  enabled: false
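
The `${OPENAI_API_KEY}` placeholder above implies the config loader expands environment variables at load time. A minimal sketch of that expansion (the project's actual loader may differ):

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{(\w+)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values (empty string if unset)."""
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)
```

With `OPENAI_API_KEY` set in the environment (e.g., via .env), `expand_env("${OPENAI_API_KEY}")` returns the key itself.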

Voice Options

The Realtime API offers six voice presets:
Voice     Description               Best For
alloy     Neutral, balanced tone    General purpose, professional
echo      Warm, conversational      Friendly customer service
fable     Expressive, storytelling  Engaging narratives
onyx      Deep, authoritative       Formal announcements
nova      Friendly, upbeat          Enthusiastic interactions
shimmer   Soft, gentle              Calm, soothing conversations

Changing Voice

Update config.yaml and restart:
config.yaml
openai_realtime:
  voice: "nova"  # Switch to friendly, upbeat voice
Or set dynamically in code:
from agenticai.openai.realtime_handler import OpenAIRealtimeHandler

handler = OpenAIRealtimeHandler(
    api_key=api_key,
    voice="echo",  # Warm, conversational
    system_instruction="You are a helpful assistant."
)

System Instructions

Customize the AI’s behavior with system instructions:
config.yaml
openai_realtime:
  system_instruction: |
    You are a professional AI assistant making phone calls.
    
    Guidelines:
    - Be concise and clear (phone conversations)
    - Listen carefully before responding
    - Ask clarifying questions when needed
    - Summarize key points at the end
    - Stay professional and friendly

API Reference

OpenAIRealtimeHandler

Location: src/agenticai/openai/realtime_handler.py:17
class OpenAIRealtimeHandler:
    def __init__(
        self,
        api_key: str,
        model: str = "gpt-4o-realtime-preview-2024-12-17",
        voice: str = "alloy",
        system_instruction: str = "You are a helpful AI assistant.",
    ):
        """Initialize OpenAI Realtime handler."""

    def set_callbacks(
        self,
        on_audio: Callable[[bytes], Awaitable[None]] | None = None,
        on_transcript: Callable[[str, bool], Awaitable[None]] | None = None,
        on_user_transcript: Callable[[str], Awaitable[None]] | None = None,
        on_turn_complete: Callable[[], Awaitable[None]] | None = None,
    ):
        """Set event callbacks."""

    async def connect(self, initial_prompt: str | None = None):
        """Connect to OpenAI Realtime API."""

    async def send_audio(self, audio_data: bytes):
        """Send PCM 16-bit audio to OpenAI.
        
        Args:
            audio_data: PCM 16-bit 24kHz audio bytes
        """

    async def send_text(self, text: str, end_of_turn: bool = True):
        """Send text for OpenAI to respond to.
        
        Used to inject ClawdBot responses for the AI to speak.
        """

    async def get_audio(self) -> bytes:
        """Get audio chunk from OpenAI.
        
        Returns:
            PCM 16-bit 24kHz audio bytes
        """

    async def disconnect(self):
        """Disconnect from OpenAI Realtime API."""

Usage Example

import asyncio
from agenticai.openai.realtime_handler import OpenAIRealtimeHandler

async def main():
    handler = OpenAIRealtimeHandler(
        api_key="sk-proj-xxx",
        voice="alloy",
    )

    # Set callbacks (the handler awaits these, so they must be async callables)
    async def on_audio(audio: bytes):
        print(f"Received {len(audio)} bytes")

    async def on_user_transcript(text: str):
        print(f"User: {text}")

    async def on_transcript(text: str, is_final: bool):
        print(f"AI: {text}")

    handler.set_callbacks(
        on_audio=on_audio,
        on_user_transcript=on_user_transcript,
        on_transcript=on_transcript,
    )

    # Connect
    await handler.connect(initial_prompt="Hello! How can I help you today?")

    # Send audio chunks
    while True:
        audio_chunk = await get_audio_from_microphone()  # Your audio source
        await handler.send_audio(audio_chunk)

        # Receive AI audio
        try:
            ai_audio = await asyncio.wait_for(handler.get_audio(), timeout=0.1)
            await play_audio(ai_audio)  # Your audio output
        except asyncio.TimeoutError:
            pass

asyncio.run(main())

Event Types

The handler emits several event types:

Audio Events

async def on_audio(audio_bytes: bytes):
    """Called when AI generates audio.
    
    Args:
        audio_bytes: PCM 16-bit 24kHz audio
    """

Transcript Events

async def on_transcript(text: str, is_final: bool):
    """Called when AI speaks (transcript of audio).
    
    Args:
        text: Transcript text
        is_final: True if complete, False if partial
    """

async def on_user_transcript(text: str):
    """Called when user speech is transcribed (Whisper).
    
    Args:
        text: Accurate transcription with proper nouns
    """

Turn Events

async def on_turn_complete():
    """Called when AI finishes speaking."""

async def on_user_turn_complete():
    """Called when user finishes speaking (VAD detected)."""

Session Configuration

The handler automatically configures sessions with optimal settings.
Location: src/agenticai/openai/realtime_handler.py:108
config = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "instructions": system_instruction,
        "voice": voice,
        "input_audio_format": "pcm16",      # 16-bit PCM
        "output_audio_format": "pcm16",     # 16-bit PCM
        "input_audio_transcription": {
            "model": "whisper-1"              # Built-in Whisper
        },
        "turn_detection": {
            "type": "server_vad",             # Server-side VAD
            "threshold": 0.6,                 # Sensitivity
            "prefix_padding_ms": 200,         # Pre-speech buffer
            "silence_duration_ms": 300,       # End-of-speech timeout
        },
    }
}

VAD (Voice Activity Detection)

Server-side VAD automatically detects when the user starts and stops speaking:

Configuration

  • threshold: 0.6 (higher = less sensitive)
  • prefix_padding_ms: 200 (captures 200ms before speech)
  • silence_duration_ms: 300 (300ms silence triggers turn end)

Events

User starts speaking  → input_audio_buffer.speech_started
User stops speaking   → input_audio_buffer.speech_stopped
Transcription ready   → conversation.item.input_audio_transcription.completed

Whisper Transcription

The Realtime API includes Whisper STT with superior accuracy:

Advantages

  • Proper nouns - Correctly transcribes names, places, brands
  • Low latency - Transcription arrives with audio chunks
  • No extra cost - Included in Realtime API pricing
  • No setup - Automatically enabled

Example Accuracy

User Says                       Gemini Native              Whisper (Realtime)
“Call John Smith”               “Call john smith”          “Call John Smith”
“Open Spotify”                  “Open spotify”             “Open Spotify”
“Email to [email protected]”    “Email to is tickle all”   “Email to [email protected]”

Implementation

Location: src/agenticai/openai/realtime_handler.py:228
elif event_type == "conversation.item.input_audio_transcription.completed":
    # User transcript (Whisper)
    transcript = event.get("transcript", "")
    if transcript:
        print(f"=== OPENAI USER TRANSCRIPT: {transcript} ===")
        if self._on_user_transcript:
            await self._on_user_transcript(transcript)

Cost and Pricing

Realtime API Pricing

  • Audio Input: $0.06 / minute
  • Audio Output: $0.24 / minute
  • Text Input: $5.00 / 1M tokens
  • Text Output: $20.00 / 1M tokens

Example Costs

Call Duration   Input Cost   Output Cost   Total
1 minute        $0.06        $0.24         $0.30
5 minutes       $0.30        $1.20         $1.50
10 minutes      $0.60        $2.40         $3.00
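
The totals above follow directly from the per-minute audio rates. A quick sanity-check calculator, assuming input and output audio both run the full call duration (which overestimates typical usage, since the AI is not always speaking):

```python
AUDIO_INPUT_PER_MIN = 0.06   # USD per minute of audio input
AUDIO_OUTPUT_PER_MIN = 0.24  # USD per minute of audio output

def call_cost(minutes: float) -> float:
    """Estimated audio cost of a call, worst case (both directions active throughout)."""
    return round(minutes * (AUDIO_INPUT_PER_MIN + AUDIO_OUTPUT_PER_MIN), 2)
```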

Cost Optimization

System instructions are sent with every request:
# Bad (verbose)
system_instruction: |
  You are a highly professional AI assistant who specializes in...
  [500 words]

# Good (concise)
system_instruction: |
  Professional AI assistant. Be concise and helpful.
The handler automatically batches up to 10 audio chunks per API call.
Location: src/agenticai/openai/realtime_handler.py:169
# Batch multiple chunks if available
audio_buffer = audio_data
batch_count = 1
while batch_count < 10:
    try:
        extra = self.audio_out_queue.get_nowait()
        audio_buffer += extra
        batch_count += 1
    except asyncio.QueueEmpty:
        break
Check usage at platform.openai.com/usage. Set up billing alerts:
  1. Go to Settings → Billing → Limits
  2. Set a soft limit (e.g., $10)
  3. Set a hard limit (e.g., $20)

Troubleshooting

Connection fails

# Verify key format
echo $OPENAI_API_KEY  # Should start with sk-proj-

# Test authentication
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
The free tier does not include the Realtime API. Check:
  1. Go to Settings → Billing
  2. Verify a payment method is added
  3. Check that usage limits are set

No audio from AI

agenticai service logs -f | grep "OPENAI"
Should see:
=== OPENAI: Connected to Realtime API ===
=== OPENAI: First audio chunk received! ===
OpenAI expects PCM 16-bit. Check the conversion:
Location: src/agenticai/audio/converter.py:1
# Twilio μ-law → PCM for OpenAI
pcm_audio = convert_mulaw_to_pcm(mulaw_audio)
await openai_handler.send_audio(pcm_audio)

Transcription inaccurate

Ensure you’re using Whisper transcripts, not Gemini:
handler.set_callbacks(
    on_user_transcript=handle_whisper_transcript,  # Accurate
)
Check logs:
=== OPENAI USER TRANSCRIPT: ... ===  # Good (Whisper)
=== GEMINI USER TRANSCRIPT: ... ===  # Less accurate

High latency

The handler batches chunks for efficiency. Monitor logs:
agenticai service logs -f | grep "Sent.*audio chunks"
Should see batching:
=== OPENAI: Sent 100 audio chunks ===
Reduce silence detection for faster responses:
Location: src/agenticai/openai/realtime_handler.py:122
"turn_detection": {
    "silence_duration_ms": 200,  # Faster (was 300)
}

Next Steps

  • Conversation Brain - Understand intent analysis
  • Gemini Integration - Alternative AI for intent detection
  • Audio Pipeline - Learn audio processing flow
  • Voice Options - Explore all voice configurations
