Overview
LLM API costs are directly tied to token count. These optimization tools help you reduce token usage while maintaining accuracy, enabling cost-effective AI applications at scale.
TOON Format 63.9% average token reduction for structured data
Headroom 47-92% token savings through intelligent compression
Why Optimize?
Cost Savings
Performance
Scale
Impact on API Costs

Based on GPT-4 pricing ($0.03/1K input tokens):

| Usage Volume | Standard Cost | With Optimization (60% reduction) | Savings |
|---|---|---|---|
| 1,000 calls | $2.55 | $1.02 | $1.53 |
| 100,000 calls | $255.00 | $102.00 | $153.00 |
| 1M calls | $2,550.00 | $1,020.00 | $1,530.00 |
| 10M calls | $25,500.00 | $10,200.00 | $15,300.00 |
Speed & Efficiency Benefits
Faster Processing : Fewer tokens = shorter generation time
Reduced Latency : Smaller payloads transmit faster
Better Throughput : Process more requests per second
Context Window : Fit more information in limited windows
Production at Scale For a production app with 100K daily requests:
Monthly savings : $4,590
Yearly savings : $55,080
3-year savings : $165,240
Optimization pays for itself immediately at scale.
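The figures above follow from straightforward arithmetic; this quick check reproduces them, assuming the 85-token example payload used throughout this page, GPT-4 input pricing ($0.03/1K tokens), and a 60% reduction:

```python
# Reproduce the production-scale savings, assuming an 85-token average
# request, GPT-4 input pricing, and a 60% token reduction.
TOKENS_PER_REQUEST = 85
COST_PER_1K = 0.03
REDUCTION = 0.60
DAILY_REQUESTS = 100_000

def monthly_savings(days=30):
    baseline = DAILY_REQUESTS * days * TOKENS_PER_REQUEST / 1000 * COST_PER_1K
    return baseline * REDUCTION

print(f"Monthly savings: ${monthly_savings():,.2f}")      # $4,590.00
print(f"Yearly savings: ${monthly_savings() * 12:,.2f}")  # $55,080.00
```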
Toonify Token Optimization
Reduce token usage by 30-73% using TOON (Token-Oriented Object Notation) format
What is TOON?
TOON is a compact serialization format designed specifically for LLM token efficiency. It achieves CSV-like compression while maintaining structure and readability.
Key Benefits
63.9% Average Reduction Verified across 50 real-world datasets
73.4% for Tabular Data Optimal for structured, uniform data
Human Readable Still easy to understand and debug
<1ms Overhead Negligible conversion time
JSON (247 bytes, 85 tokens)

{
  "products": [
    {"id": 101, "name": "Laptop Pro", "price": 1299},
    {"id": 102, "name": "Magic Mouse", "price": 79},
    {"id": 103, "name": "USB-C Cable", "price": 19}
  ]
}

TOON (98 bytes, 39 tokens; 60% fewer bytes)
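In TOON's tabular form, the same array declares its field names once and emits one comma-separated row per item. A sketch of the encoded output, based on TOON's documented array syntax (exact formatting may vary by encoder version):

```
products[3]{id,name,price}:
  101,Laptop Pro,1299
  102,Magic Mouse,79
  103,USB-C Cable,19
```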
Token Savings: 85 → 39 tokens (54.1% reduction). Cost Impact: $2.55/1K requests → $1.17/1K requests.
Implementation
Convert Data to TOON
from toon import encode, decode
import json

# Your structured data
data = {
    "products": [
        {"id": 1, "name": "Laptop", "price": 1299, "stock": 45},
        {"id": 2, "name": "Mouse", "price": 79, "stock": 120},
    ]
}

# Convert to TOON format
toon_str = encode(data)
print(f"JSON: {len(json.dumps(data))} bytes")
print(f"TOON: {len(toon_str)} bytes")
Send to LLM
from openai import OpenAI

client = OpenAI()

# Use TOON format in prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Analyze this product data:\n\n{toon_str}"
    }]
)

print(response.choices[0].message.content)
Decode if Needed
# Convert back to Python objects
original_data = decode(toon_str)
assert original_data == data  # Roundtrip verification
Real-World Example
E-commerce Product Analysis
from toon import encode
from openai import OpenAI
import json

client = OpenAI()

# Sample product catalog (could be 100s of products)
products = [
    {"id": 1, "name": "Laptop Pro", "price": 1299, "stock": 45, "category": "Electronics"},
    {"id": 2, "name": "Magic Mouse", "price": 79, "stock": 120, "category": "Accessories"},
    {"id": 3, "name": "USB-C Hub", "price": 49, "stock": 200, "category": "Accessories"},
    {"id": 4, "name": "4K Monitor", "price": 599, "stock": 30, "category": "Electronics"},
    {"id": 5, "name": "Keyboard", "price": 129, "stock": 85, "category": "Accessories"},
    # ... potentially hundreds more
]

# Measure token reduction
json_str = json.dumps(products)
toon_str = encode(products)

print(f"JSON size: {len(json_str)} bytes")
print(f"TOON size: {len(toon_str)} bytes")
print(f"Reduction: {(len(json_str) - len(toon_str)) / len(json_str) * 100:.1f}%")

# Send optimized data to LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"""
Analyze this product catalog and provide:
1. Total inventory value
2. Low stock items (< 50 units)
3. Average price by category

Data:
{toon_str}
"""
    }]
)

print(response.choices[0].message.content)
Results:
JSON: 487 bytes, ~165 tokens
TOON: 186 bytes, ~68 tokens
Reduction: 58.8% tokens, 61.8% bytes
Best Use Cases
Optimal for:
Product catalogs
CSV exports
Database query results
API response data
Survey results
Analytics data
Token savings: 60-73%
Good for:
Configuration files
Uniform object arrays
API payloads
Log data
Token savings: 50-65%
Avoid TOON for:
Highly nested data (greater than 3 levels)
Irregular/heterogeneous structures
Small payloads (less than 100 bytes)
Binary data
When JSON compatibility is critical
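The rules of thumb above can be folded into a small pre-flight check. This is an illustrative helper of ours, not part of the toon library; the thresholds (3 levels of nesting, 100 bytes) come directly from the list above:

```python
import json

def max_depth(obj, depth=1):
    """Nesting depth of a parsed JSON value (scalars count as depth 1)."""
    if isinstance(obj, dict):
        return max((max_depth(v, depth + 1) for v in obj.values()), default=depth)
    if isinstance(obj, list):
        return max((max_depth(v, depth + 1) for v in obj), default=depth)
    return depth

def should_use_toon(data) -> bool:
    """Pre-flight check mirroring the guidance above: skip TOON for
    tiny payloads and deeply nested structures."""
    payload = json.dumps(data)
    if len(payload) < 100:   # small payloads: conversion overhead not worth it
        return False
    if max_depth(data) > 3:  # highly nested data compresses poorly
        return False
    return True

print(should_use_toon({"a": 1}))  # False - under 100 bytes
print(should_use_toon([{"id": i, "name": f"item{i}", "price": i * 10} for i in range(10)]))  # True
```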
Interactive Demo
import streamlit as st
import json
from toon import encode, decode
import tiktoken

st.title("🎯 Toonify Token Optimizer")

# Token counter
enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text):
    return len(enc.encode(text))

# Input data
data_input = st.text_area(
    "Paste your JSON data",
    value=json.dumps({
        "products": [
            {"id": 1, "name": "Laptop", "price": 1299},
            {"id": 2, "name": "Mouse", "price": 79},
        ]
    }, indent=2),
    height=200
)

if data_input:
    try:
        # Parse JSON
        data = json.loads(data_input)

        # Convert to TOON
        toon_str = encode(data)

        # Calculate metrics
        json_tokens = count_tokens(data_input)
        toon_tokens = count_tokens(toon_str)
        reduction = ((json_tokens - toon_tokens) / json_tokens) * 100

        # Display results
        col1, col2 = st.columns(2)
        with col1:
            st.subheader("JSON Format")
            st.code(data_input, language="json")
            st.metric("Tokens", json_tokens)
            st.metric("Bytes", len(data_input))
        with col2:
            st.subheader("TOON Format")
            st.code(toon_str, language="text")
            st.metric("Tokens", toon_tokens)
            st.metric("Bytes", len(toon_str))

        # Savings
        st.success(f"Token Reduction: {reduction:.1f}%")

        # Cost calculator
        st.subheader("Cost Savings Calculator")
        requests = st.number_input("Number of API requests", value=1000, step=1000)

        gpt4_cost_per_1k = 0.03
        json_cost = (json_tokens / 1000) * gpt4_cost_per_1k * requests
        toon_cost = (toon_tokens / 1000) * gpt4_cost_per_1k * requests
        savings = json_cost - toon_cost

        st.write(f"**JSON cost**: ${json_cost:.2f}")
        st.write(f"**TOON cost**: ${toon_cost:.2f}")
        st.write(f"**💰 Savings**: ${savings:.2f}")
    except Exception as e:
        st.error(f"Error: {e}")
Run with: streamlit run toonify_app.py
Token Reduction
Conversion Speed
Scale Test
| Dataset Type | Avg Reduction | Best Case | Worst Case |
|---|---|---|---|
| Tabular | 68.5% | 73.4% | 62.1% |
| Structured JSON | 61.2% | 67.8% | 54.3% |
| Nested JSON | 48.7% | 56.2% | 41.5% |
| Mixed | 55.4% | 63.9% | 47.8% |
| Operation | Time (avg) | Time (p99) |
|---|---|---|
| Encode | 0.42ms | 1.2ms |
| Decode | 0.38ms | 1.0ms |
| Roundtrip | 0.85ms | 2.1ms |
Test: 1,000-product catalog
JSON: 125KB, 42,500 tokens
TOON: 48KB, 16,800 tokens
Reduction: 60.5% tokens, 61.6% size
Encoding time: 3.2ms
Monthly savings : $2,142 (at 100K requests/day)
Headroom Context Optimization
Reduce token usage by 47-92% through intelligent context compression for AI agents
What is Headroom?
Headroom is a context optimization layer that compresses tool outputs and conversation history while preserving accuracy. Unlike simple truncation, it uses statistical analysis to keep what matters.
Key Benefits
47-92% Token Reduction Verified across production workloads
Zero Code Changes Transparent proxy integration
Reversible Compression LLM can retrieve original data via CCR
Provider Caching Optimizes for OpenAI/Anthropic caching
Core Features
SmartCrusher
CacheAligner
CCR System
Statistical Compression Keeps:
First N items (context)
Last N items (recency)
Anomalies (statistical outliers)
Query-relevant matches
Removes:
Repetitive boilerplate
Redundant middle sections
Low-information content
from headroom import SmartCrusher

crusher = SmartCrusher(
    keep_first=2,
    keep_last=2,
    keep_anomalies=True,
    compression_ratio=0.3
)

# Compress tool output
compressed = crusher.compress(tool_output)
Prefix Optimization
Stabilizes message prefixes for better provider-side caching:

from headroom import CacheAligner

aligner = CacheAligner(provider="anthropic")

# Optimize for cache hits
optimized_messages = aligner.align(messages)
Cache hit rate improvement:
OpenAI: 35% → 78%
Anthropic: 42% → 85%
Compress-Cache-Retrieve Reversible compression:
Compress : Reduce tokens
Cache : Store original
Retrieve : LLM requests if needed
from headroom import CCR

ccr = CCR()

# Compress with retrieval capability
compressed, cache_id = ccr.compress(large_content)

# LLM can request original
if llm_needs_more_detail:
    original = ccr.retrieve(cache_id)
Installation & Setup
Choose Integration Method
Proxy (Zero Code)
LangChain
Agno
# Start proxy server
headroom proxy --port 8787
# Point existing tools at proxy
export OPENAI_BASE_URL=http://localhost:8787/v1
export ANTHROPIC_BASE_URL=http://localhost:8787
# Use tools normally - compression is automatic
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

# Wrap your model
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use normally
response = llm.invoke("Analyze these logs")

# Check savings
print(f"Tokens saved: {llm.total_tokens_saved}")
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

# Wrap model
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

# Create agent
agent = Agent(
    model=model,
    tools=[search_github, search_code, query_db]
)

# Run with automatic compression
response = agent.run("Find memory leaks")
print(f"Saved: {model.total_tokens_saved} tokens")
These are actual results from production API calls, not estimates.
Code Search
SRE Debugging
Agent Workflow
GitHub Code Search (100 results)
Scenario: Search 100 code files for error handling patterns

| Metric | Before | After | Savings |
|---|---|---|---|
| Tokens | 17,765 | 1,408 | 92% |
| Cost (GPT-4) | $0.53 | $0.04 | $0.49 |
| Response Time | 8.2s | 2.1s | 74% |
Compression strategy:
Keep first 2 and last 2 results
Extract only relevant code sections
Remove boilerplate imports/comments
Preserve error handling patterns
Incident Log Analysis
Scenario: Debug a production outage from a 65K-token log file

| Metric | Before | After | Savings |
|---|---|---|---|
| Tokens | 65,694 | 5,118 | 92% |
| Cost (GPT-4) | $1.97 | $0.15 | $1.82 |
| Time to insight | 12.4s | 3.2s | 74% |
What was preserved:
All ERROR and FATAL entries
Anomalous log patterns
First/last entries for timeline
Stack traces
What was compressed:
Repetitive INFO logs
Standard health checks
Redundant timestamps
Scenario: Agent using 5 tools (search, database, API, logs, docs)

| Tool Call | Tokens Before | Tokens After | Reduction |
|---|---|---|---|
| GitHub search | 15,200 | 1,850 | 88% |
| DB query | 8,400 | 1,200 | 86% |
| API response | 12,600 | 2,100 | 83% |
| Log analysis | 18,900 | 2,400 | 87% |
| Docs search | 9,800 | 1,550 | 84% |
| Total | 64,900 | 9,100 | 86% |
Monthly savings at 1K agent runs : $1,674
Needle in Haystack Test
Setup:
100 production log entries
1 critical FATAL error at position 67
Question: “What caused the outage? Error code? Fix?”
Baseline (no compression):
Tokens: 10,144
Cost: $0.30
Response time: 4.8s
Answer: ✅ Correct (payment-gateway, PG-5523, increase max_connections)
With Headroom:
Tokens: 1,260 (87.6% reduction)
Cost: $0.04 (86.7% savings)
Response time: 1.2s (75% faster)
Answer: ✅ Correct (same details)
What Headroom kept:
Position 67: FATAL error (the needle)
Position 1-2: Context (timeline start)
Position 99-100: Most recent state
Position 45: Anomaly (connection spike)
What Headroom removed:
96 INFO/DEBUG entries
Repetitive health checks
Standard operational logs
Result: Same accuracy, 87.6% fewer tokens
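The selection Headroom made here can be approximated in a few lines. This is an illustrative sketch of the keep-first/keep-last/keep-severe policy only (it omits Headroom's anomaly scoring and query-relevance matching), not the actual implementation:

```python
def select_log_lines(lines, keep_first=2, keep_last=2,
                     severe=("ERROR", "FATAL")):
    """Illustrative keep-first/keep-last/keep-severe selection.

    Keeps the opening and closing entries for timeline context, plus any
    entry containing a severe level; everything else is dropped.
    """
    keep = set(range(keep_first)) | set(range(len(lines) - keep_last, len(lines)))
    for i, line in enumerate(lines):
        if any(level in line for level in severe):
            keep.add(i)
    return [lines[i] for i in sorted(keep)]

logs = [f"INFO health check ok #{i}" for i in range(100)]
logs[66] = "FATAL payment-gateway PG-5523: connection pool exhausted"  # position 67

kept = select_log_lines(logs)
print(len(kept))  # 5 lines survive: 2 first, 2 last, 1 FATAL
```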
Configuration
The LangChain integration exposes a few knobs:
Underlying LangChain chat model
Target compression ratio (0.3 = keep 30% of tokens)
Enable Compress-Cache-Retrieve for reversibility
Optimize for provider-side caching
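Put together, a configured wrapper might look like the following. The keyword names here are illustrative assumptions, not confirmed Headroom API; check them against the Headroom documentation before use:

```python
from headroom.integrations import HeadroomChatModel
from langchain_openai import ChatOpenAI

# Illustrative configuration - parameter names are assumptions.
llm = HeadroomChatModel(
    ChatOpenAI(model="gpt-4o"),  # underlying LangChain chat model
    compression_ratio=0.3,       # target: keep ~30% of tokens
    enable_ccr=True,             # Compress-Cache-Retrieve for reversibility
    cache_align=True,            # optimize prefixes for provider-side caching
)
```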
Best Use Cases
Useful for:
Long chat sessions
Multi-turn debugging
Context-heavy conversations
Memory-intensive agents
Average savings: 50-70%
Safety Guarantees
Never Removes Human Content User and assistant messages are always preserved in full
Never Breaks Tool Pairing Tool calls and responses stay together
Parse Failures = No-op Malformed content passes through unchanged
Reversible Compression LLM can retrieve original data via CCR
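The first and third guarantees can be expressed as a thin wrapper around any compressor. This sketch is ours, not Headroom's code; it assumes a `compress_fn` that may raise on content it cannot parse:

```python
def safe_compress(messages, compress_fn):
    """Apply compression only to tool messages; pass everything else through.

    - User/assistant content is never touched (guarantee 1).
    - If compression fails on a message, the original is kept (guarantee 3).
    """
    out = []
    for msg in messages:
        if msg.get("role") != "tool":
            out.append(msg)  # human content preserved in full
            continue
        try:
            out.append(dict(msg, content=compress_fn(msg["content"])))
        except Exception:
            out.append(msg)  # parse failure = no-op
    return out

msgs = [
    {"role": "user", "content": "Find the bug"},
    {"role": "tool", "content": "x" * 1000},
]
kept = safe_compress(msgs, lambda s: s[:100])
print(len(kept[0]["content"]), len(kept[1]["content"]))  # 12 100
```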
Best Practices
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def measure_optimization(original, optimized):
    original_tokens = len(enc.encode(original))
    optimized_tokens = len(enc.encode(optimized))
    reduction = ((original_tokens - optimized_tokens) / original_tokens) * 100

    # Calculate cost savings (GPT-4 pricing)
    cost_per_1k = 0.03
    original_cost = (original_tokens / 1000) * cost_per_1k
    optimized_cost = (optimized_tokens / 1000) * cost_per_1k

    return {
        "original_tokens": original_tokens,
        "optimized_tokens": optimized_tokens,
        "reduction_percent": reduction,
        "cost_savings": original_cost - optimized_cost
    }
from headroom.integrations import HeadroomChatModel
import logging

# Set up monitoring
llm = HeadroomChatModel(base_model)

# Log metrics
def log_metrics():
    logging.info(f"Total tokens saved: {llm.total_tokens_saved}")
    logging.info(f"Total cost saved: ${llm.total_cost_saved:.2f}")
    logging.info(f"Compression ratio: {llm.avg_compression_ratio:.1%}")

# Call after a batch of requests
log_metrics()
from toon import encode
from headroom.integrations import HeadroomChatModel

# Use both TOON and Headroom
def optimized_agent_call(structured_data, tools):
    # 1. Convert structured data to TOON
    toon_data = encode(structured_data)

    # 2. Use Headroom for tool outputs
    llm = HeadroomChatModel(base_model)

    # 3. Combine for maximum savings
    response = llm.invoke(
        f"Analyze this data and search for patterns:\n{toon_data}"
    )
    return response

# Result: 80-95% total token reduction
Cost Calculator
Interactive Cost Calculator
import streamlit as st

st.title("LLM Optimization Cost Calculator")

# Inputs
col1, col2 = st.columns(2)
with col1:
    avg_tokens = st.number_input("Average tokens per request", value=5000, step=100)
    requests_per_day = st.number_input("Requests per day", value=1000, step=100)
with col2:
    model = st.selectbox("Model", ["GPT-4", "GPT-4o", "Claude 3.5 Sonnet"])
    optimization = st.slider("Token reduction %", 0, 95, 60)

# Pricing ($ per 1K input tokens)
pricing = {
    "GPT-4": 0.03,
    "GPT-4o": 0.0025,
    "Claude 3.5 Sonnet": 0.003
}
cost_per_1k = pricing[model]

# Calculate
monthly_requests = requests_per_day * 30
yearly_requests = requests_per_day * 365

# Baseline
baseline_monthly = (avg_tokens / 1000) * cost_per_1k * monthly_requests
baseline_yearly = (avg_tokens / 1000) * cost_per_1k * yearly_requests

# Optimized
optimized_tokens = avg_tokens * (1 - optimization / 100)
optimized_monthly = (optimized_tokens / 1000) * cost_per_1k * monthly_requests
optimized_yearly = (optimized_tokens / 1000) * cost_per_1k * yearly_requests

# Display
st.subheader("Cost Analysis")
col1, col2, col3 = st.columns(3)
with col1:
    st.metric("Monthly Baseline", f"${baseline_monthly:.2f}")
    st.metric("Monthly Optimized", f"${optimized_monthly:.2f}")
    st.metric("Monthly Savings", f"${baseline_monthly - optimized_monthly:.2f}", delta=f"-{optimization}%")
with col2:
    st.metric("Yearly Baseline", f"${baseline_yearly:.2f}")
    st.metric("Yearly Optimized", f"${optimized_yearly:.2f}")
    st.metric("Yearly Savings", f"${baseline_yearly - optimized_yearly:.2f}", delta=f"-{optimization}%")
with col3:
    st.metric("3-Year Baseline", f"${baseline_yearly * 3:.2f}")
    st.metric("3-Year Optimized", f"${optimized_yearly * 3:.2f}")
    st.metric("3-Year Savings", f"${(baseline_yearly - optimized_yearly) * 3:.2f}", delta=f"-{optimization}%")
Resources
Toonify GitHub TOON format library and examples
Headroom GitHub Context optimization framework
Example Apps Complete optimization demos
OpenAI Tokenizer Test token counting