
Overview

LLM API costs are directly tied to token count. These optimization tools help you reduce token usage while maintaining accuracy, enabling cost-effective AI applications at scale.

TOON Format

63.9% average token reduction for structured data

Headroom

47-92% token savings through intelligent compression

Why Optimize?

Impact on API Costs

Based on GPT-4 pricing ($0.03/1K input tokens):
Usage Volume     Standard Cost    With Optimization (60% reduction)    Savings
1,000 calls      $2.55            $1.02                                $1.53
100,000 calls    $255.00          $102.00                              $153.00
1M calls         $2,550.00        $1,020.00                            $1,530.00
10M calls        $25,500.00       $10,200.00                           $15,300.00
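
These figures follow from straightforward per-token arithmetic. A quick sanity check of the table, assuming roughly 85 input tokens per call (the per-call count implied by the $2.55 / 1,000 calls row):

```python
# GPT-4 input pricing and the per-call token count implied by the table
PRICE_PER_1K = 0.03   # dollars per 1K input tokens
TOKENS_PER_CALL = 85  # assumption implied by the $2.55 / 1,000 calls row

def api_cost(calls: int, reduction: float = 0.0) -> float:
    """Total input-token cost for `calls` requests at a given token reduction."""
    effective_tokens = TOKENS_PER_CALL * (1 - reduction)
    return calls * effective_tokens / 1000 * PRICE_PER_1K

standard = api_cost(1_000)          # $2.55
optimized = api_cost(1_000, 0.60)   # $1.02
print(f"${standard:.2f} -> ${optimized:.2f} (saves ${standard - optimized:.2f})")
```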

Toonify Token Optimization

Reduce token usage by 30-73% using TOON (Token-Oriented Object Notation) format

What is TOON?

TOON is a compact serialization format designed specifically for LLM token efficiency. It achieves CSV-like compression while maintaining structure and readability.

Key Benefits

63.9% Average Reduction

Verified across 50 real-world datasets

73.4% for Tabular Data

Optimal for structured, uniform data

Human Readable

Still easy to understand and debug

<1ms Overhead

Negligible conversion time

Format Comparison

{
  "products": [
    {"id": 101, "name": "Laptop Pro", "price": 1299},
    {"id": 102, "name": "Magic Mouse", "price": 79},
    {"id": 103, "name": "USB-C Cable", "price": 19}
  ]
}
Token Savings: 85 → 39 tokens (54.1% reduction)
Cost Impact: $2.55/1K requests → $1.17/1K requests
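
For comparison, the same array in TOON's tabular syntax looks roughly like this (illustrative; exact output depends on the encoder version):

```
products[3]{id,name,price}:
  101,Laptop Pro,1299
  102,Magic Mouse,79
  103,USB-C Cable,19
```

The field names are declared once in the header instead of being repeated for every row, which is where most of the savings on uniform data come from.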

Implementation

Step 1: Install Toonify

pip install toonify

Step 2: Convert Data to TOON

from toon import encode, decode
import json

# Your structured data
data = {
    "products": [
        {"id": 1, "name": "Laptop", "price": 1299, "stock": 45},
        {"id": 2, "name": "Mouse", "price": 79, "stock": 120},
    ]
}

# Convert to TOON format
toon_str = encode(data)
print(f"JSON: {len(json.dumps(data))} bytes")
print(f"TOON: {len(toon_str)} bytes")

Step 3: Send to LLM

from openai import OpenAI

client = OpenAI()

# Use TOON format in prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Analyze this product data:\n\n{toon_str}"
    }]
)

print(response.choices[0].message.content)

Step 4: Decode if Needed

# Convert back to Python objects
original_data = decode(toon_str)
assert original_data == data  # Roundtrip verification

Real-World Example

from toon import encode
from openai import OpenAI
import json

client = OpenAI()

# Sample product catalog (could be 100s of products)
products = [
    {"id": 1, "name": "Laptop Pro", "price": 1299, "stock": 45, "category": "Electronics"},
    {"id": 2, "name": "Magic Mouse", "price": 79, "stock": 120, "category": "Accessories"},
    {"id": 3, "name": "USB-C Hub", "price": 49, "stock": 200, "category": "Accessories"},
    {"id": 4, "name": "4K Monitor", "price": 599, "stock": 30, "category": "Electronics"},
    {"id": 5, "name": "Keyboard", "price": 129, "stock": 85, "category": "Accessories"},
    # ... potentially hundreds more
]

# Measure token reduction
json_str = json.dumps(products)
toon_str = encode(products)

print(f"JSON size: {len(json_str)} bytes")
print(f"TOON size: {len(toon_str)} bytes")
print(f"Reduction: {((len(json_str) - len(toon_str)) / len(json_str) * 100):.1f}%")

# Send optimized data to LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"""
        Analyze this product catalog and provide:
        1. Total inventory value
        2. Low stock items (< 50 units)
        3. Average price by category
        
        Data:
        {toon_str}
        """
    }]
)

print(response.choices[0].message.content)
Results:
  • JSON: 487 bytes, ~165 tokens
  • TOON: 186 bytes, ~68 tokens
  • Reduction: 58.8% tokens, 61.8% bytes

Best Use Cases

Optimal for:
  • Product catalogs
  • CSV exports
  • Database query results
  • API response data
  • Survey results
  • Analytics data
Token savings: 60-73%
Good for:
  • Configuration files
  • Uniform object arrays
  • API payloads
  • Log data
Token savings: 50-65%
Avoid TOON for:
  • Highly nested data (greater than 3 levels)
  • Irregular/heterogeneous structures
  • Small payloads (less than 100 bytes)
  • Binary data
  • When JSON compatibility is critical
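
The guidelines above can be captured in a small guard. The helper below is hypothetical (not part of the toon library); it applies the size and nesting thresholds stated here before choosing TOON:

```python
import json

def container_depth(obj) -> int:
    """Nesting depth counting only dicts/lists (a flat dict has depth 1)."""
    if isinstance(obj, dict):
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return 0
    return 1 + max((container_depth(c) for c in children), default=0)

def should_use_toon(data) -> bool:
    """Skip TOON for tiny payloads (<100 bytes) or deep nesting (>3 levels)."""
    payload = json.dumps(data)
    return len(payload) >= 100 and container_depth(data) <= 3

catalog = {"products": [
    {"id": 1, "name": "Laptop", "price": 1299},
    {"id": 2, "name": "Mouse", "price": 79},
    {"id": 3, "name": "Cable", "price": 19},
]}
print(should_use_toon(catalog))        # True: large enough, only 3 levels deep
print(should_use_toon({"ok": True}))   # False: payload is under 100 bytes
```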

Interactive Demo

import streamlit as st
import json
from toon import encode, decode
import tiktoken

st.title("🎯 Toonify Token Optimizer")

# Token counter
enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text):
    return len(enc.encode(text))

# Input data
data_input = st.text_area(
    "Paste your JSON data",
    value=json.dumps({
        "products": [
            {"id": 1, "name": "Laptop", "price": 1299},
            {"id": 2, "name": "Mouse", "price": 79},
        ]
    }, indent=2),
    height=200
)

if data_input:
    try:
        # Parse JSON
        data = json.loads(data_input)
        
        # Convert to TOON
        toon_str = encode(data)
        
        # Calculate metrics
        json_tokens = count_tokens(data_input)
        toon_tokens = count_tokens(toon_str)
        reduction = ((json_tokens - toon_tokens) / json_tokens) * 100
        
        # Display results
        col1, col2 = st.columns(2)
        
        with col1:
            st.subheader("JSON Format")
            st.code(data_input, language="json")
            st.metric("Tokens", json_tokens)
            st.metric("Bytes", len(data_input))
        
        with col2:
            st.subheader("TOON Format")
            st.code(toon_str, language="text")
            st.metric("Tokens", toon_tokens)
            st.metric("Bytes", len(toon_str))
        
        # Savings
        st.success(f"Token Reduction: {reduction:.1f}%")
        
        # Cost calculator
        st.subheader("Cost Savings Calculator")
        requests = st.number_input("Number of API requests", value=1000, step=1000)
        
        gpt4_cost_per_1k = 0.03
        json_cost = (json_tokens / 1000) * gpt4_cost_per_1k * requests
        toon_cost = (toon_tokens / 1000) * gpt4_cost_per_1k * requests
        savings = json_cost - toon_cost
        
        st.write(f"**JSON cost**: ${json_cost:.2f}")
        st.write(f"**TOON cost**: ${toon_cost:.2f}")
        st.write(f"**💰 Savings**: ${savings:.2f}")
        
    except Exception as e:
        st.error(f"Error: {e}")
Run with: streamlit run toonify_app.py

Performance Benchmarks

Dataset Type       Avg Reduction    Best Case    Worst Case
Tabular            68.5%            73.4%        62.1%
Structured JSON    61.2%            67.8%        54.3%
Nested JSON        48.7%            56.2%        41.5%
Mixed              55.4%            63.9%        47.8%

Headroom Context Optimization

Reduce token usage by 47-92% through intelligent context compression for AI agents

What is Headroom?

Headroom is a context optimization layer that compresses tool outputs and conversation history while preserving accuracy. Unlike simple truncation, it uses statistical analysis to keep what matters.

Key Benefits

47-92% Token Reduction

Verified across production workloads

Zero Code Changes

Transparent proxy integration

Reversible Compression

LLM can retrieve original data via CCR

Provider Caching

Optimizes for OpenAI/Anthropic caching

Core Features

Statistical Compression

Keeps:
  • First N items (context)
  • Last N items (recency)
  • Anomalies (statistical outliers)
  • Query-relevant matches
Removes:
  • Repetitive boilerplate
  • Redundant middle sections
  • Low-information content
from headroom import SmartCrusher

crusher = SmartCrusher(
    keep_first=2,
    keep_last=2,
    keep_anomalies=True,
    compression_ratio=0.3
)

# Compress tool output
compressed = crusher.compress(tool_output)
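
Conceptually, the keep/remove logic works like the following sketch (illustrative only, not Headroom's actual implementation; `values` is an assumed numeric signal per item, such as latency or an error-severity score):

```python
from statistics import mean, stdev

def select_items(items, values, keep_first=2, keep_last=2, z_threshold=3.0):
    """Keep the head (context), tail (recency), and statistical outliers
    (anomalies); drop the repetitive middle."""
    n = len(items)
    keep = set(range(min(keep_first, n))) | set(range(max(n - keep_last, 0), n))
    if n > 2:
        mu, sigma = mean(values), stdev(values)
        for i, v in enumerate(values):
            if sigma and abs(v - mu) / sigma > z_threshold:
                keep.add(i)  # anomaly: statistical outlier
    return [items[i] for i in sorted(keep)]

# 100 routine log entries with one severity spike at index 66
severity = [1.0] * 100
severity[66] = 50.0
kept = select_items(list(range(100)), severity)
print(kept)  # [0, 1, 66, 98, 99]
```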

Installation & Setup

Step 1: Install Headroom

pip install headroom-ai

Step 2: Choose Integration Method

# Start proxy server
headroom proxy --port 8787

# Point existing tools at proxy
export OPENAI_BASE_URL=http://localhost:8787/v1
export ANTHROPIC_BASE_URL=http://localhost:8787

# Use tools normally - compression is automatic

Real-World Performance

These are actual results from production API calls, not estimates.

Needle in Haystack Test

Setup:
  • 100 production log entries
  • 1 critical FATAL error at position 67
  • Question: “What caused the outage? Error code? Fix?”
Baseline (no compression):
Tokens: 10,144
Cost: $0.30
Response time: 4.8s
Answer: ✅ Correct (payment-gateway, PG-5523, increase max_connections)
With Headroom:
Tokens: 1,260 (87.6% reduction)
Cost: $0.04 (86.7% savings)
Response time: 1.2s (75% faster)
Answer: ✅ Correct (same details)
What Headroom kept:
  • Position 67: FATAL error (the needle)
  • Position 1-2: Context (timeline start)
  • Position 99-100: Most recent state
  • Position 45: Anomaly (connection spike)
What Headroom removed:
  • 96 INFO/DEBUG entries
  • Repetitive health checks
  • Standard operational logs
Result: Same accuracy, 87.6% fewer tokens

Configuration

HeadroomChatModel (class): LangChain integration with compression

Best Use Cases

Optimal for:
  • Multi-tool workflows
  • Code search agents
  • Database query agents
  • API integration agents
  • Log analysis agents
Average savings: 75-90%
Ideal for:
  • Code search results (100+ files)
  • Database query results (1000+ rows)
  • API responses (large JSON)
  • Log files (10K+ lines)
  • Documentation searches
Average savings: 80-92%
Useful for:
  • Long chat sessions
  • Multi-turn debugging
  • Context-heavy conversations
  • Memory-intensive agents
Average savings: 50-70%

Safety Guarantees

Never Removes Human Content

User and assistant messages are always preserved in full

Never Breaks Tool Pairing

Tool calls and responses stay together

Parse Failures = No-op

Malformed content passes through unchanged

Reversible Compression

LLM can retrieve original data via CCR

Best Practices

Use Case                       Tool        Expected Savings
Structured data (JSON, CSV)    TOON        60-73%
AI agent tool outputs          Headroom    75-92%
Large API responses            Both        80-95%
Conversation history           Headroom    50-70%
Mixed/nested JSON              TOON        45-60%
Measure token reduction and cost savings with tiktoken:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def measure_optimization(original, optimized):
    original_tokens = len(enc.encode(original))
    optimized_tokens = len(enc.encode(optimized))
    
    reduction = ((original_tokens - optimized_tokens) / original_tokens) * 100
    
    # Calculate cost savings (GPT-4 pricing)
    cost_per_1k = 0.03
    original_cost = (original_tokens / 1000) * cost_per_1k
    optimized_cost = (optimized_tokens / 1000) * cost_per_1k
    
    return {
        "original_tokens": original_tokens,
        "optimized_tokens": optimized_tokens,
        "reduction_percent": reduction,
        "cost_savings": original_cost - optimized_cost
    }
Track cumulative savings when using the LangChain integration:

from headroom.integrations import HeadroomChatModel
import logging

# Set up monitoring (base_model is your underlying LangChain chat model)
llm = HeadroomChatModel(base_model)

# Log metrics
def log_metrics():
    logging.info(f"Total tokens saved: {llm.total_tokens_saved}")
    logging.info(f"Total cost saved: ${llm.total_cost_saved:.2f}")
    logging.info(f"Compression ratio: {llm.avg_compression_ratio:.1%}")

# Call after batch of requests
log_metrics()
Combine TOON and Headroom for maximum savings:

from toon import encode
from headroom.integrations import HeadroomChatModel

# Use both TOON and Headroom
def optimized_agent_call(structured_data, tools):
    # 1. Convert structured data to TOON
    toon_data = encode(structured_data)
    
    # 2. Use Headroom for tool outputs
    llm = HeadroomChatModel(base_model)
    
    # 3. Combine for maximum savings
    response = llm.invoke(
        f"Analyze this data and search for patterns:\n{toon_data}"
    )
    
    return response

# Result: 80-95% total token reduction

Cost Calculator

import streamlit as st

st.title("LLM Optimization Cost Calculator")

# Inputs
col1, col2 = st.columns(2)

with col1:
    avg_tokens = st.number_input("Average tokens per request", value=5000, step=100)
    requests_per_day = st.number_input("Requests per day", value=1000, step=100)
    
with col2:
    model = st.selectbox("Model", ["GPT-4", "GPT-4o", "Claude 3.5 Sonnet"])
    optimization = st.slider("Token reduction %", 0, 95, 60)

# Pricing
pricing = {
    "GPT-4": 0.03,
    "GPT-4o": 0.0025,
    "Claude 3.5 Sonnet": 0.003
}

cost_per_1k = pricing[model]

# Calculate
monthly_requests = requests_per_day * 30
yearly_requests = requests_per_day * 365

# Baseline
baseline_monthly = (avg_tokens / 1000) * cost_per_1k * monthly_requests
baseline_yearly = (avg_tokens / 1000) * cost_per_1k * yearly_requests

# Optimized
optimized_tokens = avg_tokens * (1 - optimization / 100)
optimized_monthly = (optimized_tokens / 1000) * cost_per_1k * monthly_requests
optimized_yearly = (optimized_tokens / 1000) * cost_per_1k * yearly_requests

# Display
st.subheader("Cost Analysis")

col1, col2, col3 = st.columns(3)

with col1:
    st.metric("Monthly Baseline", f"${baseline_monthly:.2f}")
    st.metric("Monthly Optimized", f"${optimized_monthly:.2f}")
    st.metric("Monthly Savings", f"${baseline_monthly - optimized_monthly:.2f}", delta=f"-{optimization}%")

with col2:
    st.metric("Yearly Baseline", f"${baseline_yearly:.2f}")
    st.metric("Yearly Optimized", f"${optimized_yearly:.2f}")
    st.metric("Yearly Savings", f"${baseline_yearly - optimized_yearly:.2f}", delta=f"-{optimization}%")

with col3:
    st.metric("3-Year Baseline", f"${baseline_yearly * 3:.2f}")
    st.metric("3-Year Optimized", f"${optimized_yearly * 3:.2f}")
    st.metric("3-Year Savings", f"${(baseline_yearly - optimized_yearly) * 3:.2f}", delta=f"-{optimization}%")

Resources

Toonify GitHub

TOON format library and examples

Headroom GitHub

Context optimization framework

Example Apps

Complete optimization demos

OpenAI Tokenizer

Test token counting
