Debugging LLM applications differs from traditional software debugging. Issues are often subtle: wrong responses, inconsistent behavior, or silent failures that degrade quality without breaking functionality. Helicone provides comprehensive debugging tools to identify, diagnose, and resolve issues in your LLM applications.

Common LLM Issues

Errors & Timeouts

API failures, rate limits, timeouts, and provider outages

Quality Issues

Wrong answers, inconsistent outputs, hallucinations, and context loss

Performance Problems

Slow responses, high latency, and token inefficiency

Cost Overruns

Unexpected spending, inefficient prompts, and model selection

Debugging Workflow

1. Filter by Status Codes

Start by identifying failed requests using status code filters:
Helicone request page showing status code filter for error identification
Common status codes:
  • 200 - Success
  • 400 - Bad request (malformed input)
  • 401 - Authentication failed
  • 429 - Rate limit exceeded
  • 500 - Provider error
  • 503 - Provider unavailable
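As a sketch, these status codes can drive your error-handling logic. The helper below is hypothetical (not part of any SDK) and simply classifies the codes listed above into actions:

```typescript
// Hypothetical helper: map provider/proxy status codes to a handling strategy.
type StatusAction = "ok" | "fix-request" | "fix-auth" | "retry";

function classifyStatus(status: number): StatusAction {
  if (status === 200) return "ok";
  if (status === 400) return "fix-request"; // malformed input: retrying won't help
  if (status === 401) return "fix-auth"; // bad or missing API key
  if (status === 429 || status >= 500) return "retry"; // transient: back off and retry
  return "fix-request";
}
```

Treating 429/5xx as retryable and 4xx as caller errors is a common convention; adjust to your provider's semantics.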
// Add request IDs for easier debugging
const requestId = `req-${Date.now()}`;

const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [...],
  },
  {
    headers: {
      "Helicone-Request-Id": requestId,
      "Helicone-Property-Feature": "document-processing",
    },
  }
);
2. Inspect Request Details

Click on any request to see complete details:
Helicone request detail page with full request and response data
Key information available:
  • Full request body - Exact prompt and parameters sent
  • Complete response - What the model returned
  • Timing breakdown - Where latency occurred
  • Token usage - Input/output token counts
  • Cost - Exact cost of this request
  • Custom properties - Your metadata for filtering
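To correlate dashboard entries with your own application logs, you can record the same key fields locally. A minimal sketch, assuming the OpenAI SDK's response shape with a `usage` object:

```typescript
// Sketch: summarize a chat completion for local logs, mirroring the
// fields Helicone shows on the request detail page. The interface below
// is a simplified assumption of the OpenAI SDK response shape.
interface CompletionLike {
  model: string;
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}

function summarizeResponse(res: CompletionLike, latencyMs: number) {
  return {
    model: res.model,
    inputTokens: res.usage?.prompt_tokens ?? 0,
    outputTokens: res.usage?.completion_tokens ?? 0,
    latencyMs,
  };
}
```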
3. Use Playground for Testing

Test fixes immediately without redeploying code:
Playground button on request detail page
The Playground allows you to:
  • Modify the prompt and see new results
  • Change model parameters (temperature, max tokens)
  • Switch models to compare outputs
  • Test different approaches quickly
Helicone playground interface for testing prompts
Currently, only OpenAI models are supported in the Playground
4. Track Sessions for Context

Debug issues in multi-turn conversations by viewing complete sessions:
const sessionId = `session-${userId}-${Date.now()}`;

// First request in conversation
await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }],
  },
  {
    headers: {
      "Helicone-Session-Id": sessionId,
      "Helicone-Session-Name": "Customer Chat",
      "Helicone-Session-Path": "/greeting",
    },
  }
);

// Follow-up request (same session)
await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: conversationHistory,
  },
  {
    headers: {
      "Helicone-Session-Id": sessionId,
      "Helicone-Session-Path": "/follow-up",
    },
  }
);
Sessions help you:
  • See the full conversation context
  • Identify where context was lost
  • Track how costs accumulate
  • Understand user interaction patterns

Debugging Specific Issues

API Errors & Rate Limits

When you see 429 or 500 errors:
async function makeRequestWithRetry(
  client: OpenAI,
  params: any,
  maxRetries = 3
) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await client.chat.completions.create(
        params,
        {
          headers: {
            "Helicone-Property-Retry-Attempt": String(i),
          },
        }
      );
    } catch (error: any) {
      if (error?.status === 429 && i < maxRetries - 1) {
        // Exponential backoff
        const delay = Math.pow(2, i) * 1000;
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      throw error;
    }
  }
}

Quality Issues

When responses are wrong or inconsistent:
Filter requests by custom properties to identify patterns:
headers: {
  "Helicone-Property-Query-Type": "technical-support",
  "Helicone-Property-User-Type": "premium",
}
Then filter in the dashboard to see:
  • Do technical queries fail more often?
  • Are premium users having different issues?
  • Which features have the most quality problems?
Tag requests with prompt versions to compare quality:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    extra_headers={
        "Helicone-Property-Prompt-Version": "v2.1",
        "Helicone-Property-System-Prompt": "technical-assistant"
    }
)
This helps you:
  • A/B test prompt changes
  • Track quality regressions
  • Identify which version works best
Add quality scores to track improvements:
// After getting user feedback
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    scores: {
      "user-satisfaction": 5,
      "accuracy": 0.9,
      "helpfulness": 4,
    },
  }),
});

Performance Problems

When responses are slow:
Check the request details for timing breakdown:
  • Queue time - How long before processing started
  • Processing time - Model inference time
  • Network time - Transfer latency
// Add timing metadata
const startTime = Date.now();

const response = await client.chat.completions.create(
  params,
  {
    headers: {
      "Helicone-Property-Client-Start-Time": String(startTime),
    },
  }
);

const endTime = Date.now();
console.log(`Total latency: ${endTime - startTime}ms`);

Cost Overruns

When costs are higher than expected:
// Add cost tracking properties
headers: {
  "Helicone-Property-Feature": "document-analysis",
  "Helicone-Property-Document-Length": String(docLength),
  "Helicone-Session-Id": sessionId,
}
Then analyze in the dashboard:
  1. Filter by feature to find expensive operations
  2. Check session costs to see complete workflows
  3. Review token usage to identify inefficient prompts
  4. Compare model costs to find cheaper alternatives
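If you export your request data (for example via Helicone's API or a CSV export), the same analysis can be scripted. A sketch; the record shape below is an assumption for illustration, not Helicone's actual export schema:

```typescript
// Sketch: group exported request records by a custom property and total their cost.
interface RequestRecord {
  feature: string; // value of Helicone-Property-Feature
  cost: number; // USD
}

function costByFeature(records: RequestRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.feature, (totals.get(r.feature) ?? 0) + r.cost);
  }
  return totals;
}
```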
See the Cost Tracking guide for detailed optimization strategies.

Advanced Debugging Techniques

Custom Request IDs

Use predictable IDs to correlate with your own logs:
const requestId = `${userId}-${feature}-${timestamp}`;

headers: {
  "Helicone-Request-Id": requestId,
}
Then search for this ID in both Helicone and your application logs.
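One way to make that search reliable is to emit the same request ID in structured application logs. A hypothetical helper (the log shape is an assumption, not a Helicone requirement):

```typescript
// Sketch: emit a structured log line carrying the same request ID you send
// in the Helicone-Request-Id header, so you can grep for it in both places.
function logWithRequestId(
  requestId: string,
  event: string,
  data: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    requestId,
    event,
    ...data,
    ts: new Date().toISOString(),
  });
}
```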

Property-Based Filtering

Tag requests with rich metadata for powerful filtering:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    extra_headers={
        "Helicone-Property-Environment": os.getenv("ENV"),
        "Helicone-Property-User-Tier": user.tier,
        "Helicone-Property-Feature": "search",
        "Helicone-Property-Version": "v2.3",
        "Helicone-Property-AB-Test": "prompt-variant-B",
    }
)
Filter combinations like:
  • “Show me production errors for premium users”
  • “Compare v2.3 vs v2.2 response times”
  • “Which A/B test variant has better quality?”

Session Replay

Replay entire sessions to reproduce issues:
  1. Find the problematic session in the dashboard
  2. Click “Replay Session”
  3. View the exact sequence of requests
  4. Test fixes against the same inputs
Session replay is especially useful for debugging multi-turn conversations where context matters.

Debugging Checklist

When investigating an issue:
  • Check status codes for obvious errors
  • Review request/response in detail
  • Test fixes in Playground
  • Look at session context if multi-turn
  • Filter by custom properties to find patterns
  • Compare with working requests
  • Check timing breakdown for performance
  • Review token usage for cost issues
  • Add more logging for future debugging

Proactive Debugging

Prevent issues before they happen:

Set Up Alerts

// Configure in Helicone dashboard:
// 1. Error rate > 5%
// 2. Average latency > 2 seconds
// 3. Daily cost > $100
// 4. Any 500 errors

Add Comprehensive Logging

function makeTrackedRequest(feature: string, userId: string, params: any) {
  return client.chat.completions.create(
    params,
    {
      headers: {
        "Helicone-Session-Id": `${userId}-${Date.now()}`,
        "Helicone-Property-Feature": feature,
        "Helicone-Property-Environment": process.env.NODE_ENV,
        "Helicone-Property-Version": APP_VERSION,
        "Helicone-User-Id": userId,
      },
    }
  );
}

Monitor Key Metrics

Track these metrics weekly:
  • Error rate - Should stay below 2%
  • P95 latency - Should be under 3 seconds
  • Average cost per session - Watch for increases
  • Cache hit rate - Should be above 50% for cacheable content
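The first two metrics are straightforward to compute from logged requests. A sketch, assuming each record carries a status code and latency (the record shape is an assumption for illustration):

```typescript
// Sketch: weekly health metrics from request logs.
interface LoggedRequest {
  status: number;
  latencyMs: number;
}

// Fraction of requests with an error status (4xx/5xx).
function errorRate(reqs: LoggedRequest[]): number {
  if (reqs.length === 0) return 0;
  return reqs.filter((r) => r.status >= 400).length / reqs.length;
}

// 95th-percentile latency via the nearest-rank method.
function p95Latency(reqs: LoggedRequest[]): number {
  if (reqs.length === 0) return 0;
  const sorted = reqs.map((r) => r.latencyMs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}
```

Run these over a rolling window and compare against the thresholds above to catch regressions early.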

Debugging Tools Reference

Request Filters

Filter by status, model, properties, and more

Sessions

Track multi-turn conversations and workflows

Custom Properties

Add metadata for powerful filtering

Alerts

Get notified of issues immediately

Next Steps

Agent Tracing

Debug complex agent workflows with tool calls

Cost Tracking

Identify and optimize expensive operations

Experiments

A/B test fixes before deploying to production
