
Overview

The agent iteratively improves its answer until it meets quality thresholds, a key differentiator from single-pass RAG systems.
The agent evaluates its own answers, identifies gaps, and searches for additional information until it is confident or the maximum number of iterations is reached.

Iteration Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ITERATION LOOP                                │
│                                                                  │
│  ┌──────────────────┐                                           │
│  │ Generate Answer  │◄──────────────────────────────────┐       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐                                   │       │
│  │ Evaluate Quality │                                   │       │
│  │ • completeness   │                                   │       │
│  │ • specificity    │                                   │       │
│  │ • accuracy       │                                   │       │
│  │ • vs. reasoning  │ ← Checks if reasoning goals met   │       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐    YES    ┌─────────────────┐    │       │
│  │ Confidence < 90% │─────────► │ Search for more │────┘       │
│  │ & iterations left│           │ context (tools) │            │
│  └────────┬─────────┘           └─────────────────┘            │
│           │ NO                                                  │
│           ▼                                                     │
│     ┌───────────┐                                               │
│     │  OUTPUT   │                                               │
│     └───────────┘                                               │
└─────────────────────────────────────────────────────────────────┘
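The loop above can be sketched as a small driver function. This is a minimal sketch only: `generate_answer`, `evaluate`, and `search_more` are hypothetical stand-ins for the agent's LLM and retrieval stages, injected as callables so the control flow is visible on its own.

```python
def iterate(question, chunks, generate_answer, evaluate, search_more,
            threshold=0.90, max_iterations=3):
    """Generate -> evaluate -> search loop from the diagram.

    The three callables are illustrative stand-ins for the agent's
    actual LLM and retrieval stages.
    """
    answer = generate_answer(question, chunks)
    for _ in range(max_iterations):
        evaluation = evaluate(question, answer, chunks)
        if evaluation["overall_confidence"] >= threshold:
            return answer  # confidence threshold met: early termination
        chunks = chunks + search_more(evaluation["missing_info"])
        answer = generate_answer(question, chunks)  # regenerate with ALL chunks
    return answer  # max iterations reached
```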

Quality Evaluation

The agent uses a strict multi-dimensional scoring system (0-100 scale) to evaluate answer quality.

Evaluation Dimensions

Completeness

Question: Does the answer fully address the question?
Criteria:
  • All parts of the question answered
  • No critical information missing
  • Reasoning goals from Stage 2 met
Example:
Question: "What was Apple's Q4 revenue and how does it compare to Q3?"

Answer A (Score: 50):
"Apple's Q4 revenue was $89.5B."
↓ Missing comparison to Q3

Answer B (Score: 100):
"Apple's Q4 revenue was $89.5B, up from $85.8B in Q3, 
representing a 4.3% sequential increase."
↓ Complete answer
Specificity

Question: Does it include specific numbers, quotes, and details?
Criteria:
  • Exact financial figures (not “increased significantly”)
  • Direct quotes from transcripts/filings
  • Specific time periods (Q1 2025, not “recently”)
  • Proper citations
Example:
Low Specificity (Score: 40):
"Management was optimistic about AI."

High Specificity (Score: 95):
"CEO Tim Cook stated 'we see AI as transformative for our 
products' and noted AI features drove a 12% increase in iPhone 
upgrades [1]."
Accuracy

Question: Is the information factually correct based on sources?
Criteria:
  • Numbers match source material exactly
  • No hallucinated information
  • Calculations performed correctly
  • Citations point to correct sources
Verification:
  • Cross-reference numbers against chunks
  • Validate calculations (e.g., growth rates)
  • Ensure quotes are verbatim
Clarity

Question: Is the response well-structured and easy to understand?
Criteria:
  • Logical organization
  • Clear language (avoid jargon where possible)
  • Proper formatting (bullets, sections)
  • Smooth flow between ideas

Overall Confidence Calculation

overall_confidence = (
    0.35 * completeness_score +
    0.30 * specificity_score +
    0.25 * accuracy_score +
    0.10 * clarity_score
) / 100  # Normalize to 0-1 scale
Weights prioritize completeness and specificity, as these are most critical for financial Q&A.
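For example, dimension scores of 85/90/95/88 combine as follows. A minimal sketch of the weighted average; the function name is illustrative.

```python
def overall_confidence(scores):
    """Weighted average of 0-100 dimension scores, normalized to 0-1."""
    weights = {"completeness": 0.35, "specificity": 0.30,
               "accuracy": 0.25, "clarity": 0.10}
    return sum(w * scores[dim] for dim, w in weights.items()) / 100

conf = overall_confidence(
    {"completeness": 85, "specificity": 90, "accuracy": 95, "clarity": 88}
)
# conf is approximately 0.893, surfaced to the frontend as 0.89
```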

Evaluation Process

1. Generate evaluation prompt

def evaluate_answer_quality(question, answer, reasoning, chunks):
    """
    LLM evaluates answer against:
    - Original question
    - Research reasoning (from Stage 2)
    - Available source material (chunks)
    """
2. LLM scores each dimension

Returns structured evaluation:
{
  "completeness_score": 85,
  "specificity_score": 90,
  "accuracy_score": 95,
  "clarity_score": 88,
  "overall_confidence": 0.89,
  "issues": [
    "Missing year-over-year comparison"
  ],
  "missing_info": [
    "Prior year Q4 revenue figure"
  ],
  "suggestions": [
    "Search for Q4 2023 revenue data"
  ]
}
3. Check against threshold

Compare to answer mode threshold:
  • direct: 0.70
  • standard: 0.80
  • detailed: 0.90
  • deep_search: 0.95
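The comparison against the per-mode thresholds listed above can be sketched as a simple lookup; names are illustrative.

```python
# Confidence thresholds per answer mode, as listed above
THRESHOLDS = {"direct": 0.70, "standard": 0.80,
              "detailed": 0.90, "deep_search": 0.95}

def meets_threshold(confidence, mode):
    """True when the answer clears the bar for the given mode."""
    return confidence >= THRESHOLDS[mode]
```

For example, a 0.89 answer passes in standard mode but triggers another iteration in detailed mode.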
4. Stream evaluation to frontend

Event type: agent_decision
{
  "type": "agent_decision",
  "message": "Answer quality: 89%. Missing YoY comparison.",
  "data": {
    "confidence": 0.89,
    "issues": ["Missing year-over-year comparison"],
    "will_iterate": true
  }
}
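Building that payload is straightforward. A sketch only: `agent_decision_event` is a hypothetical helper name, and the streaming transport (SSE, WebSocket, etc.) is outside its scope.

```python
def agent_decision_event(confidence, issues, threshold):
    """Build the agent_decision payload shown above.

    Helper name is hypothetical; how the event is streamed to the
    frontend is handled elsewhere.
    """
    summary = "; ".join(issues) if issues else "No issues found."
    return {
        "type": "agent_decision",
        "message": f"Answer quality: {confidence:.0%}. {summary}",
        "data": {
            "confidence": confidence,
            "issues": issues,
            "will_iterate": confidence < threshold,
        },
    }

event = agent_decision_event(0.89, ["Missing year-over-year comparison"], 0.90)
```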

Follow-Up Search Strategy

When the agent needs more information, it generates search-optimized keyword phrases (not verbose questions).

Why Keyword Phrases?

OLD Approach (verbose questions):
❌ "What specific revenue growth percentage was reported and how does
   it compare to the previous quarter?"
❌ "Did executives provide updated capex guidance for 2025, and what
   portion was specifically tied to AI?"
Issues:
  • Question framing adds noise (“What”, “How”, “Did”)
  • Doesn’t match vector search patterns
  • Too verbose for semantic similarity

Keyword Phrase Generation

def generate_followup_keywords(evaluation, reasoning, question):
    """
    Input:
        evaluation.missing_info = ["Prior year Q4 revenue figure"]
        evaluation.suggestions = ["Search for Q4 2023 revenue data"]
    
    Output:
        [
            "Q4 2023 revenue quarterly results",
            "year over year revenue comparison 2024 2023"
        ]
    
    Characteristics:
    - Declarative keyword phrases (not questions)
    - Include temporal scope (Q4 2023, 2024 vs 2023)
    - Extract entities (revenue, quarterly results)
    - Remove question words (what, how, did)
    """
Each keyword phrase is searched across all target quarters in parallel.

Tool Selection

The agent can request additional searches from any available data source.

Iteration Flow

1. Iteration N starts

Stream event: iteration_start
{
  "type": "iteration_start",
  "data": {"iteration": 2, "max_iterations": 3}
}
2. Evaluate previous answer

LLM scores quality and identifies gaps
3. Check termination conditions

if confidence >= threshold:
    return answer  # Early termination

if iteration >= max_iterations:
    return answer  # Max iterations reached

if no followup_keywords:
    return answer  # Agent satisfied
4. Generate follow-up keywords

Stream event: iteration_followup
{
  "type": "iteration_followup",
  "message": "Searching for: capex guidance, margin trends",
  "data": {
    "keywords": ["capex guidance 2025", "margin trends profitability"]
  }
}
5. Execute searches

  • Parallel quarter search for each keyword
  • Optional news search
  • Deduplicate and merge new chunks
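The deduplicate-and-merge step might look like this. A sketch that assumes each chunk dict carries a stable "id" key; the real chunk schema may use a different identifier.

```python
def merge_chunks(existing, new):
    """Append only chunks not already accumulated.

    Assumes each chunk carries a stable "id" key (an assumption,
    not the documented schema).
    """
    seen = {c["id"] for c in existing}
    merged = list(existing)
    for chunk in new:
        if chunk["id"] not in seen:
            merged.append(chunk)
            seen.add(chunk["id"])
    return merged
```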
6. Stream new chunks found

Event: iteration_search
{
  "type": "iteration_search",
  "message": "Found 8 new chunks",
  "data": {"chunks_added": 8}
}
7. Regenerate answer

Use ALL accumulated chunks (previous + new)
8. Mark iteration complete

Event: iteration_complete
{
  "type": "iteration_complete",
  "data": {
    "iteration": 2,
    "confidence": 0.91,
    "will_continue": false
  }
}

Answer Mode Configuration

Different answer modes balance speed vs. thoroughness. Direct mode, shown below, is the fastest; limits and thresholds for the other modes are listed under Termination Conditions.

Direct Mode Configuration:
{
    "max_iterations": 2,
    "max_tokens": 2000,
    "confidence_threshold": 0.7,
}
When Used:
  • Quick factual lookups
  • Single-number questions
  • Time-sensitive queries
Example Questions:
  • “What was Q4 revenue?”
  • “When is the next earnings call?”
  • “What’s the current stock price target?”
Typical Performance:
  • 1-2 iterations
  • ~5-8 seconds
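Collecting the per-mode limits and thresholds documented on this page gives a single configuration map. A sketch; the real config likely carries more fields (e.g. `max_tokens`), which are omitted here because only the Direct value is documented.

```python
# Iteration limits and confidence thresholds per answer mode,
# as documented under Termination Conditions
ANSWER_MODES = {
    "direct":      {"max_iterations": 2,  "confidence_threshold": 0.70},
    "standard":    {"max_iterations": 3,  "confidence_threshold": 0.80},
    "detailed":    {"max_iterations": 4,  "confidence_threshold": 0.90},
    "deep_search": {"max_iterations": 10, "confidence_threshold": 0.95},
}
```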

Termination Conditions

The iteration loop stops when any of these conditions is met:

1. Confidence threshold met
if overall_confidence >= threshold:
    logger.info(f"Confidence {confidence:.0%} >= {threshold:.0%}, stopping")
    return answer
Threshold by mode:
  • Direct: 70%
  • Standard: 80%
  • Detailed: 90%
  • Deep Search: 95%
2. Maximum iterations reached
if iteration >= max_iterations:
    logger.info(f"Max iterations ({max_iterations}) reached")
    return answer
Limits by mode:
  • Direct: 2
  • Standard: 3
  • Detailed: 4
  • Deep Search: 10
3. Agent declares sufficiency
if evaluation.get("is_sufficient", False):
    logger.info("Agent determined answer is sufficient")
    return answer
Agent can decide answer is good enough even below threshold.
4. No follow-up searches identified
if not followup_keywords:
    logger.info("No additional searches identified")
    return answer
If agent can’t generate meaningful follow-up searches, stop.
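The four conditions can be folded into a single check. A minimal sketch with illustrative argument names.

```python
def should_stop(confidence, threshold, iteration, max_iterations,
                evaluation, followup_keywords):
    """Combine the four termination conditions; names are illustrative."""
    if confidence >= threshold:
        return True   # 1. quality bar met
    if iteration >= max_iterations:
        return True   # 2. iteration budget exhausted
    if evaluation.get("is_sufficient", False):
        return True   # 3. agent judges the answer good enough
    if not followup_keywords:
        return True   # 4. nothing left to search for
    return False
```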

Practical Examples

Question: “What was Apple’s Q4 2024 revenue?”
Mode: Direct (max 2 iterations, 70% threshold)
Iteration 1:
Initial search: "Q4 2024 revenue quarterly results"
Retrieved: 12 chunks

Answer: "Apple's Q4 2024 revenue was $94.9B [1]."

Evaluation:
- Completeness: 95 (direct answer to question)
- Specificity: 100 (exact number with citation)
- Accuracy: 100 (matches source)
- Clarity: 95 (clear and concise)
- Overall: 0.97 >= 0.70 ✓
Result: 1 iteration, ~6 seconds, early termination

Performance Characteristics

Average Iteration Counts

Answer Mode      Avg Iterations    Avg Time
Direct           1.3               ~6s
Standard         2.1               ~12s
Detailed         2.8               ~18s
Deep Search      N/A               N/A

Early Termination Rate

~65% of questions terminate early (before max iterations) due to confidence threshold being met.

Chunk Accumulation

Typical chunk counts:
- Iteration 1: 15-20 chunks
- Iteration 2: +8-12 chunks (23-32 total)
- Iteration 3: +4-8 chunks (27-40 total)
- Iteration 4: +2-5 chunks (29-45 total)

Best Practices

  • Ask complete questions - “What was revenue and why did it change?” gets better results than two separate questions
  • Specify time periods - “last 3 quarters” is clearer than “recently”
  • Trust the mode detection - The agent automatically selects appropriate depth
  • Monitor evaluation scores - Track which dimensions frequently score low
  • Tune thresholds carefully - Lower thresholds = faster but less thorough
  • Log follow-up keywords - Understand what additional searches are needed
  • Watch for infinite loops - Ensure follow-up keywords are progressively different

Next Steps

SEC Agent

Learn about the specialized 10-K agent’s iterative retrieval

Pipeline Stages

Review the complete 6-stage pipeline
