
Overview

The agent iteratively improves its answer until it meets quality thresholds, a key differentiator from single-pass RAG systems.
The agent evaluates its own answers, identifies gaps, and searches for additional information until it is confident or the maximum number of iterations is reached.

Iteration Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ITERATION LOOP                                │
│                                                                  │
│  ┌──────────────────┐                                           │
│  │ Generate Answer  │◄──────────────────────────────────┐       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐                                   │       │
│  │ Evaluate Quality │                                   │       │
│  │ • completeness   │                                   │       │
│  │ • specificity    │                                   │       │
│  │ • accuracy       │                                   │       │
│  │ • vs. reasoning  │ ← Checks if reasoning goals met   │       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐    YES    ┌─────────────────┐    │       │
│  │ Confidence < 90% │─────────► │ Search for more │────┘       │
│  │ & iterations left│           │ context (tools) │            │
│  └────────┬─────────┘           └─────────────────┘            │
│           │ NO                                                  │
│           ▼                                                     │
│     ┌───────────┐                                               │
│     │  OUTPUT   │                                               │
│     └───────────┘                                               │
└─────────────────────────────────────────────────────────────────┘
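The loop above can be sketched as a small driver function. This is a minimal sketch only: `generate_answer`, `evaluate`, and `search_more` are hypothetical stand-ins for the agent's LLM and retrieval stages, injected as callables so the control flow is visible on its own.

```python
def iterate(question, chunks, generate_answer, evaluate, search_more,
            threshold=0.90, max_iterations=3):
    """Generate -> evaluate -> search loop from the diagram.

    The three callables are illustrative stand-ins for the agent's
    actual LLM and retrieval stages.
    """
    answer = generate_answer(question, chunks)
    for _ in range(max_iterations):
        evaluation = evaluate(question, answer, chunks)
        if evaluation["overall_confidence"] >= threshold:
            return answer  # confidence threshold met: early termination
        chunks = chunks + search_more(evaluation["missing_info"])
        answer = generate_answer(question, chunks)  # regenerate with ALL chunks
    return answer  # max iterations reached
```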

Quality Evaluation

The agent uses a strict multi-dimensional scoring system (0-100 scale) to evaluate answer quality.

Evaluation Dimensions

Completeness

Question: Does the answer fully address the question?
Criteria:
  • All parts of the question answered
  • No critical information missing
  • Reasoning goals from Stage 2 met
Example:
Question: "What was Apple's Q4 revenue and how does it compare to Q3?"

Answer A (Score: 50):
"Apple's Q4 revenue was $89.5B."
↓ Missing comparison to Q3

Answer B (Score: 100):
"Apple's Q4 revenue was $89.5B, up from $85.8B in Q3, 
representing a 4.3% sequential increase."
↓ Complete answer
Specificity

Question: Does it include specific numbers, quotes, and details?
Criteria:
  • Exact financial figures (not “increased significantly”)
  • Direct quotes from transcripts/filings
  • Specific time periods (Q1 2025, not “recently”)
  • Proper citations
Example:
Low Specificity (Score: 40):
"Management was optimistic about AI."

High Specificity (Score: 95):
"CEO Tim Cook stated 'we see AI as transformative for our 
products' and noted AI features drove a 12% increase in iPhone 
upgrades [1]."
Accuracy

Question: Is the information factually correct based on sources?
Criteria:
  • Numbers match source material exactly
  • No hallucinated information
  • Calculations performed correctly
  • Citations point to correct sources
Verification:
  • Cross-reference numbers against chunks
  • Validate calculations (e.g., growth rates)
  • Ensure quotes are verbatim
Clarity

Question: Is the response well-structured and easy to understand?
Criteria:
  • Logical organization
  • Clear language (avoid jargon where possible)
  • Proper formatting (bullets, sections)
  • Smooth flow between ideas

Overall Confidence Calculation

overall_confidence = (
    0.35 * completeness_score +
    0.30 * specificity_score +
    0.25 * accuracy_score +
    0.10 * clarity_score
) / 100  # Normalize to 0-1 scale
Weights prioritize completeness and specificity, as these are most critical for financial Q&A.
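For example, dimension scores of 85/90/95/88 combine as follows. A minimal sketch of the weighted average; the function name is illustrative.

```python
def overall_confidence(scores):
    """Weighted average of 0-100 dimension scores, normalized to 0-1."""
    weights = {"completeness": 0.35, "specificity": 0.30,
               "accuracy": 0.25, "clarity": 0.10}
    return sum(w * scores[dim] for dim, w in weights.items()) / 100

conf = overall_confidence(
    {"completeness": 85, "specificity": 90, "accuracy": 95, "clarity": 88}
)
# conf is approximately 0.893, surfaced to the frontend as 0.89
```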

Evaluation Process

1. Generate evaluation prompt

def evaluate_answer_quality(question, answer, reasoning, chunks):
    """
    LLM evaluates answer against:
    - Original question
    - Research reasoning (from Stage 2)
    - Available source material (chunks)
    """
2. LLM scores each dimension

Returns structured evaluation:
{
  "completeness_score": 85,
  "specificity_score": 90,
  "accuracy_score": 95,
  "clarity_score": 88,
  "overall_confidence": 0.89,
  "issues": [
    "Missing year-over-year comparison"
  ],
  "missing_info": [
    "Prior year Q4 revenue figure"
  ],
  "suggestions": [
    "Search for Q4 2023 revenue data"
  ]
}
3. Check against threshold

Compare to answer mode threshold:
  • direct: 0.70
  • standard: 0.80
  • detailed: 0.90
  • deep_search: 0.95
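The comparison against the per-mode thresholds listed above can be sketched as a simple lookup; names are illustrative.

```python
# Confidence thresholds per answer mode, as listed above
THRESHOLDS = {"direct": 0.70, "standard": 0.80,
              "detailed": 0.90, "deep_search": 0.95}

def meets_threshold(confidence, mode):
    """True when the answer clears the bar for the given mode."""
    return confidence >= THRESHOLDS[mode]
```

For example, a 0.89 answer passes in standard mode but triggers another iteration in detailed mode.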
4. Stream evaluation to frontend

Event type: agent_decision
{
  "type": "agent_decision",
  "message": "Answer quality: 89%. Missing YoY comparison.",
  "data": {
    "confidence": 0.89,
    "issues": ["Missing year-over-year comparison"],
    "will_iterate": true
  }
}
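Building that payload is straightforward. A sketch only: `agent_decision_event` is a hypothetical helper name, and the streaming transport (SSE, WebSocket, etc.) is outside its scope.

```python
def agent_decision_event(confidence, issues, threshold):
    """Build the agent_decision payload shown above.

    Helper name is hypothetical; how the event is streamed to the
    frontend is handled elsewhere.
    """
    summary = "; ".join(issues) if issues else "No issues found."
    return {
        "type": "agent_decision",
        "message": f"Answer quality: {confidence:.0%}. {summary}",
        "data": {
            "confidence": confidence,
            "issues": issues,
            "will_iterate": confidence < threshold,
        },
    }

event = agent_decision_event(0.89, ["Missing year-over-year comparison"], 0.90)
```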

Follow-Up Search Strategy

When the agent needs more information, it generates search-optimized keyword phrases (not verbose questions).

Why Keyword Phrases?

OLD Approach (verbose questions):
❌ "What specific revenue growth percentage was reported and how does
   it compare to the previous quarter?"
❌ "Did executives provide updated capex guidance for 2025, and what
   portion was specifically tied to AI?"
Issues:
  • Question framing adds noise (“What”, “How”, “Did”)
  • Doesn’t match vector search patterns
  • Too verbose for semantic similarity

Keyword Phrase Generation

def generate_followup_keywords(evaluation, reasoning, question):
    """
    Input:
        evaluation.missing_info = ["Prior year Q4 revenue figure"]
        evaluation.suggestions = ["Search for Q4 2023 revenue data"]
    
    Output:
        [
            "Q4 2023 revenue quarterly results",
            "year over year revenue comparison 2024 2023"
        ]
    
    Characteristics:
    - Declarative keyword phrases (not questions)
    - Include temporal scope (Q4 2023, 2024 vs 2023)
    - Extract entities (revenue, quarterly results)
    - Remove question words (what, how, did)
    """
Each keyword phrase is searched across all target quarters in parallel.

Tool Selection

The agent can request additional searches from any available data source.

Iteration Flow

1. Iteration N starts

Stream event: iteration_start
{
  "type": "iteration_start",
  "data": {"iteration": 2, "max_iterations": 3}
}
2. Evaluate previous answer

LLM scores quality and identifies gaps
3. Check termination conditions

if confidence >= threshold:
    return answer  # Early termination

if iteration >= max_iterations:
    return answer  # Max iterations reached

if no followup_keywords:
    return answer  # Agent satisfied
4. Generate follow-up keywords

Stream event: iteration_followup
{
  "type": "iteration_followup",
  "message": "Searching for: capex guidance, margin trends",
  "data": {
    "keywords": ["capex guidance 2025", "margin trends profitability"]
  }
}
5. Execute searches

  • Parallel quarter search for each keyword
  • Optional news search
  • Deduplicate and merge new chunks
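The deduplicate-and-merge step might look like this. A sketch that assumes each chunk dict carries a stable "id" key; the real chunk schema may use a different identifier.

```python
def merge_chunks(existing, new):
    """Append only chunks not already accumulated.

    Assumes each chunk carries a stable "id" key (an assumption,
    not the documented schema).
    """
    seen = {c["id"] for c in existing}
    merged = list(existing)
    for chunk in new:
        if chunk["id"] not in seen:
            merged.append(chunk)
            seen.add(chunk["id"])
    return merged
```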
6. Stream new chunks found

Event: iteration_search
{
  "type": "iteration_search",
  "message": "Found 8 new chunks",
  "data": {"chunks_added": 8}
}
7. Regenerate answer

Use ALL accumulated chunks (previous + new)
8. Mark iteration complete

Event: iteration_complete
{
  "type": "iteration_complete",
  "data": {
    "iteration": 2,
    "confidence": 0.91,
    "will_continue": false
  }
}

Answer Mode Configuration

Different answer modes balance speed vs. thoroughness. Direct mode, shown below, is the fastest; limits and thresholds for the other modes are listed under Termination Conditions.

Direct Mode Configuration:
{
    "max_iterations": 2,
    "max_tokens": 2000,
    "confidence_threshold": 0.7,
}
When Used:
  • Quick factual lookups
  • Single-number questions
  • Time-sensitive queries
Example Questions:
  • “What was Q4 revenue?”
  • “When is the next earnings call?”
  • “What’s the current stock price target?”
Typical Performance:
  • 1-2 iterations
  • ~5-8 seconds
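Collecting the per-mode limits and thresholds documented on this page gives a single configuration map. A sketch; the real config likely carries more fields (e.g. `max_tokens`), which are omitted here because only the Direct value is documented.

```python
# Iteration limits and confidence thresholds per answer mode,
# as documented under Termination Conditions
ANSWER_MODES = {
    "direct":      {"max_iterations": 2,  "confidence_threshold": 0.70},
    "standard":    {"max_iterations": 3,  "confidence_threshold": 0.80},
    "detailed":    {"max_iterations": 4,  "confidence_threshold": 0.90},
    "deep_search": {"max_iterations": 10, "confidence_threshold": 0.95},
}
```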

Termination Conditions

The iteration loop stops when any of these conditions is met:

1. Confidence threshold met
if overall_confidence >= threshold:
    logger.info(f"Confidence {confidence:.0%} >= {threshold:.0%}, stopping")
    return answer
Threshold by mode:
  • Direct: 70%
  • Standard: 80%
  • Detailed: 90%
  • Deep Search: 95%
2. Maximum iterations reached
if iteration >= max_iterations:
    logger.info(f"Max iterations ({max_iterations}) reached")
    return answer
Limits by mode:
  • Direct: 2
  • Standard: 3
  • Detailed: 4
  • Deep Search: 10
3. Agent declares sufficiency
if evaluation.get("is_sufficient", False):
    logger.info("Agent determined answer is sufficient")
    return answer
Agent can decide answer is good enough even below threshold.
4. No follow-up searches identified
if not followup_keywords:
    logger.info("No additional searches identified")
    return answer
If agent can’t generate meaningful follow-up searches, stop.
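The four conditions can be folded into a single check. A minimal sketch with illustrative argument names.

```python
def should_stop(confidence, threshold, iteration, max_iterations,
                evaluation, followup_keywords):
    """Combine the four termination conditions; names are illustrative."""
    if confidence >= threshold:
        return True   # 1. quality bar met
    if iteration >= max_iterations:
        return True   # 2. iteration budget exhausted
    if evaluation.get("is_sufficient", False):
        return True   # 3. agent judges the answer good enough
    if not followup_keywords:
        return True   # 4. nothing left to search for
    return False
```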

Practical Examples

Question: “What was Apple’s Q4 2024 revenue?”
Mode: Direct (max 2 iterations, 70% threshold)
Iteration 1:
Initial search: "Q4 2024 revenue quarterly results"
Retrieved: 12 chunks

Answer: "Apple's Q4 2024 revenue was $94.9B [1]."

Evaluation:
- Completeness: 95 (direct answer to question)
- Specificity: 100 (exact number with citation)
- Accuracy: 100 (matches source)
- Clarity: 95 (clear and concise)
- Overall: 0.97 >= 0.70 ✓
Result: 1 iteration, ~6 seconds, early termination

Performance Characteristics

Average Iteration Counts

Answer Mode      Avg Iterations    Avg Time
Direct           1.3               ~6s
Standard         2.1               ~12s
Detailed         2.8               ~18s
Deep Search      N/A               N/A

Early Termination Rate

~65% of questions terminate early (before max iterations) due to confidence threshold being met.

Chunk Accumulation

Typical chunk counts:
- Iteration 1: 15-20 chunks
- Iteration 2: +8-12 chunks (23-32 total)
- Iteration 3: +4-8 chunks (27-40 total)
- Iteration 4: +2-5 chunks (29-45 total)

Best Practices

  • Ask complete questions - “What was revenue and why did it change?” gets better results than two separate questions
  • Specify time periods - “last 3 quarters” is clearer than “recently”
  • Trust the mode detection - The agent automatically selects appropriate depth
  • Monitor evaluation scores - Track which dimensions frequently score low
  • Tune thresholds carefully - Lower thresholds = faster but less thorough
  • Log follow-up keywords - Understand what additional searches are needed
  • Watch for infinite loops - Ensure follow-up keywords are progressively different

Next Steps

SEC Agent

Learn about the specialized 10-K agent’s iterative retrieval

Pipeline Stages

Review the complete 6-stage pipeline
