The RAG (Retrieval-Augmented Generation) pipeline is the core of the support system’s answer generation capability. It combines semantic search with large language model generation to produce grounded, contextual responses.
Architecture Overview
The SimpleRetrievalAgent orchestrates three main phases:
Retrieve - Semantic search over the knowledge base
Generate - LLM-based answer synthesis
Format - Structured output with citations and metadata
The pipeline only generates answers when relevant context is found. If no chunks are retrieved, it returns a fallback message and flags the case for human review.
Phase 1: Retrieval
The retrieval phase uses a Chroma vector store to find relevant document chunks by semantic similarity.
def retrieve(
    self,
    query: str,
    predicted_category: str,
    k: int = 5,
) -> List[Dict]:
    """
    Retrieve the top-k relevant chunks from the vector store.
    """
    filters = {"category": predicted_category}
    results = self.vectordb.similarity_search_with_relevance_scores(
        query,
        k=k,
        filter=filters,
    )
    return [
        {
            "content": doc.page_content,
            "score": score,
            "metadata": doc.metadata,
        }
        for doc, score in results
    ]
Retrieval is filtered by the predicted category from the triage model, ensuring domain-relevant results.
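The filter-then-rank behavior can be illustrated without Chroma or an embeddings API. The sketch below is a toy stand-in: hand-written 2-d vectors and plain cosine similarity replace the real vector store, and all names (`CHUNKS`, `retrieve_toy`, the category labels) are illustrative, not part of the pipeline.

```python
from math import sqrt
from typing import Dict, List

# Toy corpus: each chunk carries an embedding and a category tag,
# mirroring the metadata that the real vector store filters on.
CHUNKS = [
    {"content": "Reset your password from the login page.",
     "category": "account", "vec": [0.9, 0.1]},
    {"content": "Refunds are processed within 5 days.",
     "category": "billing", "vec": [0.1, 0.9]},
    {"content": "Update billing details under Settings.",
     "category": "billing", "vec": [0.2, 0.8]},
]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 for zero-norm inputs)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_toy(query_vec: List[float], category: str, k: int = 5) -> List[Dict]:
    """Filter by category first, then rank the survivors by similarity."""
    candidates = [c for c in CHUNKS if c["category"] == category]
    scored = [
        {"content": c["content"], "score": cosine(query_vec, c["vec"])}
        for c in candidates
    ]
    return sorted(scored, key=lambda c: c["score"], reverse=True)[:k]
```

Because the category filter runs before ranking, `retrieve_toy([0.1, 0.9], "billing", k=1)` can only ever return billing chunks, which is exactly the domain-relevance guarantee the triage filter provides.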
Configuration
The system uses OpenAI’s text-embedding-3-small model for embeddings:
def build_embeddings() -> OpenAIEmbeddings:
    """Create the embeddings client."""
    return OpenAIEmbeddings(
        model="text-embedding-3-small",
        openai_api_key=OPENAI_API_KEY,
    )
Phase 2: Generation
The generation phase uses GPT-4.1 to synthesize a grounded answer from retrieved context.
def generate_answer(
    self,
    context: str,
    query: str,
    predicted_category: str,
    priority: str,
) -> str:
    """
    Generate a grounded answer from retrieved context.
    """
    prompt = generate_prompt(
        predicted_category,
        context,
        query,
        priority,
    )
    return self.llm.invoke([HumanMessage(content=prompt)]).content
The LLM temperature is set to 0.0 for maximum consistency in support responses.
LLM Configuration
DEFAULT_LLM_MODEL = "gpt-4.1"
DEFAULT_TEMPERATURE = 0.0

def build_llm() -> ChatOpenAI:
    """Create the LLM client."""
    return ChatOpenAI(
        api_key=OPENAI_API_KEY,
        model_name=DEFAULT_LLM_MODEL,
        temperature=DEFAULT_TEMPERATURE,
    )
Phase 3: Formatting
The final phase structures the output with citations, internal next steps, and review flags.
def format_response(
    self,
    answer: str,
    internal_next_steps: List[str],
    chunks: List[Dict],
    needs_human_review: bool,
) -> Dict:
    """
    Build the final structured response.
    """
    citations = [
        {
            "document_name": c["metadata"].get("filename", "unknown"),
            "chunk_id": c["metadata"].get("element_id"),
            "snippet": c["content"][:35],
            "full_content": c["content"],
        }
        for c in chunks
    ]
    return {
        "draft_reply": answer,
        "internal_next_steps": internal_next_steps,
        "citations": citations,
        "needs_human_review": needs_human_review,
    }
End-to-End Flow
The answer() method orchestrates all three phases:
def answer(
    self,
    query: str,
    predicted_category: str,
    priority: str,
    confidence: Dict[str, float],
    k: int = 5,
) -> Dict:
    """
    End-to-end RAG pipeline.
    """
    # Phase 1: Retrieve
    chunks = self.retrieve(
        query,
        predicted_category=predicted_category,
        k=k,
    )
    if not chunks:
        return {
            "draft_reply": "Insufficient context. Please clarify your request.",
            "internal_next_steps": [],
            "citations": [],
            "needs_human_review": True,
        }
    context_text = "\n\n".join(c["content"] for c in chunks)

    # Phase 2: Generate
    answer = self.generate_answer(
        context=context_text,
        query=query,
        predicted_category=predicted_category,
        priority=priority,
    )
    internal_next_steps = generate_internal_next_steps(
        context=context_text,
        query=query,
    )

    # Determine review flag
    needs_human_review = (
        confidence.get("category", 0) < CATEGORY_CONF_THRESHOLD
        or confidence.get("priority", 0) < PRIORITY_CONF_THRESHOLD
    )

    # Phase 3: Format
    return self.format_response(
        answer=answer,
        internal_next_steps=internal_next_steps,
        chunks=chunks,
        needs_human_review=needs_human_review,
    )
Confidence Thresholds
Responses are flagged for human review when category confidence < 0.5 or priority confidence < 0.5.
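A minimal sketch of that gate as a pure function. The 0.5 thresholds come from the text above, and the constant names mirror those used in `answer()`; treating a missing score as 0 is the same fail-safe default as `confidence.get(..., 0)` in the pipeline.

```python
CATEGORY_CONF_THRESHOLD = 0.5
PRIORITY_CONF_THRESHOLD = 0.5

def needs_human_review(confidence: dict) -> bool:
    """Flag for review when either triage confidence falls below its threshold.

    A missing score defaults to 0, which always triggers review (fail safe).
    """
    return (
        confidence.get("category", 0) < CATEGORY_CONF_THRESHOLD
        or confidence.get("priority", 0) < PRIORITY_CONF_THRESHOLD
    )
```

For example, `needs_human_review({"category": 0.9, "priority": 0.4})` flags the response even though the category prediction is confident, because the gate is an OR over both scores.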
Triage Models - Learn how category and priority are predicted
Knowledge Base - Explore document ingestion and vector storage
Structured Outputs - Understand citations and internal next steps