
Overview

The Knowledge Base feature enables users to upload documents that are automatically parsed, chunked, and vectorized for Retrieval-Augmented Generation (RAG). The system uses PostgreSQL with pgvector extension for vector storage and supports streaming responses via Server-Sent Events (SSE) for real-time AI interactions.
All vectorization operations are processed asynchronously using Redis Streams to handle large documents efficiently.

Supported Document Formats

PDF Documents

Adobe PDF files with text layer support

Word Documents

Microsoft Word (DOCX, DOC)

Text Files

Plain text (TXT) and Markdown (MD)

Max File Size

Up to 50MB per document

Upload and Vectorization Workflow

The knowledge base follows an asynchronous processing pipeline:
1. Upload Document

Users upload a document with optional metadata:
POST /api/knowledgebase/upload
Content-Type: multipart/form-data
Form Parameters:
  • file: Document file (required)
  • name: Custom name (optional, defaults to filename)
  • category: Classification tag (optional, e.g., “Java”, “System Design”)
Validation:
// KnowledgeBaseUploadService.java:38
private static final long MAX_FILE_SIZE = 50 * 1024 * 1024; // 50MB
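The size cap can be checked before the upload ever reaches storage. A minimal sketch; `UploadValidator` is a hypothetical helper, and the extension whitelist is an assumption mirroring the supported formats listed above (only the 50MB cap is documented):

```java
import java.util.Set;

// Hypothetical pre-upload check. Only the 50MB cap is documented;
// the extension whitelist mirrors the supported formats but is an assumption.
public class UploadValidator {
    static final long MAX_FILE_SIZE = 50L * 1024 * 1024; // 50MB

    static final Set<String> ALLOWED_EXTENSIONS = Set.of("pdf", "docx", "doc", "txt", "md");

    public static boolean isValid(String filename, long sizeBytes) {
        if (sizeBytes <= 0 || sizeBytes > MAX_FILE_SIZE) {
            return false; // empty or over the 50MB limit
        }
        int dot = filename.lastIndexOf('.');
        return dot >= 0
            && ALLOWED_EXTENSIONS.contains(filename.substring(dot + 1).toLowerCase());
    }
}
```

Rejecting oversized or unsupported files client-side saves a round trip, but the server-side check remains authoritative.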
2. Duplicate Detection

The system calculates a SHA-256 hash to prevent duplicate uploads:
// KnowledgeBaseEntity.java:21-23
@Column(nullable = false, unique = true, length = 64)
private String fileHash;
If a duplicate is detected, the existing knowledge base entry is returned immediately.
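The 64-character `fileHash` is simply the hex-encoded SHA-256 digest of the file bytes. A self-contained sketch (`FileHasher` is a hypothetical name, not the service's actual class):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Computes the hex-encoded SHA-256 digest stored in fileHash (64 chars).
public class FileHasher {
    public static String sha256Hex(byte[] fileBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(fileBytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in the JDK
        }
    }
}
```

Two uploads with identical bytes produce identical hashes, which the `unique = true` column constraint rejects at insert time.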
3. Content Extraction

Apache Tika parses the document and extracts text:
// KnowledgeBaseUploadService.java:68-71
String content = parseService.parseContent(file);
if (content == null || content.trim().isEmpty()) {
    // Message: "Unable to extract text content from the file"
    throw new BusinessException(ErrorCode.INTERNAL_ERROR, "无法从文件中提取文本内容");
}
4. File Storage

Original file is uploaded to S3-compatible storage:
// KnowledgeBaseUploadService.java:74-76
String fileKey = storageService.uploadKnowledgeBase(file);
String fileUrl = storageService.getFileUrl(fileKey);
5. Metadata Persistence

Knowledge base entity is saved with status PENDING:
// KnowledgeBaseEntity.java:64-67
@Enumerated(EnumType.STRING)
private VectorStatus vectorStatus = VectorStatus.PENDING;
6. Vectorization Task

A task is sent to Redis Stream for async processing:
// KnowledgeBaseUploadService.java:82
vectorizeStreamProducer.sendVectorizeTask(savedKb.getId(), content);
The API returns immediately with status PENDING.
7. Background Vectorization

A consumer worker:
  • Updates status to PROCESSING
  • Chunks the document using TokenTextSplitter
  • Generates embeddings for each chunk
  • Stores vectors in PostgreSQL (pgvector)
  • Updates status to COMPLETED or FAILED

Vectorization Status Flow

public enum VectorStatus {
    PENDING,      // Document uploaded, awaiting vectorization
    PROCESSING,   // Chunking and embedding in progress
    COMPLETED,    // Successfully vectorized and stored
    FAILED        // Vectorization failed (check vectorError field)
}
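The legal transitions implied by this flow can be captured in a small guard. This is a sketch only; `canTransition` is hypothetical, and the real service may update status directly without such a check:

```java
// Status transitions implied by the flow above. FAILED -> PENDING models
// manual re-vectorization; COMPLETED is terminal.
public class VectorStatusFlow {
    public enum VectorStatus { PENDING, PROCESSING, COMPLETED, FAILED }

    public static boolean canTransition(VectorStatus from, VectorStatus to) {
        return switch (from) {
            case PENDING -> to == VectorStatus.PROCESSING;
            case PROCESSING -> to == VectorStatus.COMPLETED || to == VectorStatus.FAILED;
            case COMPLETED -> false;
            case FAILED -> to == VectorStatus.PENDING;
        };
    }
}
```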

Document Chunking Strategy

Large documents are split into smaller chunks for effective embedding:

Chunking Method

TokenTextSplitter from Spring AI splits text based on token count rather than character count for accurate embedding.

Chunk Metadata

Each chunk stores:
  • Original document ID
  • Chunk index
  • Document metadata (name, category)
  • Embedding vector
Chunk count is tracked in KnowledgeBaseEntity.chunkCount for statistics and debugging.
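Putting the pieces together, each chunk pairs its text with the identifiers listed above. A sketch of the metadata shape; the `kb_id` key matches the pgvector metadata filter used during search, but the other key names and the `ChunkMetadata` helper are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Builds the per-chunk metadata described above. "kb_id" matches the
// metadata filter used by the vector search; other key names are assumed.
public class ChunkMetadata {
    public static List<Map<String, Object>> attachMetadata(
            long kbId, String name, String category, List<String> chunks) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            result.add(Map.of(
                "kb_id", kbId,
                "chunk_index", i,      // position of this chunk in the document
                "name", name,
                "category", category,
                "text", chunks.get(i)  // the chunk body that gets embedded
            ));
        }
        return result;
    }
}
```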

Category Management

Organize knowledge bases with categories:

List All Categories

GET /api/knowledgebase/categories
Response:
["Java", "System Design", "Databases", "Algorithms"]

Filter by Category

GET /api/knowledgebase/category/{category}

Update Category

PUT /api/knowledgebase/{id}/category
{
  "category": "Spring Framework"
}

RAG Query Flow

The system uses Retrieval-Augmented Generation to answer questions based on uploaded documents:
1. Query Submission

Users submit a question with selected knowledge base IDs:
POST /api/knowledgebase/query
{
  "knowledgeBaseIds": [1, 2, 3],
  "question": "How does Redis handle persistence?"
}
2. Knowledge Base Validation

System validates that all IDs exist and increments question counters:
// KnowledgeBaseQueryService.java:111
countService.updateQuestionCounts(knowledgeBaseIds);
3. Query Rewriting (Optional)

If enabled, the question is rewritten for better retrieval:
# application.yml
app:
  ai:
    rag:
      rewrite:
        enabled: true
User questions are often:
  • Too vague (“tell me about Redis”)
  • Contain typos or colloquialisms
  • Missing key technical terms
The AI rewrites the query to:
  • Add relevant technical keywords
  • Clarify ambiguous terms
  • Optimize for vector similarity search
Example:
Original:  "How to make Redis not lose data?"
Rewritten: "Redis persistence mechanisms: RDB snapshots and AOF append-only file"
// KnowledgeBaseQueryService.java:285-306 (abridged)
private String rewriteQuestion(String question) {
    Map<String, Object> variables = Map.of("question", question); // assumed template variable
    String rewritePrompt = rewritePromptTemplate.render(variables);
    String rewritten = chatClient.prompt()
        .user(rewritePrompt)
        .call()
        .content();
    // Fall back to the original question if the rewrite comes back empty
    String normalized = (rewritten == null || rewritten.isBlank())
        ? question : rewritten.trim();
    return normalized;
}
4. Dynamic Search Parameters

Search parameters adapt based on query length:
// KnowledgeBaseQueryService.java:274-283
private SearchParams resolveSearchParams(String question) {
    int compactLength = question.replaceAll("\\s+", "").length();
    if (compactLength <= shortQueryLength) {
        return new SearchParams(topkShort, minScoreShort);
    }
    if (compactLength <= 12) {
        return new SearchParams(topkMedium, minScoreDefault);
    }
    return new SearchParams(topkLong, minScoreDefault);
}

Short Query

≤4 characters
topK: 20
minScore: 0.18

Medium Query

5-12 characters
topK: 12
minScore: 0.28

Long Query

>12 characters
topK: 8
minScore: 0.28
5. Vector Similarity Search

The system performs vector search across selected knowledge bases:
// KnowledgeBaseQueryService.java:260-271
List<Document> docs = vectorService.similaritySearch(
    candidateQuery,
    knowledgeBaseIds,
    queryContext.searchParams().topK(),
    queryContext.searchParams().minScore()
);
Uses pgvector’s cosine distance operator (illustrative SQL; metadata->>'kb_id' yields text, so the values are quoted):
SELECT * FROM vector_store
WHERE metadata->>'kb_id' IN ('1', '2', '3')
ORDER BY embedding <=> query_embedding
LIMIT topK;
6. Effective Hit Validation

For short queries, the system validates that retrieved chunks actually contain the search term:
// KnowledgeBaseQueryService.java:313-333
private boolean hasEffectiveHit(String question, List<Document> docs) {
    String normalized = question.trim();
    if (!isShortTokenQuery(normalized)) {
        return true;
    }

    String loweredToken = normalized.toLowerCase();
    for (Document doc : docs) {
        String text = doc.getText();
        if (text != null && text.toLowerCase().contains(loweredToken)) {
            return true;
        }
    }
    return false;
}
This prevents the AI from generating vague “information not found” responses when vector similarity produces false positives.
7. Context Construction

Retrieved document chunks are merged:
// KnowledgeBaseQueryService.java:122-124
String context = relevantDocs.stream()
    .map(Document::getText)
    .collect(Collectors.joining("\n\n---\n\n"));
8. AI Response Generation

The context and question are sent to the AI model:
// KnowledgeBaseQueryService.java:134-138
String answer = chatClient.prompt()
    .system(systemPrompt)
    .user(userPrompt)
    .call()
    .content();
System Prompt: Instructs the AI to answer based only on the provided context.
User Prompt: Template with context and question variables.
9. Response Normalization

The answer is validated and normalized:
// KnowledgeBaseQueryService.java:343-352
private String normalizeAnswer(String answer) {
    if (answer == null || answer.isBlank()) {
        return NO_RESULT_RESPONSE;
    }
    String normalized = answer.trim();
    if (isNoResultLike(normalized)) {
        return NO_RESULT_RESPONSE;
    }
    return normalized;
}
If the AI indicates “no information found,” a standardized message is returned:
"抱歉,在选定的知识库中未检索到相关信息。请换一个更具体的关键词或补充上下文后再试。"
("Sorry, no relevant information was found in the selected knowledge bases. Please try a more specific keyword or add more context and retry.")
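The isNoResultLike check referenced above can be approximated with a marker-phrase scan. A hedged sketch; the actual phrase list in the service is unknown, and `AnswerFilter` is a hypothetical name:

```java
import java.util.List;

// Approximation of the isNoResultLike heuristic: treat answers that merely
// restate "nothing found" as no-result. The marker phrases are assumptions.
public class AnswerFilter {
    private static final List<String> NO_RESULT_MARKERS = List.of(
        "no relevant information",
        "not found in the knowledge base",
        "未检索到相关信息"  // "no relevant information retrieved"
    );

    public static boolean isNoResultLike(String answer) {
        String lowered = answer.toLowerCase();
        return NO_RESULT_MARKERS.stream().anyMatch(lowered::contains);
    }
}
```

Normalizing these variants into one canonical message keeps the client UI simple: it only needs to recognize a single no-result string.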

Streaming SSE Responses

For real-time, typewriter-style responses, use the streaming endpoint:
POST /api/knowledgebase/query/stream
Content-Type: application/json
Accept: text/event-stream
{
  "knowledgeBaseIds": [1, 2],
  "question": "Explain Redis persistence"
}

SSE Response Format

data: Redis

data:  provides

data:  two

data:  main

data:  persistence

data:  mechanisms

data: ...

data: [DONE]
// EventSource only supports GET requests, so a POST streaming endpoint
// must be consumed with fetch and a streaming reader instead:
const response = await fetch('/api/knowledgebase/query/stream', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Accept': 'text/event-stream'
  },
  body: JSON.stringify({
    knowledgeBaseIds: [1, 2],
    question: 'How does Redis work?'
  })
});

// Simplified parser: assumes each "data:" event arrives on its own line.
const reader = response.body.getReader();
const decoder = new TextDecoder();
try {
  let done = false;
  while (!done) {
    const chunk = await reader.read();
    done = chunk.done;
    if (!chunk.value) continue;
    for (const line of decoder.decode(chunk.value, { stream: true }).split('\n')) {
      if (!line.startsWith('data:')) continue;
      const data = line.slice(5).trimStart();
      if (data === '[DONE]') {
        done = true;
      } else {
        appendToChat(data);
      }
    }
  }
} catch (error) {
  console.error('SSE error:', error);
}

Listing Knowledge Bases

Retrieve all uploaded knowledge bases:
GET /api/knowledgebase/list?sortBy=uploadedAt&vectorStatus=COMPLETED
Query Parameters:
  • sortBy: Sort field (uploadedAt, name, questionCount)
  • vectorStatus: Filter by status (PENDING, PROCESSING, COMPLETED, FAILED)
Response:
[
  {
    "id": 1,
    "name": "Redis in Action",
    "category": "Databases",
    "originalFilename": "redis_guide.pdf",
    "fileSize": 2048576,
    "uploadedAt": "2026-03-10T10:00:00",
    "vectorStatus": "COMPLETED",
    "chunkCount": 42,
    "questionCount": 15,
    "accessCount": 30
  }
]

Searching Knowledge Bases

Search by filename or content:
GET /api/knowledgebase/search?keyword=redis
Searches across:
  • Knowledge base name
  • Original filename
  • Category tags

Downloading Documents

Retrieve the original uploaded file:
GET /api/knowledgebase/{id}/download
Response Headers:
Content-Disposition: attachment; filename="redis_guide.pdf"
Content-Type: application/pdf

Statistics Dashboard

Get aggregated statistics:
GET /api/knowledgebase/stats
Response:
{
  "totalKnowledgeBases": 25,
  "totalQuestions": 342,
  "totalChunks": 1580,
  "statusBreakdown": {
    "COMPLETED": 22,
    "PENDING": 2,
    "FAILED": 1
  },
  "topCategories": [
    {"category": "Java", "count": 8},
    {"category": "System Design", "count": 6}
  ]
}

Manual Re-vectorization

If vectorization fails, users can retry:
POST /api/knowledgebase/{id}/revectorize
1. Reset Status

Update vectorStatus to PENDING and clear error message
2. Re-download File

Fetch original file from storage and re-parse
3. Re-queue Task

Send new vectorization task to Redis Stream
This endpoint is rate-limited to 2 requests per IP to prevent abuse.

Deleting Knowledge Bases

Remove a knowledge base and all associated vectors:
DELETE /api/knowledgebase/{id}
This operation:
  • Deletes the entity from the database
  • Removes all vector embeddings from pgvector
  • Does not delete the original file from storage (for audit purposes)

Rate Limiting

Protection mechanisms:
// KnowledgeBaseController.java
@RateLimit(dimensions = {RateLimit.Dimension.GLOBAL, RateLimit.Dimension.IP}, count = 3)
public Result<Map<String, Object>> uploadKnowledgeBase(...) { ... }

@RateLimit(dimensions = {RateLimit.Dimension.GLOBAL, RateLimit.Dimension.IP}, count = 10)
public Result<QueryResponse> queryKnowledgeBase(...) { ... }

@RateLimit(dimensions = {RateLimit.Dimension.GLOBAL, RateLimit.Dimension.IP}, count = 5)
public Flux<String> queryKnowledgeBaseStream(...) { ... }

Upload

3 uploads per window

Query

10 queries per window

Streaming

5 streams per window

Error Handling

Vectorization Failed

Status: FAILED
Common Causes:
  • Document too large for the embedding model
  • Invalid UTF-8 encoding
  • AI API rate limit or timeout
  • Database connection failure
Solution: Check the vectorError field for details and use manual re-vectorization.

Query Returns No Results

Response: Standard “no information found” message
Causes:
  • Question topic not covered in uploaded documents
  • Query rewriting produced poor keywords
  • Vector similarity threshold too strict
Solution:
  • Rephrase the question with more specific terms
  • Adjust minScore parameters (requires config change)
  • Upload more relevant documents

Text Extraction Failed

Error: 无法从文件中提取文本内容 (“Unable to extract text content from the file”)
Causes:
  • Scanned PDF without OCR
  • Corrupted or encrypted file
  • Unsupported document structure
Solution: Convert to a text-based format or perform OCR preprocessing.

Best Practices

Optimize Document Structure

Use clear headings and sections. Well-structured documents chunk better and retrieve more accurately.

Use Descriptive Names

Name knowledge bases descriptively (e.g., “Spring Boot 3.0 Official Guide” vs. “doc.pdf”).

Organize with Categories

Assign categories consistently to enable filtered searches and multi-KB queries.

Monitor Chunk Count

If chunkCount is very low (< 5), the document may be too short or poorly parsed.

Poll Vectorization Status

Implement polling (every 3-5 seconds) after upload:
let kb = await getKnowledgeBase(id);
while (kb.vectorStatus === 'PENDING' || kb.vectorStatus === 'PROCESSING') {
  await sleep(3000);
  kb = await getKnowledgeBase(id);
}
// kb.vectorStatus is now COMPLETED or FAILED

Handle Streaming Errors

Always implement onerror handlers for SSE connections and provide fallback UI.

Architecture Diagram

(Diagram not included in this text export. The pipeline it depicts: upload → Redis Stream → vectorization worker → pgvector; query → similarity search → AI response.)

Configuration Reference

# application.yml
spring:
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true  # Auto-create vector_store table
        index-type: HNSW         # Vector index type
        distance-type: COSINE_DISTANCE
        dimensions: 1536         # Embedding dimension

app:
  ai:
    rag:
      rewrite:
        enabled: true
      search:
        short-query-length: 4
        topk-short: 20
        topk-medium: 12
        topk-long: 8
        min-score-short: 0.18
        min-score-default: 0.28