
Overview

When documents are fed to Vespa, they pass through an indexing pipeline that transforms and processes them before storage. This page covers three topics:
  • Indexing Language - Declarative expressions for field transformations
  • Document Processors - Custom Java components for complex processing
  • Indexing Pipeline - The complete flow from ingestion to storage

Indexing Language

The indexing language is a domain-specific language for transforming document fields during indexing.

Basic Syntax

Define indexing statements in your schema:
schema music {
    document music {
        field title type string {
            indexing: summary | index
        }
        
        field artist type string {
            indexing: summary | attribute
        }
        
        field year type int {
            indexing: summary | attribute
        }
    }
}

Indexing Expressions

The indexing language supports various expressions for field manipulation:

Input Expression

Read a field value from the document:
field my_field type string {
    indexing: input title | lowercase | index
}
Reference: indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/expressions/InputExpression.java:19

Output Expressions

Specify where to store the processed value:
index: store in the memory index for full-text search:
field title type string {
    indexing: input title | lowercase | index
}
attribute: store in an in-memory attribute for fast access, filtering, and sorting:
field year type int {
    indexing: attribute
}
summary: store in the document summary for retrieval:
field description type string {
    indexing: summary
}
Reference: indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/expressions/IndexExpression.java:7

Transformation Expressions

The indexing language provides 87+ built-in expressions for data transformation:
field normalized_title type string {
    indexing: input title | lowercase | trim | normalize | index
}

field tokens type array<string> {
    indexing: input text | tokenize | index
}

Common Expressions

Expression   Description                 Example
input        Read field value            input title
lowercase    Convert to lowercase        lowercase
tokenize     Split into tokens           tokenize
normalize    Unicode normalization       normalize
trim         Remove whitespace           trim
index        Store in index              index
attribute    Store as attribute          attribute
summary      Include in summary          summary
embed        Generate embeddings         embed embedder_name
flatten      Flatten nested structures   flatten
for_each     Process array elements      for_each { ... }

Control Flow

Choice Expression

Conditional processing based on field presence:
field display_title type string {
    indexing: (input title || input name || "Untitled") | summary
}
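
The fallback behaviour of the choice expression can be pictured in plain Java. The sketch below is only a model of the semantics described above, not Vespa code: `ChoiceSketch` and `firstPresent` are hypothetical names, and a field that is missing from the document is modeled as `null`.

```java
import java.util.Arrays;
import java.util.Objects;

final class ChoiceSketch {

    /** Returns the first non-null candidate, or the fallback when every candidate is null. */
    @SafeVarargs
    static <T> T firstPresent(T fallback, T... candidates) {
        return Arrays.stream(candidates)
                .filter(Objects::nonNull)
                .findFirst()
                .orElse(fallback);
    }

    public static void main(String[] args) {
        String title = null;         // field missing from the document
        String name = "Blue Train";  // field present
        // Mirrors: (input title || input name || "Untitled")
        System.out.println(firstPresent("Untitled", title, name));
    }
}
```

The alternatives are tried left to right, so the literal at the end only applies when no earlier input produced a value.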

ForEach Expression

Process array elements:
field normalized_tags type array<string> {
    indexing: input tags | for_each { lowercase | trim } | index
}
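
As a rough mental model, `for_each` applies the inner expression to every element and keeps the results in order. The plain-Java sketch below illustrates that idea with streams. `ForEachSketch` and its methods are hypothetical names, and Java's `toLowerCase` and `trim` only approximate the Vespa expressions of the same name.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class ForEachSketch {

    /** Applies the given step to each element and returns the results in the same order. */
    static List<String> forEach(List<String> values, UnaryOperator<String> step) {
        return values.stream().map(step).collect(Collectors.toList());
    }

    /** Models: input tags | for_each { lowercase | trim } */
    static List<String> normalizeTags(List<String> tags) {
        return forEach(tags, tag -> tag.toLowerCase(Locale.ROOT).trim());
    }

    public static void main(String[] args) {
        System.out.println(normalizeTags(List.of(" Jazz", "BeBop ")));
    }
}
```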

Script Expressions

Chain multiple operations:
field processed_text type string {
    indexing: input raw_text | 
              lowercase | 
              trim | 
              tokenize | 
              normalize | 
              index | 
              summary
}
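
The pipe operator passes the output of each expression to the next one. A minimal plain-Java sketch of that left-to-right chaining follows. It is an illustration only: `ScriptSketch` and `run` are hypothetical names, and the two steps stand in for `lowercase` and `trim`.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class ScriptSketch {

    /** Applies each step to the output of the previous one, left to right. */
    static String run(String input, List<UnaryOperator<String>> steps) {
        String value = input;
        for (UnaryOperator<String> step : steps) {
            value = step.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = List.of(
                s -> s.toLowerCase(Locale.ROOT), // stands in for: lowercase
                String::trim);                   // stands in for: trim
        System.out.println(run("  Kind Of Blue ", steps));
    }
}
```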

Embedding Generation

Generate embeddings during indexing:
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    
    field embedding type tensor<float>(x[384]) {
        indexing: input text | embed embedder | attribute
    }
}
The embed expression requires configuring an embedder in your services.xml.

Document Processors

Document processors are Java components that perform custom processing on documents before they’re indexed.

Creating a Document Processor

Extend DocumentProcessor and implement the process method:
import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.StringFieldValue;

public class MusicEnricherProcessor extends DocumentProcessor {

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                DocumentPut put = (DocumentPut) op;
                Document doc = put.getDocument();

                // Enrich document
                enrichDocument(doc);
            }
        }
        return Progress.DONE;
    }

    private void enrichDocument(Document doc) {
        // getFieldValue returns a FieldValue wrapper, not a String
        FieldValue artist = doc.getFieldValue("artist");
        if (artist instanceof StringFieldValue) {
            // Add normalized artist field
            String normalized = ((StringFieldValue) artist).getString().toLowerCase().trim();
            doc.setFieldValue("artist_normalized", new StringFieldValue(normalized));
        }
    }
}
Reference: docproc/src/main/java/com/yahoo/docproc/DocumentProcessor.java:45

Processing Return Values

Document processors return a Progress value indicating the outcome:
// Processing completed successfully
return Progress.DONE;
Reference: docproc/src/main/java/com/yahoo/docproc/DocumentProcessor.java:108-150

Accessing Document Operations

The Processing object contains all document operations:
import com.yahoo.docproc.Processing;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.DocumentUpdate;
import com.yahoo.document.DocumentRemove;

@Override
public Progress process(Processing processing) {
    for (DocumentOperation op : processing.getDocumentOperations()) {
        if (op instanceof DocumentPut) {
            DocumentPut put = (DocumentPut) op;
            processPut(put.getDocument());
        } else if (op instanceof DocumentUpdate) {
            DocumentUpdate update = (DocumentUpdate) op;
            processUpdate(update);
        } else if (op instanceof DocumentRemove) {
            DocumentRemove remove = (DocumentRemove) op;
            processRemove(remove.getId());
        }
    }
    return Progress.DONE;
}
Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:204-207

Context Variables

Store and retrieve context data across processors:
@Override
public Progress process(Processing processing) {
    // Set context variable
    processing.setVariable("start_time", System.currentTimeMillis());
    
    // Get context variable
    Long startTime = (Long) processing.getVariable("start_time");
    
    // Check if variable exists
    if (processing.hasVariable("user_id")) {
        String userId = (String) processing.getVariable("user_id");
    }
    
    return Progress.DONE;
}
Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:140-176
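
To show how a variable set by one processor becomes visible to a later one, here is a small plain-Java model. It does not use the Vespa `Processing` class. `ContextSketch.Context` is a hypothetical stand-in backed by a map, and the two static methods play the role of two processors in the same chain.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextSketch {

    /** Hypothetical stand-in for the per-request context; not the Vespa Processing class. */
    static final class Context {
        private final Map<String, Object> variables = new HashMap<>();

        void setVariable(String name, Object value) { variables.put(name, value); }
        Object getVariable(String name) { return variables.get(name); }
        boolean hasVariable(String name) { return variables.containsKey(name); }
    }

    /** First processor: records when processing started. */
    static void timestamp(Context context, long nowMillis) {
        context.setVariable("start_time", nowMillis);
    }

    /** Later processor: reads the value set earlier, or returns -1 if it is absent. */
    static long elapsedMillis(Context context, long nowMillis) {
        if (!context.hasVariable("start_time")) {
            return -1;
        }
        return nowMillis - ((Long) context.getVariable("start_time"));
    }
}
```

Checking `hasVariable` before casting keeps the later processor safe when it is deployed in a chain where the earlier one is absent.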

Asynchronous Processing

For operations requiring external calls:
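import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.StringFieldValue;
import java.util.concurrent.CompletableFuture;

public class AsyncEnricherProcessor extends DocumentProcessor {

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (op instanceof DocumentPut) {
                Document doc = ((DocumentPut) op).getDocument();
                String startedKey = "started_" + doc.getId();
                String doneKey = "enriched_" + doc.getId();

                // Check if already processed
                if (processing.hasVariable(doneKey)) {
                    continue;
                }

                // Start async enrichment only once per document
                if (!processing.hasVariable(startedKey)) {
                    processing.setVariable(startedKey, true);
                    FieldValue artist = doc.getFieldValue("artist");
                    String artistName = artist instanceof StringFieldValue
                            ? ((StringFieldValue) artist).getString() : "";
                    // fetchArtistInfo and ArtistInfo are application-defined (not shown)
                    CompletableFuture<ArtistInfo> future = fetchArtistInfo(artistName);

                    future.whenComplete((info, error) -> {
                        if (error == null) {
                            doc.setFieldValue("genre", new StringFieldValue(info.getGenre()));
                        }
                        // Mark as finished on success and on failure so the loop ends
                        processing.setVariable(doneKey, true);
                    });
                }

                // Return LATER to be called again
                return Progress.LATER;
            }
        }
        return Progress.DONE;
    }
}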
When returning Progress.LATER, the processor will be called again. Ensure you track state to avoid infinite loops.
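
The re-invocation pattern can be tried out without Vespa. The following plain-Java sketch is a simplified model under stated assumptions: `LaterSketch`, its `Progress` enum, and `runUntilDone` are hypothetical, the "external call" is faked by counting invocations, and the real framework's scheduling is not modeled. It shows why state must live outside the method's local variables and why a cap on retries protects against a processor that never finishes.

```java
import java.util.HashMap;
import java.util.Map;

final class LaterSketch {

    enum Progress { DONE, LATER }

    /** One invocation of a processor that needs three calls before it is finished. */
    static Progress process(Map<String, Object> variables) {
        int attempts = ((Integer) variables.getOrDefault("attempts", 0)) + 1;
        variables.put("attempts", attempts);
        // Pretend the external call completes on the third invocation
        return attempts < 3 ? Progress.LATER : Progress.DONE;
    }

    /** Re-invokes the processor while it returns LATER; gives up after maxCalls. */
    static int runUntilDone(Map<String, Object> variables, int maxCalls) {
        for (int call = 1; call <= maxCalls; call++) {
            if (process(variables) == Progress.DONE) {
                return call;
            }
        }
        throw new IllegalStateException("processor never finished");
    }

    public static void main(String[] args) {
        System.out.println("calls needed: " + runUntilDone(new HashMap<>(), 10));
    }
}
```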

Configuring Document Processors

Define processors in services.xml:
<services version="1.0">
    <container version="1.0" id="default">
        <document-processing>
            <chain id="default" inherits="indexing">
                <documentprocessor id="com.example.MusicEnricherProcessor"/>
                <documentprocessor id="com.example.ValidationProcessor"/>
            </chain>
        </document-processing>
        
        <nodes>
            <node hostalias="node1"/>
        </nodes>
    </container>
</services>

Multiple Processing Chains

Create different chains for different document types:
<document-processing>
    <chain id="music-chain" inherits="indexing">
        <documentprocessor id="com.example.MusicEnricherProcessor"/>
    </chain>
    
    <chain id="user-chain" inherits="indexing">
        <documentprocessor id="com.example.UserValidationProcessor"/>
    </chain>
</document-processing>

Indexing Pipeline

The complete indexing flow:
1. Document Reception
   Vespa receives the document via feed client or HTTP API.
2. Document Processing
   Document processors in the chain execute sequentially:
   Document → Processor 1 → Processor 2 → ... → Processor N
3. Indexing Language Execution
   Field-level transformations defined in the schema are applied.
4. Storage
   Processed document is stored:
  • Fields marked index go to memory index
  • Fields marked attribute go to attribute storage
  • Fields marked summary go to document summary
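
The sequential execution of a processor chain described above can be modeled in a few lines of plain Java. This is a sketch, not the Vespa implementation: `ChainSketch`, its `Processor` interface, and the two example processors are hypothetical, a document is reduced to a map of strings, and stopping at the first non-`DONE` result is an assumption made for the illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ChainSketch {

    enum Progress { DONE, FAILED }

    /** Hypothetical processor interface operating on a document modeled as a map. */
    interface Processor {
        Progress process(Map<String, String> document);
    }

    /** Runs processors in order and stops at the first one that does not return DONE. */
    static Progress runChain(List<Processor> chain, Map<String, String> document) {
        for (Processor processor : chain) {
            Progress progress = processor.process(document);
            if (progress != Progress.DONE) {
                return progress; // later processors are skipped
            }
        }
        return Progress.DONE;
    }

    /** Example processor 1: adds a normalized copy of the artist field. */
    static final Processor NORMALIZE_ARTIST = document -> {
        String artist = document.get("artist");
        if (artist != null) {
            document.put("artist_normalized", artist.toLowerCase(Locale.ROOT).trim());
        }
        return Progress.DONE;
    };

    /** Example processor 2: rejects documents without a title. */
    static final Processor REQUIRE_TITLE = document ->
            document.containsKey("title") ? Progress.DONE : Progress.FAILED;
}
```

Because each processor sees the changes made by the ones before it, the order of the processors in the chain matters.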
Error Handling

Handle errors in document processors:
@Override
public Progress process(Processing processing) {
    try {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            validateOperation(op);
        }
        return Progress.DONE;
    } catch (ValidationException e) {
        log.warning("Validation failed: " + e.getMessage());
        return Progress.FAILED.withReason(e.getMessage());
    } catch (Exception e) {
        log.severe("Unexpected error: " + e.getMessage());
        return Progress.PERMANENT_FAILURE;
    }
}
    

Timeouts

Monitor and enforce timeouts:
import java.time.Duration;

@Override
public Progress process(Processing processing) {
    Duration timeLeft = processing.timeLeft();

    if (timeLeft.toMillis() < 1000) {
        log.warning("Processing timeout approaching");
        return Progress.TIMEOUT;
    }

    // Process with remaining time
    return Progress.DONE;
}

Reference: docproc/src/main/java/com/yahoo/docproc/Processing.java:232-237
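
The budget check above uses a fixed one-second threshold. One possible refinement is to compare the remaining time with an estimate of what the next step costs. The plain-Java sketch below uses only `java.time`. `DeadlineSketch`, `timeLeft`, and `shouldStart` are hypothetical helpers, not part of the Vespa API.

```java
import java.time.Duration;
import java.time.Instant;

final class DeadlineSketch {

    /** Time remaining until the deadline, never negative. */
    static Duration timeLeft(Instant now, Instant deadline) {
        Duration left = Duration.between(now, deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** True when the remaining time is strictly larger than the estimated cost of the next step. */
    static boolean shouldStart(Duration timeLeft, Duration estimatedCost) {
        return timeLeft.compareTo(estimatedCost) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration left = timeLeft(now, now.plusSeconds(5));
        System.out.println("start expensive step: " + shouldStart(left, Duration.ofSeconds(1)));
    }
}
```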

Best Practices

1. Keep Indexing Expressions Simple

Use indexing language for simple transformations. Move complex logic to document processors:
# Good: Simple transformation
field title type string {
    indexing: input title | lowercase | index
}

# Complex logic → Use document processor instead

2. Make Processors Stateless

Document processors must be thread-safe. Avoid mutable instance variables:
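public class SafeProcessor extends DocumentProcessor {
    // Good: Immutable configuration, set once at construction
    private final String apiEndpoint;

    // Bad: Mutable state shared between threads
    // private int counter;

    public SafeProcessor() {
        // Placeholder value for illustration
        this.apiEndpoint = "https://example.com/api";
    }

    @Override
    public Progress process(Processing processing) {
        // Use local variables for state
        int localCounter = 0;
        return Progress.DONE;
    }
}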
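
When a processor genuinely needs shared state, such as a count of processed documents, a thread-safe type avoids the race that a plain `int` field would introduce. The sketch below is plain Java and independent of Vespa. `CounterSketch` is a hypothetical class that uses `java.util.concurrent.atomic.LongAdder` and exercises it from several threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class CounterSketch {

    // Thread-safe: many threads may call record() at the same time
    private final LongAdder processed = new LongAdder();

    void record() { processed.increment(); }

    long total() { return processed.sum(); }

    /** Calls record() from several threads and returns the final count. */
    static long countConcurrently(int threads, int callsPerThread) {
        CounterSketch counter = new CounterSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    counter.record();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter.total();
    }
}
```

With a plain `int` field and `counter++`, the same test could lose updates under contention. `LongAdder` keeps the total exact.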
    
3. Handle Async Operations Properly

Track async operation state to avoid reprocessing:
if (!processing.hasVariable("async_started")) {
    // Start async operation
    startAsyncOperation();
    processing.setVariable("async_started", true);
    return Progress.LATER;
}

4. Use Appropriate Progress Codes

Return the correct progress code:
  • DONE - Processing complete
  • LATER - Need more time (async operation)
  • FAILED - This document failed (temporary)
  • PERMANENT_FAILURE - Critical error (disables processor)