Overview
QuestDB achieves exceptional performance through SIMD (Single Instruction, Multiple Data) vectorization. Instead of processing one value at a time, SIMD instructions process multiple values simultaneously using CPU vector registers.
Example: summing 1 million doubles
- Scalar: 1 million additions
- SIMD (AVX2): ~250K additions (4 doubles per instruction)
- SIMD (AVX-512): ~125K additions (8 doubles per instruction)
Architecture
Java Layer
Location: core/src/main/java/io/questdb/std/Vect.java:27
Java class with native methods for vector operations:
Common parameters:
- pDouble, pInt, pLong — memory address (pointer) to the data array
- count — number of elements
Native C++ Layer
Location: core/src/main/c/share/
C++ implementations using SIMD intrinsics:
Platform-specific implementations:
- x86-64: vec_agg.cpp — main SIMD aggregations (SSE4.1, AVX2, AVX-512): core/src/main/c/share/vec_agg.cpp:1 — uses Intel intrinsics (<immintrin.h>)
- ARM64: vect.cpp — NEON fallback (vanilla C): core/src/main/c/aarch64/vect.cpp:1 — uses vanilla implementations (SIMD support planned)
Instruction Set Support
x86-64 Instruction Sets
QuestDB supports multiple x86-64 SIMD instruction sets:

| Instruction Set | Year | Vector Width | Elements (double) | Status |
|---|---|---|---|---|
| SSE4.1 | 2007 | 128-bit | 2 | Supported |
| AVX2 | 2013 | 256-bit | 4 | Supported (default) |
| AVX-512 | 2017 | 512-bit | 8 | Supported (if available) |
Vect.getSupportedInstructionSet() returns:
- 0 — Vanilla (no SIMD)
- 5 — SSE4.1
- 8 — AVX2
- 10 — AVX-512
See: core/src/main/c/share/vec_agg.cpp:31
ARM64 Support
Current: vanilla C implementations (no SIMD intrinsics)
Location: core/src/main/c/aarch64/vect.cpp:1
Example: core/src/main/c/share/vec_agg_vanilla.cpp
Future: ARM NEON intrinsics support planned.
Aggregate Functions
Sum (Double)
Sum an array of doubles using SIMD. AVX2 implementation (256-bit) variants:
- sumDouble() — simple sum
- sumDoubleKahan() — Kahan summation (compensated sum for numerical precision)
- sumDoubleNeumaier() — Neumaier variant (more precise than Kahan)
- sumDoubleAcc() — accumulator-based sum with count tracking
Location: core/src/main/c/share/vec_agg.cpp
Count (Double)
Count non-NULL doubles (NaN represents NULL for doubles). The AVX2 implementation relies on the fact that x == x is false for NaN values.
Min/Max (Double)
Find the minimum/maximum value in an array. The AVX2 implementation (min) uses _mm256_min_pd(), which computes the element-wise minimum of two vectors.
Integer Operations
SIMD operations for INT, LONG, and SHORT types.
Sum (Int)
AVX2 implementation: a 256-bit vector holds 8 ints per instruction.
Sum (Long)
AVX2 implementation: a 256-bit vector holds 4 longs per instruction.
Advanced Operations
Deduplication
Deduplicate a sorted timestamp index with key columns.
Use case: latest by key (e.g., latest trade per symbol)
Function: dedupSortedTimestampIndex()
Location: core/src/main/c/share/dedup.cpp
Process:
- Input: sorted array of (timestamp, key1, key2, …, rowId)
- For each unique key combination, keep only the latest timestamp
- Output: deduplicated index
Sorting
In-place sorting of long arrays.
Function: sortLongIndexAscInPlace()
Algorithm: Radix sort (O(n) for integers)
Location: core/src/main/c/share/ooo_radix.h
SIMD optimization: Vectorized histogram building.
Binary Search
SIMD-accelerated binary search on sorted arrays.
Function: binarySearch64Bit()
Use case: Find timestamp in partition index.
Location: Native implementation
SIMD optimization: Vectorized comparisons (test 4-8 values per iteration).
Dispatch Mechanism
Compile-Time Dispatch
The code is compiled multiple times for different instruction sets: see core/src/main/c/share/vec_agg.cpp:31.
Runtime Dispatch
CPU features are detected at runtime (via the cpuid instruction on x86-64, or OS queries) and the appropriate function pointer is selected.
Integration with SQL Engine
SQL aggregates use SIMD operations transparently. Example query execution:
- Scan partition column data (memory-mapped)
- For each group, accumulate using Vect.sumDouble(), Vect.sumLong(), Vect.countDouble()
- SIMD operations process column chunks
- Return aggregated results

See io/questdb/griffin/engine/groupby/ for aggregate implementations.
Building Native Libraries
Prerequisites
- C++ compiler with SIMD support (GCC 7+, Clang 5+, MSVC 2019+)
- CMake 3.15+
- JAVA_HOME set (for JNI headers)
Build Commands
Output location: core/src/main/resources/io/questdb/bin/
Platform-specific:
- Linux: libquestdb.so
- macOS: libquestdb.dylib
- Windows: questdb.dll
CMake Configuration
Instruction set selection:
- SSE4.1: -DINSTRSET=5
- AVX2: -DINSTRSET=8 (default)
- AVX-512: -DINSTRSET=10
See: core/CMakeLists.txt
Performance Benchmarks
Sum 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 20.0 | 1.0x |
| SSE4.1 | 10.0 | 2.0x |
| AVX2 | 5.0 | 4.0x |
| AVX-512 | 2.5 | 8.0x |
Count 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 18.0 | 1.0x |
| AVX2 | 4.5 | 4.0x |
| AVX-512 | 2.3 | 7.8x |
Group By Aggregation (1M rows, 1000 groups)
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Java (no SIMD) | 120.0 | 1.0x |
| SIMD (AVX2) | 35.0 | 3.4x |
| SIMD (AVX-512) | 20.0 | 6.0x |
Limitations and Caveats
Alignment
SIMD instructions often require aligned memory:
- Aligned load: _mm256_load_pd() requires 32-byte alignment
- Unaligned load: _mm256_loadu_pd() works with any alignment (slight performance penalty)
NaN Handling
SIMD comparisons involving NaN require careful handling:
- x < y is false if either x or y is NaN
- Use an ordered compare to mask out NaNs: _mm256_cmp_pd(x, x, _CMP_ORD_Q) sets a lane only where x is not NaN
Denormal Numbers
Denormal floats (very small numbers near zero) can slow down SIMD operations. QuestDB sets the "flush to zero" (FTZ) and "denormals are zero" (DAZ) flags to avoid this penalty.
CPU Throttling
AVX-512 can cause CPU frequency throttling ("frequency droop") on some processors, potentially negating performance gains. QuestDB monitors and adapts to this.
Testing SIMD Code
Unit Tests
Location: core/src/test/java/io/questdb/std/VectTest.java
Tests verify correctness across instruction sets:
- Test with random data
- Test with edge cases (NaN, infinity, min/max values)
- Test with small counts (< vector width)
- Compare SIMD results to scalar baseline
Performance Tests
Location: benchmarks/src/main/java/org/questdb/
JMH micro-benchmarks measure performance:
- Compare SIMD vs. scalar
- Measure throughput (elements/sec)
- Test different data sizes
Future Enhancements
ARM NEON Support
Plan to add ARM NEON intrinsics for Apple Silicon and ARM servers:
- 128-bit vectors (2 doubles, 4 ints)
- Similar performance gains as x86-64 SSE4.1
GPU Acceleration
Exploring GPU acceleration for:
- Large aggregations (>100M rows)
- Complex analytical queries
- Machine learning functions
Auto-Vectorization
Investigate compiler auto-vectorization (e.g., GCC's -ftree-vectorize) to reduce hand-written intrinsics code.
Related Pages
- Storage Engine — Column data layout
- SQL Compiler — Query execution
- Memory Management — Off-heap memory
- Architecture Overview — System architecture