Why Hybrid Search?
Neither keyword search nor semantic search is perfect alone:

| Method | Strengths | Weaknesses |
|---|---|---|
| BM25 | Fast, exact matches, works for rare terms | Misses synonyms, semantic similarity |
| Semantic | Understands meaning, finds related concepts | Slower, may miss exact matches |
| Hybrid | Best of both — fast keyword + semantic understanding | ✅ |
Example: Searching for “authentication middleware” should find both:
- Files containing “auth” (keyword match)
- Files with similar concepts like “validateUser”, “checkToken” (semantic match)
Architecture
BM25 Search
BM25 (Best Match 25) is a probabilistic ranking algorithm for keyword-based search.
Implementation
GitNexus uses KuzuDB’s built-in FTS (Full-Text Search) indexes.
bm25-index.ts:60
Always fresh: KuzuDB FTS reads from the database on every query — no stale cached indexes.
BM25 Scoring
BM25 ranks documents using term frequency (TF) and inverse document frequency (IDF):
- High scores: Documents in which rare query terms appear frequently
- Low scores: Documents in which only common query terms appear, and appear rarely
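To make the TF/IDF intuition concrete, here is a minimal sketch of the textbook Okapi BM25 formula. It is not KuzuDB's FTS implementation: the function names, the naive tokenizer, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```typescript
// Minimal textbook BM25 scorer (illustrative only; not KuzuDB's FTS code).
// K1 and B are commonly used defaults, assumed here.
const K1 = 1.2;
const B = 0.75;

// Naive tokenizer: lowercase and split on non-alphanumeric characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Scores every document in `docs` against `query`; returns one score per document.
function bm25Scores(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(n, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      // Term frequency: occurrences of the term in this document.
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Document frequency: number of documents containing the term.
      const df = tokenized.filter((d) => d.includes(term)).length;
      // Rare terms (small df) receive a larger IDF weight.
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      // Saturating TF with document-length normalization.
      const norm = tf + K1 * (1 - B + (B * doc.length) / avgLen);
      score += idf * ((tf * (K1 + 1)) / norm);
    }
    return score;
  });
}
```

A document that repeats a rare query term scores higher than one that mentions a more common term once, and a document with no query terms scores 0.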
Semantic Search
Semantic search uses embedding vectors to find code with similar meaning, even if keywords don’t match exactly.
Embedding Model
GitNexus uses snowflake-arctic-embed-xs by default:
- 22M parameters
- 384 dimensions
- ~90MB model size
- GPU acceleration via DirectML (Windows) or CUDA (Linux)
embedder.ts:113
How embeddings work
Each symbol (function, class, method) is converted to a 384-dimensional vector.
Similar code produces similar vectors (measured by cosine similarity).
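As a rough illustration of the similarity measure, the sketch below computes cosine similarity between two vectors. The function name is hypothetical and the code is independent of how GitNexus or KuzuDB compute it internally. Real embeddings would have 384 dimensions, but any equal-length vectors work.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vectors pointing in the same direction score close to 1 regardless of magnitude, and orthogonal vectors score 0.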
Vector Index
Embeddings are stored in KuzuDB as vector properties.
Embedding generation is optional: Run gitnexus analyze --skip-embeddings to index without embeddings (faster, BM25-only search).
Reciprocal Rank Fusion (RRF)
RRF merges rankings from multiple sources without needing to normalize scores.
Algorithm
For each result at rank r in a result set, compute:
RRF score = 1 / (k + r)
where k = 60 (the commonly used constant). If a document appears in both BM25 and semantic results, sum its RRF scores.
hybrid-search.ts:46
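The fusion step can be sketched as follows: each ranked list contributes 1 / (k + rank) for every ID it contains, and contributions for the same ID are summed. This is a generic RRF sketch, not the code in hybrid-search.ts. The names are hypothetical, and ranks are assumed to be 1-based, as in the standard formulation.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists (best match first).
// Assumption: ranks are 1-based. Each ID should appear at most once per list.
const RRF_K = 60;

interface FusedResult {
  id: string;
  score: number;
}

function reciprocalRankFusion(rankings: string[][], k: number = RRF_K): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // convert 0-based array index to 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}
```

For instance, fusing the hypothetical lists ["a.ts", "b.ts", "c.ts"] and ["b.ts", "d.ts", "a.ts"] places b.ts and a.ts ahead of the others, because both appear in the two lists and therefore accumulate two contributions each.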
Why RRF?
No score normalization needed
BM25 scores are unbounded (0 to ∞), while cosine similarity is bounded (-1 to 1). RRF uses rank position instead of raw scores, avoiding normalization issues.
Robust to outliers
A single high BM25 score won’t dominate the results. Rank position is more stable.
Simple and effective
RRF is a one-line formula with a single parameter (k = 60). It’s used in production by Elasticsearch, Pinecone, and others.
Process-Grouped Search
GitNexus doesn’t just return a flat list of files. Results are grouped by process (execution flow) to provide architectural context.
Example Output
Grouping Logic
- Run hybrid search to get relevant symbols
- Find processes that contain those symbols (via STEP_IN_PROCESS edges)
- Rank processes by relevance:
  - Sum of RRF scores for symbols in the process
  - Normalized by process step count
- Group results by process
Process-grouped search helps agents understand how features work, not just where they’re defined.
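The ranking step in the grouping logic can be sketched roughly as below. All type, field, and function names are hypothetical. "Normalized by process step count" is interpreted here as dividing the summed RRF score by the number of steps, which is an assumption about the actual implementation.

```typescript
// Illustrative process ranking: sum of matched symbols' RRF scores,
// divided by the process's step count. All names here are hypothetical.
interface SymbolHit {
  symbolId: string;
  rrfScore: number;
}

interface ProcessFlow {
  processId: string;
  stepSymbolIds: string[]; // symbols linked to the process via STEP_IN_PROCESS edges
}

interface RankedProcess {
  processId: string;
  relevance: number;
  matchedSymbols: SymbolHit[];
}

function rankProcesses(hits: SymbolHit[], flows: ProcessFlow[]): RankedProcess[] {
  const ranked: RankedProcess[] = [];
  for (const flow of flows) {
    const steps = new Set(flow.stepSymbolIds);
    const matchedSymbols = hits.filter((hit) => steps.has(hit.symbolId));
    // Skip processes that contain none of the search hits.
    if (matchedSymbols.length === 0) continue;
    const total = matchedSymbols.reduce((sum, hit) => sum + hit.rrfScore, 0);
    ranked.push({
      processId: flow.processId,
      relevance: total / steps.size, // normalize by step count
      matchedSymbols,
    });
  }
  // Most relevant process first.
  return ranked.sort((a, b) => b.relevance - a.relevance);
}
```

Dividing by step count keeps very long processes from ranking first merely because they touch many symbols.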
MCP Query Tool
The MCP query tool uses hybrid search under the hood.
Parameters:
- query (required) - Search query string
- limit (optional) - Max results (default: 10)
- repo (optional) - Repository name (required if multiple repos indexed)
Returns:
- processes - Execution flows related to the query
- process_symbols - Symbols grouped by process
- definitions - Other relevant symbols not in processes
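The parameter rules above (a required query, a default limit of 10, and repo being mandatory only when several repositories are indexed) can be expressed as a small validation helper. This is a hypothetical sketch for client-side checking, not part of GitNexus or the MCP SDK. The interface and function names, and the behavior of falling back to the single indexed repo, are assumptions.

```typescript
// Hypothetical argument shape mirroring the documented query parameters.
interface QueryArgs {
  query: string; // required: search query string
  limit?: number; // optional: max results (documented default: 10)
  repo?: string; // optional, but required when multiple repos are indexed
}

interface ResolvedQueryArgs {
  query: string;
  limit: number;
  repo: string;
}

const DEFAULT_LIMIT = 10;

// Applies the documented default and the "repo required if multiple repos" rule.
function resolveQueryArgs(args: QueryArgs, indexedRepos: string[]): ResolvedQueryArgs {
  if (args.query.trim().length === 0) throw new Error("query is required");
  if (indexedRepos.length === 0) throw new Error("no repositories are indexed");

  let repo = args.repo;
  if (repo === undefined) {
    if (indexedRepos.length > 1) {
      throw new Error("repo is required when multiple repos are indexed");
    }
    repo = indexedRepos[0]; // assumed fallback: the only indexed repo
  } else if (!indexedRepos.includes(repo)) {
    throw new Error(`unknown repo: ${repo}`);
  }
  return { query: args.query, limit: args.limit ?? DEFAULT_LIMIT, repo };
}
```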
Performance
| Method | Latency | Memory |
|---|---|---|
| BM25 only | ~10ms | Minimal |
| Semantic only | ~50ms | ~200MB (model loaded) |
| Hybrid (RRF) | ~60ms | ~200MB |
GPU acceleration: Semantic search is 5-10x faster on GPU (DirectML/CUDA) compared to CPU.
Example: Searching for “auth”
BM25 Results
Semantic Results
RRF Merged Results
src/auth/index.ts gets the highest score because it appears in both result sets, showing it’s highly relevant by both keyword and semantic criteria.
Customization
You can customize the embedding model during indexing.
Next Steps
- Knowledge Graph: Understand the graph schema
- Processes & Flows: Learn how process-grouped search works