Supported file types are configured in config/file_types.json, which maps file extensions to categories.
Configuration file
File types are defined in config/file_types.json at the project root. The file is a single JSON object:
- Keys: Category names (video, image, text)
- Values: Arrays of file extensions (including the leading dot)
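For illustration, a file with this shape might look like the sketch below. The extensions listed are examples only (.mp4, .jpg, .txt, and .text are the ones mentioned on this page), so check your own config/file_types.json for the real values.

```python
import json

# Hypothetical contents of config/file_types.json.
# The extension lists are illustrative, not the project's actual defaults.
EXAMPLE_CONFIG = """
{
  "video": [".mp4"],
  "image": [".jpg"],
  "text": [".txt", ".text"]
}
"""

# Keys are category names, values are arrays of dotted extensions.
file_types = json.loads(EXAMPLE_CONFIG)

for category, extensions in file_types.items():
    print(f"{category}: {extensions}")
```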
Supported categories
Text files
Text files are indexed with full-text semantic search capabilities. Each file is:
- Read as UTF-8 text content
- Split into chunks if needed
- Embedded using OpenAI’s embedding models
- Stored in the Helix database for semantic search
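The first two steps above (reading and chunking) can be sketched roughly as follows. The function names, the fixed-size character chunking, and the 1000-character default are assumptions for illustration; the actual chunking strategy is not described here, and the embedding and storage steps are omitted.

```python
from pathlib import Path


def split_into_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper: the real service may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def read_text_chunks(path: Path, chunk_size: int = 1000) -> list[str]:
    """Read a file as UTF-8 and return its chunks."""
    return split_into_chunks(path.read_text(encoding="utf-8"), chunk_size)
```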
Currently only .txt and .text files are configured. You can add other plain text formats such as .md, .log, or .csv.
Video files
Video files are processed using multimodal AI to extract visual and audio information. The video indexing process (see backend/services/indexing.py:315-402):
- Computes content hash to detect duplicates
- Extracts frames at regular intervals
- Generates thumbnails stored in videos/thumbnails/
- Creates embeddings for each frame
- Stores metadata and embeddings in the database
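The duplicate-detection step can be illustrated with a content hash. This is a minimal sketch that assumes SHA-256 over the raw file bytes; the hash algorithm and the helper names (compute_content_hash, is_duplicate) are assumptions, not taken from backend/services/indexing.py.

```python
import hashlib
from pathlib import Path


def compute_content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size blocks so large videos are never fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_duplicate(path: Path, seen_hashes: set[str]) -> bool:
    """Report whether this content was already seen; record it if not."""
    content_hash = compute_content_hash(path)
    if content_hash in seen_hashes:
        return True
    seen_hashes.add(content_hash)
    return False
```

Hashing content rather than file names means a renamed or copied file is still recognized as a duplicate.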
Image files
Image files are indexed using vision models. The image indexing process (see backend/services/indexing.py:404-447):
- Computes content hash for deduplication
- Generates visual embeddings
- Stores in the database for similarity search
How file types are loaded
The indexing service (backend/services/indexing.py) loads file types at runtime from the configuration file. The loader:
- Returns an empty dict if the file is missing or invalid JSON
- Creates a flattened mapping: {".mp4": "video", ".jpg": "image", ...}
- Expects extensions to be lowercase with a leading dot
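The behavior listed above could be sketched as follows. This is not the code from backend/services/indexing.py; the function names load_file_types and flatten_file_types are hypothetical, and the log line simply reuses the "Could not load file_types.json" message mentioned under Troubleshooting.

```python
import json
from pathlib import Path


def load_file_types(config_path: Path) -> dict[str, list[str]]:
    """Load the category -> extensions mapping, or {} on any failure."""
    try:
        with config_path.open(encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError) as exc:
        # OSError covers a missing or unreadable file; ValueError covers
        # invalid JSON (json.JSONDecodeError) and invalid UTF-8.
        print(f"Could not load file_types.json: {exc}")
        return {}
    return data if isinstance(data, dict) else {}


def flatten_file_types(file_types: dict[str, list[str]]) -> dict[str, str]:
    """Invert the mapping to extension -> category."""
    return {
        extension: category
        for category, extensions in file_types.items()
        for extension in extensions
    }
```

Flattening once up front makes each per-file lookup a single dictionary access on the file's extension.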
Adding new file types
To add support for new file types:
1. Update the configuration
Edit config/file_types.json to include your extensions.
All extensions must:
- Start with a dot (.)
- Be lowercase
- Be unique (each extension can only belong to one category)
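Before restarting, it can help to check an edited configuration against these three rules. The sketch below is a hypothetical standalone check (validate_file_types is not part of the project) that takes the parsed JSON object and returns a list of human-readable problems.

```python
def validate_file_types(file_types: dict[str, list[str]]) -> list[str]:
    """Return a list of rule violations; an empty list means the config is OK."""
    problems: list[str] = []
    owner: dict[str, str] = {}  # extension -> first category that listed it
    for category, extensions in file_types.items():
        for extension in extensions:
            if not extension.startswith("."):
                problems.append(f"{category}: {extension!r} does not start with a dot")
            if extension != extension.lower():
                problems.append(f"{category}: {extension!r} is not lowercase")
            if extension in owner:
                problems.append(
                    f"{extension!r} is listed under both {owner[extension]!r} and {category!r}"
                )
            else:
                owner[extension] = category
    return problems
```

For example, validate_file_types({"text": [".txt", ".md"]}) returns an empty list, while an entry such as "MD" would be reported for both the missing dot and the uppercase letters.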
2. Restart the backend
The configuration is loaded when the indexing job starts, so restart the backend server.
3. Reindex your files
Run a new indexing job to process files with the new extensions.
Extension matching rules
When indexing a directory, the system:
- Normalizes extensions to lowercase (see backend/services/indexing.py:100-104)
- Walks the directory tree and matches files by extension
- Groups files by category:
  - Text files are indexed in batches
  - Video files are indexed sequentially
  - Image files are indexed in batches
- Applies ignore rules before indexing (see Ignore rules)
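The normalize, walk, and group steps above can be sketched as follows. The function name group_files_by_category is hypothetical, the input is assumed to be the flattened extension-to-category mapping, and ignore rules are deliberately left out of this sketch.

```python
import os
from collections import defaultdict
from pathlib import Path


def group_files_by_category(
    root: Path, extension_to_category: dict[str, str]
) -> dict[str, list[Path]]:
    """Walk root and group matching files by their configured category."""
    grouped: dict[str, list[Path]] = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Normalize so that "CLIP.MP4" matches a ".mp4" entry.
            extension = Path(name).suffix.lower()
            category = extension_to_category.get(extension)
            if category is not None:
                grouped[category].append(Path(dirpath) / name)
    return dict(grouped)
```

Files whose extension is not in the mapping are skipped, which matches the first item under "Files not being indexed" in Troubleshooting.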
Performance considerations
Text files
- Fast indexing
- Processed in configurable batches (default: 10 files)
- Low memory usage
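Batched processing of this kind can be sketched with a small generator. Only the default of 10 comes from this page; iter_batches is a hypothetical helper, and how the batch size is actually configured is not shown here.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 10) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 25 files and the default size, this yields batches of 10, 10, and 5 files.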
Video files
- Slow indexing (frame extraction and encoding)
- Processed sequentially to manage resources
- High CPU and memory usage
- Consider indexing videos separately during off-peak hours
Image files
- Moderate indexing speed
- Processed in batches
- Moderate memory usage
Troubleshooting
Files not being indexed
Check:
- Extension is defined in config/file_types.json
- Extension uses correct format (lowercase, starts with .)
- File is not in the ignore list (see Ignore rules)
- Backend logs for errors, for example: Could not load file_types.json
Invalid JSON error
If you see Could not load file_types.json in the logs:
- Validate your JSON syntax using a tool like JSONLint
- Ensure all strings use double quotes (")
- Check for trailing commas (not allowed in JSON)
Unknown category
The system currently only supports text, video, and image categories. If you add a new category:
- It will be loaded but ignored during indexing
- You need to implement indexing logic in backend/services/indexing.py
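A quick way to spot categories that would be silently ignored is to compare the configured keys against the three supported names. This is a hypothetical check, not part of the project; it takes the parsed contents of config/file_types.json.

```python
# The three categories this page lists as supported.
SUPPORTED_CATEGORIES = {"text", "video", "image"}


def find_unsupported_categories(file_types: dict[str, list[str]]) -> list[str]:
    """Return configured category names that have no indexing logic."""
    return sorted(set(file_types) - SUPPORTED_CATEGORIES)
```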
Source code reference
File type handling is implemented in:
- backend/services/indexing.py:21-49 - Loading and parsing configuration
- backend/services/indexing.py:40-49 - Category filtering functions
- backend/services/indexing.py:52-64 - File collection by extension
