File types

The Search Thing indexes text, video, and image files. The mapping from file extensions to these categories is configured in config/file_types.json.

Configuration file

File types are defined in config/file_types.json at the project root:

config/file_types.json

{
  "video": [".mp4", ".mov"],
  "image": [".jpeg", ".jpg", ".png", ".webp"],
  "text": [".text", ".txt"]
}

The configuration uses a simple structure:

Keys: Category names (video, image, text)
Values: Arrays of file extensions (including the leading dot)

Supported categories

Text files

Text files are indexed with full-text semantic search capabilities. Each file is:

Read as UTF-8 text content
Split into chunks if needed
Embedded using OpenAI’s embedding models
Stored in the Helix database for semantic search
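To illustrate the chunking step above, here is a minimal sketch. It assumes character-based windows with a small overlap; the helper name chunk_text and the default sizes are hypothetical, and the actual splitting strategy used by the indexer may differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    if len(text) <= chunk_size:
        # Short files need no splitting; empty files yield no chunks.
        return [text] if text else []

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk would then be embedded and stored separately, so a search hit can point at the relevant part of a long file.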

"text": [".text", ".txt"]

Only .txt and .text are configured by default. You can add other plain-text formats such as .md, .log, or .csv.

Video files

Video files are processed using multimodal AI to extract visual and audio information:

"video": [".mp4", ".mov"]

The video indexing process (see backend/services/indexing.py:315-402):

Computes content hash to detect duplicates
Extracts frames at regular intervals
Generates thumbnails stored in videos/thumbnails/
Creates embeddings for each frame
Stores metadata and embeddings in the database
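As a rough illustration of "frames at regular intervals", the sketch below computes the timestamps at which frames would be sampled. The function name frame_timestamps and the 5-second default are assumptions for illustration only; the actual interval used in backend/services/indexing.py is not stated here.

```python
def frame_timestamps(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """Return sample times (in seconds) spaced interval_s apart, starting at 0."""
    if duration_s < 0:
        raise ValueError("duration_s must not be negative")
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")

    stamps: list[float] = []
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while i * interval_s < duration_s:
        stamps.append(i * interval_s)
        i += 1
    return stamps


print(frame_timestamps(12, interval_s=5))
```

A longer interval means fewer frames to embed, which trades search granularity for indexing speed.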

Video indexing is resource-intensive and may take significant time for large files or directories.

Image files

Image files are indexed using vision models:

"image": [".jpeg", ".jpg", ".png", ".webp"]

Image indexing (see backend/services/indexing.py:404-447):

Computes content hash for deduplication
Generates visual embeddings
Stores in the database for similarity search
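The deduplication step above can be sketched as follows. This assumes SHA-256 over the raw file bytes; the hash algorithm actually used by the indexer is not specified here, and both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files are never fully loaded."""
    digest = hashlib.sha256()  # Assumed algorithm, for illustration only.
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def unique_by_content(paths: list[Path]) -> list[Path]:
    """Keep the first file seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path in paths:
        file_hash = content_hash(path)
        if file_hash not in seen:
            seen.add(file_hash)
            unique.append(path)
    return unique
```

Hashing content rather than names means a renamed or copied image is still recognized as a duplicate.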

How file types are loaded

The indexing service loads file types at runtime from the configuration file:

backend/services/indexing.py

def _load_extension_to_category() -> dict[str, str]:
    """Load file_types.json; returns mapping ext -> category e.g. {'.mp4': 'video'}.
    Expects extensions to be lowercase with a leading '.'."""
    path = CONFIG_DIR / "file_types.json"
    try:
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        logger.warning("Could not load file_types.json: %s", e)
        return {}
    ext_to_category: dict[str, str] = {}
    for category, ext_list in data.items():
        if not isinstance(ext_list, list):
            continue
        for ext in ext_list:
            ext_to_category[ext] = category
    return ext_to_category

The function:

Returns an empty dict if the file is missing or invalid JSON
Creates a flattened mapping: {".mp4": "video", ".jpg": "image", ...}
Expects extensions to be lowercase with a leading dot
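To show how the flattened mapping might be used, here is a small lookup sketch. The helper category_for is hypothetical, and the mapping literal simply mirrors the default configuration shown earlier.

```python
from pathlib import Path
from typing import Optional

# Same shape as the dict returned by _load_extension_to_category().
EXT_TO_CATEGORY = {
    ".mp4": "video", ".mov": "video",
    ".jpeg": "image", ".jpg": "image", ".png": "image", ".webp": "image",
    ".text": "text", ".txt": "text",
}


def category_for(path: str, ext_to_category: dict[str, str]) -> Optional[str]:
    """Return the category for a file path, or None if its extension is unmapped."""
    # Lowercase the suffix because the mapping keys are expected to be lowercase.
    return ext_to_category.get(Path(path).suffix.lower())


print(category_for("holiday/clip.MP4", EXT_TO_CATEGORY))
print(category_for("archive.zip", EXT_TO_CATEGORY))
```

Because the loader returns an empty dict on failure, every lookup returns None in that case, so no files would be matched.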

Adding new file types

To add support for new file types:

1. Update the configuration

Edit config/file_types.json to include your extensions:

{
  "video": [".mp4", ".mov", ".avi", ".mkv"],
  "image": [".jpeg", ".jpg", ".png", ".webp", ".gif", ".bmp"],
  "text": [".text", ".txt", ".md", ".log", ".json", ".py", ".js"]
}

All extensions must:

Start with a dot (.)
Be lowercase
Be unique (each extension can only belong to one category)
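The three rules above can be checked before restarting the backend. The following is a minimal sketch; find_extension_problems is a hypothetical helper and not part of the project.

```python
def find_extension_problems(config: dict[str, list[str]]) -> list[str]:
    """Return a human-readable message for each rule violation in the config."""
    problems: list[str] = []
    owner: dict[str, str] = {}  # extension -> first category that listed it
    for category, extensions in config.items():
        for ext in extensions:
            if not ext.startswith("."):
                problems.append(f"{category}: {ext!r} does not start with a dot")
            if ext != ext.lower():
                problems.append(f"{category}: {ext!r} is not lowercase")
            if ext in owner:
                problems.append(
                    f"{ext!r} is listed more than once ({owner[ext]}, {category})"
                )
            else:
                owner[ext] = category
    return problems


print(find_extension_problems({"text": [".txt", "md"], "image": [".png"]}))
```

Uniqueness matters because the loader builds a flat extension-to-category dict, so a duplicated extension would silently end up in whichever category is read last.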

2. Restart the backend

The configuration is loaded when the indexing job starts, so restart the backend server:

python backend/app.py

3. Reindex your files

Run a new indexing job to process files with the new extensions:

curl "http://localhost:8000/api/index?dir=/path/to/your/files"

Extension matching rules

When indexing a directory, the system:

Normalizes extensions by trimming whitespace, lowercasing, and adding a leading dot if missing (see backend/services/indexing.py:100-104):

def _normalize_extension(ext: str) -> str:
    ext = ext.strip().lower()
    if not ext:
        return ext
    return ext if ext.startswith(".") else f".{ext}"

Walks the directory tree and matches files by extension
Groups files by category:
- Text files are indexed in batches
- Video files are indexed sequentially
- Image files are indexed in batches
Applies ignore rules before indexing (see Ignore rules)
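The walk-and-group steps above might look roughly like this. It is a sketch under stated assumptions: group_files_by_category is a hypothetical name, and ignore rules are deliberately left out.

```python
import os
from collections import defaultdict
from pathlib import Path


def group_files_by_category(
    root: Path, ext_to_category: dict[str, str]
) -> dict[str, list[Path]]:
    """Walk root recursively and bucket files by the category of their extension."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Lowercase the suffix so "A.TXT" matches the ".txt" key.
            category = ext_to_category.get(path.suffix.lower())
            if category is not None:
                groups[category].append(path)
    return dict(groups)
```

Files whose extension is not in the mapping are skipped, which is why an unconfigured extension never reaches the indexer.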

Performance considerations

Text files

Fast indexing
Processed in configurable batches (default: 10 files)
Low memory usage
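Batch processing with the default size of 10 can be pictured with the sketch below. The helper iter_batches is hypothetical; how the indexer actually forms and configures its batches is not shown in this page.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 10) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


files = [f"doc_{i}.txt" for i in range(25)]
print([len(batch) for batch in iter_batches(files)])
```

With 25 files and the default size, this produces two full batches and one partial batch of 5.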

Video files

Slow indexing (frame extraction and encoding)
Processed sequentially to manage resources
High CPU and memory usage
Consider indexing videos separately during off-peak hours

Image files

Moderate indexing speed
Processed in batches
Moderate memory usage

Be cautious when adding binary or compressed formats (e.g., .zip, .pdf, .docx) to the text category. Text files are read as UTF-8, and these formats require parsing logic that is not currently implemented.

Troubleshooting

Files not being indexed

Check:

Extension is defined in config/file_types.json
Extension uses correct format (lowercase, starts with .)
File is not in the ignore list (see Ignore rules)
Backend logs do not contain errors such as: Could not load file_types.json

Invalid JSON error

If you see Could not load file_types.json in the logs:

Validate your JSON syntax using a tool like JSONLint
Ensure all strings use double quotes (")
Check for trailing commas (not allowed in JSON)
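If you prefer to check the file locally, Python's standard json module reports where parsing failed. This is a minimal sketch; describe_json_error is a hypothetical helper name.

```python
import json
from typing import Optional


def describe_json_error(text: str) -> Optional[str]:
    """Return a short description of the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        return f"line {e.lineno}, column {e.colno}: {e.msg}"
    return None


# A trailing comma after the last array element is invalid JSON.
print(describe_json_error('{"text": [".txt", ".md",]}'))
print(describe_json_error('{"text": [".txt", ".md"]}'))
```

To check the real file, read it with Path("config/file_types.json").read_text(encoding="utf-8") and pass the result to the function.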

Unknown category

The system currently only supports text, video, and image categories. If you add a new category:

It will be loaded but ignored during indexing
You need to implement indexing logic in backend/services/indexing.py
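A quick way to spot categories that would be ignored is to compare the configuration keys against the supported set. The sketch below assumes the three category names listed above; unknown_categories is a hypothetical helper.

```python
SUPPORTED_CATEGORIES = {"text", "video", "image"}


def unknown_categories(config: dict[str, list[str]]) -> list[str]:
    """Return config keys that have no indexing logic, sorted for stable output."""
    return sorted(set(config) - SUPPORTED_CATEGORIES)


config = {"text": [".txt"], "audio": [".mp3", ".wav"]}
print(unknown_categories(config))
```

Any name this returns would need matching indexing logic before its files are processed.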

Source code reference

File type handling is implemented in:

backend/services/indexing.py:21-49 - Loading and parsing configuration
backend/services/indexing.py:40-49 - Category filtering functions
backend/services/indexing.py:52-64 - File collection by extension
