The Search Thing supports indexing multiple file types including text, video, and images. File type configuration is managed through config/file_types.json, which maps file extensions to categories.

Configuration file

File types are defined in config/file_types.json at the project root:
config/file_types.json
{
  "video": [".mp4", ".mov"],
  "image": [".jpeg", ".jpg", ".png", ".webp"],
  "text": [".text", ".txt"]
}
The configuration uses a simple structure:
  • Keys: Category names (video, image, text)
  • Values: Arrays of file extensions (including the leading dot)

Supported categories

Text files

Text files are indexed with full-text semantic search capabilities. Each file is:
  1. Read as UTF-8 text content
  2. Split into chunks if needed
  3. Embedded using OpenAI’s embedding models
  4. Stored in the Helix database for semantic search
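The chunking strategy is not described here. As a minimal sketch of step 2, assuming fixed-size character windows with a small overlap, the split could look like this (the function name chunk_text and the max_chars and overlap parameters are hypothetical, not taken from the codebase):

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    Hypothetical helper: the real indexer may chunk by tokens,
    sentences, or paragraphs instead.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    if len(text) <= max_chars:
        # Short files need no splitting; empty files yield no chunks.
        return [text] if text else []

    step = max_chars - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding a little text twice.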
"text": [".text", ".txt"]
Only .txt and .text are configured by default. Other plain-text formats, such as .md, .log, or .csv, can be added to the text array.

Video files

Video files are processed using multimodal AI to extract visual and audio information:
"video": [".mp4", ".mov"]
The video indexing process (see backend/services/indexing.py:315-402):
  1. Computes content hash to detect duplicates
  2. Extracts frames at regular intervals
  3. Generates thumbnails stored in videos/thumbnails/
  4. Creates embeddings for each frame
  5. Stores metadata and embeddings in the database
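Step 1 can be illustrated with a streamed file hash. This is a sketch only: the hash algorithm used by the indexer is not stated here, so SHA-256 is an assumption, and content_hash and find_duplicates are hypothetical names.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Return a hex digest of the file contents, read in 1 MiB blocks.

    SHA-256 is an assumption; the real indexer may use another algorithm.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Stream the file so large videos are never loaded fully into memory.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(paths: list[Path]) -> list[tuple[Path, Path]]:
    """Return (duplicate, first_seen) pairs for files with identical content."""
    seen: dict[str, Path] = {}
    duplicates: list[tuple[Path, Path]] = []
    for path in paths:
        h = content_hash(path)
        if h in seen:
            duplicates.append((path, seen[h]))
        else:
            seen[h] = path
    return duplicates
```

Hashing the content rather than the path means a renamed or copied video is still recognized as already indexed.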
Video indexing is resource-intensive and may take significant time for large files or directories.

Image files

Image files are indexed using vision models:
"image": [".jpeg", ".jpg", ".png", ".webp"]
Image indexing (see backend/services/indexing.py:404-447):
  1. Computes content hash for deduplication
  2. Generates visual embeddings
  3. Stores in the database for similarity search
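To make "similarity search" concrete, here is a small sketch that ranks stored embeddings against a query vector. It assumes cosine similarity as the metric, which this page does not confirm, and the function names are hypothetical. In the real system the database performs this comparison.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], stored: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```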

How file types are loaded

The indexing service loads file types at runtime from the configuration file:
backend/services/indexing.py
def _load_extension_to_category() -> dict[str, str]:
    """Load file_types.json; returns mapping ext -> category e.g. {'.mp4': 'video'}.
    Expects extensions to be lowercase with a leading '.'."""
    path = CONFIG_DIR / "file_types.json"
    try:
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        logger.warning("Could not load file_types.json: %s", e)
        return {}
    ext_to_category: dict[str, str] = {}
    for category, ext_list in data.items():
        if not isinstance(ext_list, list):
            continue
        for ext in ext_list:
            ext_to_category[ext] = category
    return ext_to_category
The function:
  • Returns an empty dict if the file is missing or contains invalid JSON
  • Skips any category whose value is not a list
  • Creates a flattened mapping: {".mp4": "video", ".jpg": "image", ...}
  • Expects extensions to be lowercase with a leading dot; it does not normalize them itself
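The flattening loop can be tried on an in-memory dict. The sketch below copies only that loop into a standalone function (flatten_file_types is a hypothetical name used for illustration) and shows one consequence of the design: if an extension is listed under two categories, the later category silently replaces the earlier one.

```python
def flatten_file_types(data: dict) -> dict[str, str]:
    """Standalone copy of the flattening loop, for experimentation only."""
    ext_to_category: dict[str, str] = {}
    for category, ext_list in data.items():
        if not isinstance(ext_list, list):
            continue  # non-list values are skipped, as in the loader
        for ext in ext_list:
            ext_to_category[ext] = category  # a repeated extension is overwritten
    return ext_to_category


config = {
    "video": [".mp4", ".mov"],
    "image": [".jpg"],
    "text": [".txt"],
    "notes": "not a list",  # ignored because the value is not a list
}
mapping = flatten_file_types(config)
print(mapping)

# An extension listed twice ends up in the last category that names it.
print(flatten_file_types({"text": [".md"], "image": [".md"]}))
```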

Adding new file types

To add support for new file types:

1. Update the configuration

Edit config/file_types.json to include your extensions:
{
  "video": [".mp4", ".mov", ".avi", ".mkv"],
  "image": [".jpeg", ".jpg", ".png", ".webp", ".gif", ".bmp"],
  "text": [".text", ".txt", ".md", ".log", ".json", ".py", ".js"]
}
All extensions must:
  • Start with a dot (.)
  • Be lowercase
  • Be unique (each extension can only belong to one category)
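These three rules can be checked before restarting the backend. The following is a minimal sketch of such a check; validate_file_types is a hypothetical helper and is not part of the project.

```python
def validate_file_types(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: list[str] = []
    owner: dict[str, str] = {}  # extension -> first category that listed it
    for category, ext_list in data.items():
        if not isinstance(ext_list, list):
            problems.append(f"{category}: value must be a list of extensions")
            continue
        for ext in ext_list:
            if not isinstance(ext, str) or not ext.startswith("."):
                problems.append(f"{category}: {ext!r} must start with a dot")
                continue
            if ext != ext.lower():
                problems.append(f"{category}: {ext!r} must be lowercase")
            if ext in owner and owner[ext] != category:
                problems.append(
                    f"{ext!r} appears in both {owner[ext]!r} and {category!r}"
                )
            owner.setdefault(ext, category)
    return problems
```

For example, validate_file_types({"text": [".TXT", "md"], "image": [".png"], "video": [".png"]}) reports three problems: an uppercase extension, a missing dot, and an extension shared by two categories.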

2. Restart the backend

The configuration is loaded when an indexing job starts. Restart the backend server to make sure the updated file is picked up:
python backend/app.py

3. Reindex your files

Run a new indexing job to process files with the new extensions:
curl "http://localhost:8000/api/index?dir=/path/to/your/files"
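If the directory path contains spaces or other special characters, the dir query parameter needs to be URL-encoded. A small sketch that builds the same request URL with proper encoding follows; build_index_url is a hypothetical helper, and the host, port, and endpoint are simply the ones from the curl example above.

```python
from urllib.parse import urlencode


def build_index_url(directory: str, base_url: str = "http://localhost:8000") -> str:
    """Build the indexing URL with the directory safely URL-encoded."""
    query = urlencode({"dir": directory})
    return f"{base_url}/api/index?{query}"


print(build_index_url("/path/to/my files"))
# http://localhost:8000/api/index?dir=%2Fpath%2Fto%2Fmy+files
```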

Extension matching rules

When indexing a directory, the system:
  1. Normalizes extensions by stripping whitespace, lowercasing, and adding a leading dot if it is missing (see backend/services/indexing.py:100-104):
    def _normalize_extension(ext: str) -> str:
        ext = ext.strip().lower()
        if not ext:
            return ext
        return ext if ext.startswith(".") else f".{ext}"
    
  2. Walks the directory tree and matches files by extension
  3. Groups files by category:
    • Text files are indexed in batches
    • Video files are indexed sequentially
    • Image files are indexed in batches
  4. Applies ignore rules before indexing (see Ignore rules)
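Steps 2 and 3 can be sketched as a directory walk that looks up each file's lowercased suffix in the extension mapping. This is an illustration under assumptions, not the project's implementation: group_files_by_category is a hypothetical name, and ignore rules are not modeled.

```python
import os
from collections import defaultdict
from pathlib import Path


def group_files_by_category(
    root: Path, ext_to_category: dict[str, str]
) -> dict[str, list[Path]]:
    """Walk root and group files by the category of their extension."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            ext = path.suffix.lower()  # ".TXT" and ".txt" match the same entry
            category = ext_to_category.get(ext)
            if category is not None:  # files with unknown extensions are skipped
                groups[category].append(path)
    return dict(groups)
```

Note that Path.suffix returns only the final extension, so a file named archive.tar.gz would be looked up as .gz.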

Performance considerations

Text files

  • Fast indexing
  • Processed in configurable batches (default: 10 files)
  • Low memory usage
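Batching here means splitting the list of files into fixed-size groups that are processed one group at a time. A minimal sketch, assuming the default of 10 files per batch mentioned above (batched is a hypothetical helper, and how the real batch size is configured is not shown on this page):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 10) -> Iterator[list[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


files = [f"doc_{i}.txt" for i in range(25)]
for batch in batched(files):
    print(len(batch))  # prints 10, 10, 5
```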

Video files

  • Slow indexing (frame extraction and encoding)
  • Processed sequentially to manage resources
  • High CPU and memory usage
  • Consider indexing videos separately during off-peak hours

Image files

  • Moderate indexing speed
  • Processed in batches
  • Moderate memory usage
Be cautious when adding binary or compressed formats (e.g., .zip, .pdf, .docx) to the text category. Text files are read as UTF-8, and these formats require special parsing logic that is not currently implemented.

Troubleshooting

Files not being indexed

Check that:
  1. The extension is defined in config/file_types.json
  2. The extension uses the correct format (lowercase, starts with .)
  3. The file is not excluded by the ignore list (see Ignore rules)
  4. The backend logs do not contain the message Could not load file_types.json

Invalid JSON error

If you see Could not load file_types.json in the logs:
  • Validate your JSON syntax using a tool like JSONLint
  • Ensure all strings use double quotes (")
  • Check for trailing commas (not allowed in JSON)
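Python's standard json module can also locate the syntax error. The sketch below reports the line and column of the first problem; check_json is a hypothetical helper name.

```python
import json
from typing import Optional


def check_json(text: str) -> Optional[str]:
    """Return None if text is valid JSON, else a message with the error position."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        return f"line {e.lineno}, column {e.colno}: {e.msg}"
    return None


print(check_json('{"text": [".txt"]}'))    # valid, prints None
print(check_json('{"text": [".txt",]}'))   # trailing comma, prints an error position
print(check_json("{'text': ['.txt']}"))    # single quotes, prints an error position
```

To check the real file, pass its contents, for example check_json(Path("config/file_types.json").read_text(encoding="utf-8")) after importing Path from pathlib.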

Unknown category

The system currently only supports text, video, and image categories. If you add a new category:
  • It will be loaded but ignored during indexing
  • You need to implement indexing logic in backend/services/indexing.py
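Because unknown categories are skipped without an error, a quick check of the configuration keys can catch them early. A minimal sketch, assuming the three supported category names listed above (unknown_categories and SUPPORTED_CATEGORIES are hypothetical names):

```python
# Category names taken from this page; adjust if the backend adds more.
SUPPORTED_CATEGORIES = {"text", "video", "image"}


def unknown_categories(data: dict) -> list[str]:
    """Return configured category names that the indexer would ignore."""
    return sorted(set(data) - SUPPORTED_CATEGORIES)


print(unknown_categories({"text": [".txt"], "audio": [".mp3"]}))  # ['audio']
```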

Source code reference

File type handling is implemented in:
  • backend/services/indexing.py:21-49 - Loading and parsing configuration
  • backend/services/indexing.py:40-49 - Category filtering functions
  • backend/services/indexing.py:52-64 - File collection by extension
