Key Capabilities
Spice combines four industry-standard APIs for building AI applications:

- OpenAI-Compatible APIs - HTTP APIs for chat completions and embeddings with SDK compatibility
- Local Model Serving - Run LLMs and embedding models locally with hardware acceleration
- Model Gateway - Connect to hosted providers (OpenAI, Anthropic, xAI, AWS Bedrock, Azure)
- MCP Integration - Tool/function calling via Model Context Protocol (MCP) using HTTP+SSE
Architecture
Spice’s AI-native architecture provides:

- Unified Data + AI Runtime: Query data and run inference in a single engine
- OpenAI SDK Compatibility: Drop-in replacement for OpenAI client libraries
- Hardware Acceleration: CUDA and Metal support for local model inference
- Vector Search Integration: Native support for RAG workflows with vector similarity search
- Flexible Deployment: Run as sidecar, microservice, or cluster from edge to cloud
Use Cases
Retrieval-Augmented Generation (RAG)
Combine vector similarity search with LLM inference for context-aware responses.

Text-to-SQL (NSQL)
Convert natural language queries into SQL using built-in prompt templates.

AI Agents with Tools
Build intelligent agents with function calling via MCP.

Embeddings Pipeline
Generate embeddings at scale for semantic search.

Model Lifecycle
Spice manages the complete model lifecycle:

- Model Loading: Automatic download from HuggingFace, local filesystem, or Spice.ai Cloud
- Format Support: GGUF, GGML, SafeTensor for LLMs; ONNX and Model2Vec for embeddings
- Hardware Acceleration: Automatic CUDA/Metal detection and utilization
- Rate Limiting: Built-in rate controllers for API providers
- Caching: Request and result caching for improved performance
Supported Providers
LLM Providers
- OpenAI - GPT-4, GPT-4o, GPT-3.5-turbo models
- Anthropic - Claude 3 Opus, Sonnet, Haiku models
- xAI - Grok models
- AWS Bedrock - Amazon Nova, Anthropic Claude via Bedrock
- Azure OpenAI - Azure-hosted OpenAI models
- Local Models - GGUF/GGML/SafeTensor formats with llama.cpp acceleration
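As an illustration, hosted providers are wired up in `spicepod.yaml`. The fragment below is a sketch for the OpenAI provider; the model name, component name, and exact parameter keys are assumptions to verify against the model provider reference for your Spice version.

```yaml
# Hypothetical spicepod.yaml fragment: route chat completions to a hosted
# OpenAI model. "gpt" is the name clients use when calling the Spice APIs.
models:
  - name: gpt
    from: openai:gpt-4o                             # provider:model_id
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }   # resolved from a secret store
```

Swapping the `from:` prefix (e.g. to an Anthropic or Bedrock identifier) follows the same pattern for other hosted providers.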
Embedding Providers
- OpenAI - text-embedding-3-small, text-embedding-3-large
- AWS Bedrock - Amazon Titan, Cohere embeddings
- HuggingFace - Any ONNX-compatible embedding model
- Model2Vec - 500x faster static embeddings
- Local Models - ONNX format with hardware acceleration
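Embedding models are declared alongside LLMs in the spicepod. This sketch assumes a hosted OpenAI embedding model and the same secret-based key convention as above; field names should be checked against the embeddings reference.

```yaml
# Hypothetical spicepod.yaml fragment: a hosted embedding model named "embed".
embeddings:
  - name: embed
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
```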
Getting Started
1. Configure a Model
Add a model to your spicepod.yaml:
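A minimal sketch of a local model definition, assuming a HuggingFace-hosted model; the model path is illustrative, so substitute one you have access to:

```yaml
# Hypothetical spicepod.yaml: download a model from HuggingFace and serve it locally.
models:
  - name: local_llm
    from: huggingface:huggingface.co/microsoft/Phi-3-mini-4k-instruct
```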
2. Start Spice Runtime
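With the spicepod defined, the runtime is started with the Spice CLI (installation is covered in the project's install docs):

```shell
# From the directory containing spicepod.yaml, start the runtime.
# Spice downloads and loads the configured models on startup.
spice run
```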
3. Query with OpenAI SDK
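A sketch of querying the runtime's OpenAI-compatible chat completions endpoint using only the Python standard library. The base URL and port (8090) and the model name (`local_llm`) are assumptions to adjust for your deployment; because the API is OpenAI-compatible, the official OpenAI SDK also works by pointing its `base_url` at the runtime.

```python
# Build an OpenAI-style /v1/chat/completions request against a local Spice runtime.
import json
from urllib import request

SPICE_BASE_URL = "http://localhost:8090/v1"  # assumed default HTTP endpoint


def build_chat_request(model: str, prompt: str) -> request.Request:
    """Construct an OpenAI-compatible chat completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{SPICE_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("local_llm", "Summarize the loaded datasets.")
# To actually send it (requires a running Spice instance):
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```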
Performance Considerations
Local Model Serving
- CUDA Acceleration: Automatic GPU utilization on NVIDIA hardware
- Metal Acceleration: Optimized for Apple Silicon (M1/M2/M3)
- Memory Management: Models loaded on-demand, unloaded when idle
- Batch Processing: Automatic request batching for throughput optimization
Model2Vec Embeddings
For embedding-intensive workloads, Model2Vec provides:

- 500x Faster: Static embeddings vs. transformer models
- Lower Memory: Minimal memory footprint
- CPU-Optimized: Efficient on CPU without GPU requirements
Next Steps
OpenAI Compatibility
Learn about OpenAI-compatible APIs and endpoints
Model Providers
Configure hosted and local model providers
Embeddings
Generate embeddings for semantic search
RAG
Build retrieval-augmented generation workflows