Spice.ai provides a unified runtime for both data query and AI inference, making it the ideal foundation for data-grounded AI applications and intelligent agents.

Key Capabilities

Spice combines four industry-standard APIs for building AI applications:
  1. OpenAI-Compatible APIs - HTTP APIs for chat completions and embeddings with SDK compatibility
  2. Local Model Serving - Run LLMs and embedding models locally with hardware acceleration
  3. Model Gateway - Connect to hosted providers (OpenAI, Anthropic, xAI, AWS Bedrock, Azure)
  4. MCP Integration - Tool/function calling via Model Context Protocol (MCP) using HTTP+SSE

Architecture

Spice’s AI-native architecture provides:
  • Unified Data + AI Runtime: Query data and run inference in a single engine
  • OpenAI SDK Compatibility: Drop-in replacement for OpenAI client libraries
  • Hardware Acceleration: CUDA and Metal support for local model inference
  • Vector Search Integration: Native support for RAG workflows with vector similarity search
  • Flexible Deployment: Run as sidecar, microservice, or cluster from edge to cloud

Use Cases

Retrieval-Augmented Generation (RAG)

Combine vector similarity search with LLM inference for context-aware responses:
SELECT content, _score
FROM vector_search(documents, 'machine learning algorithms', 10)
WHERE category = 'technical'
ORDER BY _score DESC;
Learn more: RAG Documentation
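Conceptually, `vector_search` embeds the query string and ranks rows by similarity between that query vector and each row's stored embedding. A minimal sketch of that ranking in pure Python, using toy 3-dimensional vectors (this illustrates the idea, not Spice's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(rows, query_embedding, k):
    """Score every row against the query vector and keep the top k (toy version)."""
    scored = [
        {**row, "_score": cosine_similarity(row["embedding"], query_embedding)}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["_score"], reverse=True)[:k]

# Toy document store with pre-computed embeddings.
documents = [
    {"content": "Intro to ML algorithms", "category": "technical", "embedding": [0.9, 0.1, 0.0]},
    {"content": "Company picnic notes",   "category": "general",   "embedding": [0.0, 0.2, 0.9]},
    {"content": "Gradient descent guide", "category": "technical", "embedding": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.2, 0.0]  # stand-in for the embedded query string
top = [r for r in vector_search(documents, query, 10) if r["category"] == "technical"]
print([r["content"] for r in top])
```

The `WHERE category = 'technical'` clause in the SQL above corresponds to the post-search filter in the last line.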

Text-to-SQL (NSQL)

Convert natural language queries into SQL using built-in prompt templates:
spice sql
sql> nsql "show me the top 10 customers by revenue"
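Text-to-SQL works by combining the table schema and the natural-language question into a prompt for the LLM. A hypothetical sketch of such a prompt template (the wording and function name here are illustrative, not Spice's built-in template):

```python
def build_nsql_prompt(schema: str, question: str) -> str:
    """Assemble a text-to-SQL prompt from a schema and a question (illustrative)."""
    return (
        "You are a SQL assistant. Given the schema below, "
        "answer the question with a single SQL query.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\nSQL:"
    )

schema = "CREATE TABLE customers (id INT, name TEXT, revenue DECIMAL);"
prompt = build_nsql_prompt(schema, "show me the top 10 customers by revenue")
print(prompt)
```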

AI Agents with Tools

Build intelligent agents with function calling via MCP:
models:
  - from: openai:gpt-4o-mini
    name: my-agent
    params:
      openai_api_key: ${secrets:openai_key}

tools:
  - from: mcp:http://localhost:3000
    name: database-tools
Learn more: MCP Integration
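When the model emits a tool call, the runtime routes it to the matching tool and feeds the result back into the conversation as a tool-role message. A minimal dispatch loop with hypothetical tool names, using the OpenAI chat-completions tool-call shape (this sketches the pattern, not the MCP wire protocol):

```python
import json

# Registry of locally available tools (hypothetical examples).
TOOLS = {
    "list_tables": lambda args: ["customers", "orders"],
    "run_query": lambda args: [{"id": 1, "revenue": 100.0}],
}

def dispatch_tool_call(call: dict) -> dict:
    """Execute one model-emitted tool call and return a tool-role message."""
    name = call["function"]["name"]
    args = json.loads(call["function"]["arguments"] or "{}")
    result = TOOLS[name](args)
    return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}

# A tool call shaped like the OpenAI chat-completions format.
call = {"id": "call_1", "function": {"name": "list_tables", "arguments": "{}"}}
print(dispatch_tool_call(call))
```

The returned message is appended to the conversation and sent back to the model, which then produces its final answer.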

Embeddings Pipeline

Generate embeddings at scale for semantic search:
columns:
  - name: description
    embeddings:
      - from: text-embedding
        row_id:
          - id
Learn more: Embeddings
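Conceptually, the pipeline embeds the configured column for each row and keys the resulting vector by the `row_id` column so search hits can be joined back to their source rows. A toy sketch with a stand-in embedding function (character frequencies, not the model Spice would actually call):

```python
def embed(text: str) -> list[float]:
    """Stand-in embedding: character-frequency vector over a tiny alphabet."""
    alphabet = "abcde"
    return [text.lower().count(ch) / max(len(text), 1) for ch in alphabet]

def embed_column(rows, column, row_id):
    """Embed one column per row, keyed by the row's id column."""
    return {row[row_id]: embed(row[column]) for row in rows}

rows = [
    {"id": 1, "description": "cheap beads"},
    {"id": 2, "description": "decade dance"},
]
index = embed_column(rows, column="description", row_id="id")
print(sorted(index))  # row ids that now have vectors
```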

Model Lifecycle

Spice manages the complete model lifecycle:
  1. Model Loading: Automatic download from HuggingFace, local filesystem, or Spice.ai Cloud
  2. Format Support: GGUF, GGML, Safetensors for LLMs; ONNX and Model2Vec for embeddings
  3. Hardware Acceleration: Automatic CUDA/Metal detection and utilization
  4. Rate Limiting: Built-in rate controllers for API providers
  5. Caching: Request and result caching for improved performance
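Items 4 and 5 can be pictured as a token-bucket limiter sitting in front of a result cache: the bucket throttles calls to the upstream provider, and repeated prompts are served from the cache without a call at all. A simplified sketch of both (illustrative only, not Spice's internals):

```python
import time
from functools import lru_cache

class TokenBucket:
    """Allow up to `rate` requests per second, with burst capacity `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self) -> bool:
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    """Stand-in for an upstream model call; identical prompts hit the cache."""
    return f"response to: {prompt}"

bucket = TokenBucket(rate=10, capacity=2)
print([bucket.allow() for _ in range(3)])  # third call exceeds the burst
```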

Supported Providers

LLM Providers

  • OpenAI - GPT-4, GPT-4o, GPT-3.5-turbo models
  • Anthropic - Claude 3 Opus, Sonnet, Haiku models
  • xAI - Grok models
  • AWS Bedrock - Amazon Nova, Anthropic Claude via Bedrock
  • Azure OpenAI - Azure-hosted OpenAI models
  • Local Models - GGUF/GGML/Safetensors formats with llama.cpp acceleration

Embedding Providers

  • OpenAI - text-embedding-3-small, text-embedding-3-large
  • AWS Bedrock - Amazon Titan, Cohere embeddings
  • HuggingFace - Any ONNX-compatible embedding model
  • Model2Vec - 500x faster static embeddings
  • Local Models - ONNX format with hardware acceleration

Getting Started

1. Configure a Model

Add a model to your spicepod.yaml:
version: v1
kind: Spicepod
name: my-app

models:
  - from: openai:gpt-4o-mini
    name: chat-model
    params:
      openai_api_key: ${secrets:openai_key}

embeddings:
  - from: openai:text-embedding-3-small
    name: text-embedding
    params:
      openai_api_key: ${secrets:openai_key}

2. Start Spice Runtime

spice run

3. Query with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8090/v1",
    api_key="not-needed"  # Spice handles auth
)

# Chat completion
response = client.chat.completions.create(
    model="chat-model",
    messages=[
        {"role": "user", "content": "Explain RAG in one sentence"}
    ]
)

print(response.choices[0].message.content)

# Generate embeddings
embedding = client.embeddings.create(
    model="text-embedding",
    input="machine learning algorithms"
)

print(embedding.data[0].embedding)

Performance Considerations

Local Model Serving

  • CUDA Acceleration: Automatic GPU utilization on NVIDIA hardware
  • Metal Acceleration: Optimized for Apple Silicon (M1/M2/M3)
  • Memory Management: Models loaded on-demand, unloaded when idle
  • Batch Processing: Automatic request batching for throughput optimization
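Batching groups pending requests so a single forward pass through the model serves many callers. A simplified accumulator that flushes whenever it reaches a maximum batch size (illustrative only; a real server would also flush on a timeout):

```python
class Batcher:
    """Collect items and flush them in groups of up to `max_batch`."""
    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.pending: list[str] = []
        self.flushed: list[list[str]] = []

    def submit(self, item: str):
        self.pending.append(item)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        # Hand the accumulated group to the model as one batch.
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []

batcher = Batcher(max_batch=4)
for i in range(10):
    batcher.submit(f"req-{i}")
batcher.flush()  # flush the remainder
print([len(b) for b in batcher.flushed])  # batches of 4, 4, 2
```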

Model2Vec Embeddings

For embedding-intensive workloads, Model2Vec provides:
  • 500x Faster: Static embeddings vs. transformer models
  • Lower Memory: Minimal memory footprint
  • CPU-Optimized: Efficient on CPU without GPU requirements
embeddings:
  - from: model2vec:minishlab/potion-base-8M
    name: fast-embed
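Static embeddings are fast because each token's vector is precomputed: embedding a sentence is a table lookup plus an average, with no transformer forward pass. A toy illustration of the idea (tiny made-up vocabulary, not the potion-base-8M weights):

```python
# Precomputed per-token vectors (toy 2-D vocabulary).
VOCAB = {
    "machine":  [0.9, 0.1],
    "learning": [0.8, 0.2],
    "picnic":   [0.1, 0.9],
}

def static_embed(text: str) -> list[float]:
    """Average the precomputed vectors of known tokens: a lookup, not inference."""
    vectors = [VOCAB[tok] for tok in text.lower().split() if tok in VOCAB]
    if not vectors:
        return [0.0, 0.0]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(2)]

print(static_embed("machine learning"))  # averages to roughly [0.85, 0.15]
```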

Next Steps

OpenAI Compatibility

Learn about OpenAI-compatible APIs and endpoints

Model Providers

Configure hosted and local model providers

Embeddings

Generate embeddings for semantic search

RAG

Build retrieval-augmented generation workflows
