Overview
Ubicloud’s AI Inference Endpoints let you deploy machine learning models as scalable API endpoints. Each endpoint runs your chosen model with automatic load balancing, health monitoring, and usage tracking for billing.
Architecture
An inference endpoint consists of:
- Replicas: VM instances running your model with the inference engine
- Load Balancer: Distributes requests across healthy replicas with SSL termination
- Private Subnet: Isolated network environment for the endpoint
- Inference Gateway: Handles authentication, rate limiting, and usage tracking
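As a rough sketch of how these pieces relate on the endpoint record (the association names below are assumptions for illustration, not necessarily the exact model API):

endpoint = InferenceEndpoint[name: "my-llm-endpoint"]  # look up an endpoint (illustrative)
endpoint.replicas.count   # VM replicas running the inference engine
endpoint.load_balancer    # distributes requests and terminates SSL
endpoint.private_subnet   # isolated network environment for the endpoint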
Creating an Endpoint
Using Predefined Models
Create an endpoint from Ubicloud’s model catalog:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble_with_model(
project_id: project.id,
location_id: location.id,
name: "my-llm-endpoint",
model_id: "llama-3-8b",
replica_count: 2,
is_public: false
)
Custom Configuration
For custom models, specify all parameters:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble(
project_id: project.id,
location_id: location.id,
name: "custom-model",
boot_image: "ai-ubuntu-2204-vllm",
vm_size: "standard-gpu-8",
model_name: "meta-llama/Meta-Llama-3-8B",
engine: "vllm",
engine_params: "--max-model-len 4096 --gpu-memory-utilization 0.9",
storage_volumes: [{size_gib: 100, encrypted: true}],
replica_count: 1,
is_public: false,
gpu_count: 1,
max_requests: 500,
max_project_rps: 100,
max_project_tps: 10000
)
Endpoints are created in the project identified by Config.inference_endpoint_service_project_id and use a dedicated firewall named inference-endpoint-firewall.
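As a sketch, the service project and its firewall can be located through the usual model lookups (assuming standard Project and Firewall queries):

# Sketch: locate the shared service project that hosts inference endpoints.
service_project = Project[Config.inference_endpoint_service_project_id]
firewall = Firewall.where(project_id: service_project.id, name: "inference-endpoint-firewall").first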
Inference Engines
vLLM
The default engine for serving LLMs with optimized inference:
engine: "vllm"
engine_params: "--max-model-len 4096 --tensor-parallel-size 2"
For CPU-only inference, vLLM automatically uses the vllm-cpu environment:
/opt/miniconda/envs/vllm-cpu/bin/vllm serve /ie/models/model \
--served-model-name meta-llama/Meta-Llama-3-8B \
--disable-log-requests --host 127.0.0.1 \
--max-model-len 4096
RunPod Integration
For external GPU providers:
engine: "runpod"
external_config: {
"data_center": "EU-RO-1",
"gpu_count": 1,
"gpu_type": "NVIDIA A100 80GB PCIe",
"disk_gib": 50,
"min_vcpu_count": 8,
"min_memory_gib": 32,
"image_name": "runpod/pytorch:2.1.0-py3.10-cuda12.1.0-devel",
"model_name_hf": "meta-llama/Meta-Llama-3-8B"
}
Making Requests
Send chat completion requests to your endpoint:
response = inference_endpoint.chat_completion_request(
"What is cloud computing?",
hostname,
api_key
)
The endpoint follows the OpenAI-compatible API format:
curl -X POST https://your-endpoint.ai.ubicloud.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
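The same request can be made from Ruby with only the standard library (a sketch; substitute your endpoint hostname, model name, and API key):

require "net/http"
require "json"

uri = URI("https://your-endpoint.ai.ubicloud.com/v1/chat/completions")
request = Net::HTTP::Post.new(uri)
request["Content-Type"] = "application/json"
request["Authorization"] = "Bearer YOUR_API_KEY"
request.body = {
  model: "meta-llama/Meta-Llama-3-8B",
  messages: [{role: "user", content: "Hello!"}]
}.to_json

# Send the request over TLS and print the assistant's reply.
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts JSON.parse(response.body).dig("choices", 0, "message", "content")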
Scaling and Replicas
Automatic Reconciliation
The endpoint automatically maintains the desired replica count:
actual_replica_count = replicas.count { !(it.destroy_set? || it.strand.label == "destroy") }
desired_replica_count = inference_endpoint.replica_count
if actual_replica_count < desired_replica_count
# Add new replicas
(desired_replica_count - actual_replica_count).times do
Prog::Ai::InferenceEndpointReplicaNexus.assemble(inference_endpoint.id)
end
elsif actual_replica_count > desired_replica_count
# Remove excess replicas (prioritize running ones)
victims.each(&:incr_destroy)
end
Updating Replica Count
inference_endpoint.update(replica_count: 3)
# Replicas will be automatically added or removed during the next reconciliation
| Step | Description |
|---|---|
| Update replica count | Set the desired replica_count on the inference endpoint |
| Reconciliation | The nexus program reconciles every 60 seconds in the wait state |
| Replica provisioning | New replicas are created or excess replicas are destroyed |
Health Monitoring
Each replica monitors its health through the load balancer:
- Health Check Protocol: HTTPS
- Health Check Endpoint: /health
- Down Threshold: 3 consecutive failures
- Up Threshold: 1 successful check
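Expressed as configuration, these are the health-check settings the endpoint's load balancer is set up with (a sketch; the parameter names are illustrative, the values are the defaults listed above):

# Illustrative parameter names; values match the health-check defaults above.
health_check_params = {
  health_check_protocol: "https",
  health_check_endpoint: "/health",
  health_check_down_threshold: 3,
  health_check_up_threshold: 1
}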
When a replica becomes unavailable, it transitions to the unavailable state and creates a page:
Prog::PageNexus.assemble(
"Replica #{replica.ubid} of inference endpoint #{endpoint.name} is unavailable",
["InferenceEndpointReplicaUnavailable", replica.ubid],
replica.ubid,
severity: "warning"
)
Rate Limiting and Quotas
Each endpoint supports per-project rate limits:
- max_requests: Maximum concurrent requests per replica (default: 500)
- max_project_rps: Requests per second per project (default: 100)
- max_project_tps: Tokens per second per project (default: 10,000)
These are enforced by the inference gateway on each replica.
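The limits are plain attributes on the endpoint and can be adjusted with an update, just like replica_count (a sketch; how quickly running gateways pick up the new values depends on how they reload configuration):

# Raise the per-project request and token rates for this endpoint.
inference_endpoint.update(max_project_rps: 200, max_project_tps: 20_000)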
Billing
Usage is tracked separately for input and output tokens:
prompt_billing_resource = "#{model_name}-input"
completion_billing_resource = "#{model_name}-output"
Billing records are updated daily based on token usage collected from the inference gateway:
BillingRecord.create(
project_id: project.id,
resource_id: inference_endpoint.id,
resource_name: "#{model_name}-input 2024-01-15",
billing_rate_id: rate_id,
span: begin_time...end_time,
amount: prompt_tokens
)
Public vs Private Endpoints
Private Endpoints
Default mode. Only accessible with valid API keys from authorized projects.
Public Endpoints
Accessible by any project with valid API keys and payment methods.
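For example, a public endpoint can be created by setting is_public on the same call shown earlier (a sketch based on the predefined-model example above):

endpoint = Prog::Ai::InferenceEndpointNexus.assemble_with_model(
  project_id: project.id,
  location_id: location.id,
  name: "my-model",
  model_id: "llama-3-8b",
  replica_count: 2,
  is_public: true
)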
Public endpoints use a simpler hostname without the UBID suffix:
custom_hostname_prefix = name + (is_public ? "" : "-#{ubid.to_s[-5...]}")
Endpoint States
The endpoint progresses through these states:
| State | Description |
|---|---|
| start | Initial provisioning, creating replicas |
| wait_replicas | Waiting for all replicas to reach wait state |
| wait | Running and serving requests, reconciling every 60 seconds |
| destroy | Cleaning up replicas and resources |
| self_destroy | Final cleanup before deletion |
Endpoints expose a user-facing status through display_state:
- “creating” - During initial setup
- “running” - When in the wait state
- “deleting” - When being destroyed
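A sketch of how that mapping from strand state to display_state might look (illustrative; the actual model method may differ):

# Illustrative mapping, not necessarily the exact model implementation.
def display_state
  return "deleting" if destroy_set? || strand.label == "destroy"
  return "running" if strand.label == "wait"
  "creating"
end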
Custom DNS
Endpoints can use custom DNS zones:
custom_dns_zone = DnsZone.where(
project_id: Config.inference_endpoint_service_project_id,
name: "ai.ubicloud.com"
).first
This creates hostnames like:
- Private: my-model-a1b2c.ai.ubicloud.com
- Public: my-model.ai.ubicloud.com
Maintenance Mode
Put an endpoint in maintenance to suppress alerts:
inference_endpoint.incr_maintenance
# Replica unavailability pages won't be created
inference_endpoint.decr_maintenance
# Normal monitoring resumes