
Overview

Ubicloud’s AI Inference Endpoints allow you to deploy machine learning models as scalable API endpoints. Each endpoint runs your chosen model with automatic load balancing, health monitoring, and usage tracking for billing.

Architecture

An inference endpoint consists of:
  • Replicas: VM instances running your model with the inference engine
  • Load Balancer: Distributes requests across healthy replicas with SSL termination
  • Private Subnet: Isolated network environment for the endpoint
  • Inference Gateway: Handles authentication, rate limiting, and usage tracking
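For orientation, here is a hedged sketch of reaching these components from an endpoint record; the association and column names (replicas, load_balancer, private_subnet) are assumptions about the model layer and may differ from the actual schema:
endpoint = InferenceEndpoint[name: "my-llm-endpoint"]

endpoint.replicas.map(&:ubid)      # VM-backed replicas serving the model
endpoint.load_balancer.hostname    # hostname with SSL termination
endpoint.private_subnet.net4       # isolated IPv4 range for the endpoint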

Creating an Endpoint

Using Predefined Models

Create an endpoint from Ubicloud’s model catalog:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble_with_model(
  project_id: project.id,
  location_id: location.id,
  name: "my-llm-endpoint",
  model_id: "llama-3-8b",
  replica_count: 2,
  is_public: false
)

Custom Configuration

For custom models, specify all parameters:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble(
  project_id: project.id,
  location_id: location.id,
  name: "custom-model",
  boot_image: "ai-ubuntu-2204-vllm",
  vm_size: "standard-gpu-8",
  model_name: "meta-llama/Meta-Llama-3-8B",
  engine: "vllm",
  engine_params: "--max-model-len 4096 --gpu-memory-utilization 0.9",
  storage_volumes: [{size_gib: 100, encrypted: true}],
  replica_count: 1,
  is_public: false,
  gpu_count: 1,
  max_requests: 500,
  max_project_rps: 100,
  max_project_tps: 10000
)
Endpoints are created in the service project identified by Config.inference_endpoint_service_project_id and use a dedicated firewall named inference-endpoint-firewall.
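As a rough illustration, the service project and firewall could be looked up like this; the Firewall query shape is an assumption, not the actual implementation:
service_project = Project[Config.inference_endpoint_service_project_id]
firewall = Firewall.where(
  project_id: service_project.id,
  name: "inference-endpoint-firewall"
).first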

Inference Engines

vLLM

The default engine for serving LLMs with optimized inference:
engine: "vllm"
engine_params: "--max-model-len 4096 --tensor-parallel-size 2"
For CPU-only inference, the replica automatically uses the vllm-cpu environment:
/opt/miniconda/envs/vllm-cpu/bin/vllm serve /ie/models/model \
  --served-model-name meta-llama/Meta-Llama-3-8B \
  --disable-log-requests --host 127.0.0.1 \
  --max-model-len 4096

RunPod Integration

For external GPU providers:
engine: "runpod"
external_config: {
  "data_center": "EU-RO-1",
  "gpu_count": 1,
  "gpu_type": "NVIDIA A100 80GB PCIe",
  "disk_gib": 50,
  "min_vcpu_count": 8,
  "min_memory_gib": 32,
  "image_name": "runpod/pytorch:2.1.0-py3.10-cuda12.1.0-devel",
  "model_name_hf": "meta-llama/Meta-Llama-3-8B"
}

Making Requests

Send chat completion requests to your endpoint:
response = inference_endpoint.chat_completion_request(
  "What is cloud computing?",
  hostname,
  api_key
)
The endpoint follows the OpenAI-compatible API format:
curl -X POST https://your-endpoint.ai.ubicloud.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
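Because the format is OpenAI-compatible, any OpenAI client library also works. As a minimal sketch, the same request in Ruby with the standard library (the hostname and API key are placeholders):
require "net/http"
require "json"

uri = URI("https://your-endpoint.ai.ubicloud.com/v1/chat/completions")
request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json",
  "Authorization" => "Bearer YOUR_API_KEY")
request.body = {
  model: "meta-llama/Meta-Llama-3-8B",
  messages: [{role: "user", content: "Hello!"}]
}.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts JSON.parse(response.body).dig("choices", 0, "message", "content")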

Scaling and Replicas

Automatic Reconciliation

The endpoint automatically maintains the desired replica count:
actual_replica_count = replicas.count { !(it.destroy_set? || it.strand.label == "destroy") }
desired_replica_count = inference_endpoint.replica_count

if actual_replica_count < desired_replica_count
  # Add new replicas
  (desired_replica_count - actual_replica_count).times do
    Prog::Ai::InferenceEndpointReplicaNexus.assemble(inference_endpoint.id)
  end
elsif actual_replica_count > desired_replica_count
  # Remove excess replicas (prioritize running ones)
  victims.each(&:incr_destroy)
end

Updating Replica Count

inference_endpoint.update(replica_count: 3)
# Replicas will be automatically added or removed during the next reconciliation
The scaling flow is:
  1. Update replica count: Set the desired replica_count on the inference endpoint
  2. Reconciliation: The nexus program reconciles every 60 seconds in the wait state
  3. Replica provisioning: New replicas are created or excess replicas are destroyed

Health Monitoring

Each replica’s health is monitored through the load balancer (a configuration sketch follows this list):
  • Health Check Protocol: HTTPS
  • Health Check Endpoint: /health
  • Down Threshold: 3 consecutive failures
  • Up Threshold: 1 successful check
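A hedged sketch of how these thresholds might be passed to the load balancer when the endpoint is assembled; the exact Prog::Vnet::LoadBalancerNexus.assemble keyword names and port numbers below are assumptions:
Prog::Vnet::LoadBalancerNexus.assemble(
  private_subnet.id,
  name: endpoint.name,
  src_port: 443,
  dst_port: 8443,
  health_check_endpoint: "/health",
  health_check_protocol: "https",
  health_check_down_threshold: 3,
  health_check_up_threshold: 1
)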
When a replica becomes unavailable, it transitions to the unavailable state and creates a page:
Prog::PageNexus.assemble(
  "Replica #{replica.ubid} of inference endpoint #{endpoint.name} is unavailable",
  ["InferenceEndpointReplicaUnavailable", replica.ubid],
  replica.ubid,
  severity: "warning"
)

Rate Limiting and Quotas

Each endpoint supports per-project rate limits:
  • max_requests: Maximum concurrent requests per replica (default: 500)
  • max_project_rps: Requests per second per project (default: 100)
  • max_project_tps: Tokens per second per project (default: 10,000)
These are enforced by the inference gateway on each replica.
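Assuming these limits are columns on the endpoint record, as the assemble parameters above suggest, they can be adjusted with a plain update (values are illustrative):
inference_endpoint.update(
  max_requests: 1000,
  max_project_rps: 200,
  max_project_tps: 20_000
)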

Billing

Usage is tracked separately for input and output tokens:
prompt_billing_resource = "#{model_name}-input"
completion_billing_resource = "#{model_name}-output"
Billing records are updated daily based on token usage collected from the inference gateway:
BillingRecord.create(
  project_id: project.id,
  resource_id: inference_endpoint.id,
  resource_name: "#{model_name}-input 2024-01-15",
  billing_rate_id: rate_id,
  span: begin_time...end_time,
  amount: prompt_tokens
)

Public vs Private Endpoints

Private Endpoints

Default mode. Only accessible with valid API keys from authorized projects:
is_public: false

Public Endpoints

Accessible by any project with valid API keys and payment methods:
is_public: true
Public endpoints use a simpler hostname without the UBID suffix:
custom_hostname_prefix = name + (is_public ? "" : "-#{ubid.to_s[-5...]}")

Endpoint States

The endpoint progresses through these states:
State            Description
start            Initial provisioning, creating replicas
wait_replicas    Waiting for all replicas to reach the wait state
wait             Running and serving requests, reconciling every 60 seconds
destroy          Cleaning up replicas and resources
self_destroy     Final cleanup before deletion
Endpoints expose a user-facing status through display_state:
  • “creating” - During initial setup
  • “running” - When in wait state
  • “deleting” - When being destroyed
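A minimal sketch of how such a mapping could look, reusing the state checks shown earlier; this is an illustration, not necessarily the actual method body:
def display_state
  return "deleting" if destroy_set? || strand.label == "destroy"
  return "running" if strand.label == "wait"
  "creating"
end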

Custom DNS

Endpoints can use custom DNS zones:
custom_dns_zone = DnsZone.where(
  project_id: Config.inference_endpoint_service_project_id,
  name: "ai.ubicloud.com"
).first
This creates hostnames like:
  • Private: my-model-a1b2c.ai.ubicloud.com
  • Public: my-model.ai.ubicloud.com
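For illustration, registering the hostname in the custom zone might look like the sketch below; insert_record and its arguments are assumptions about the DnsZone model, and load_balancer_ip is a placeholder for the endpoint’s public address:
custom_dns_zone&.insert_record(
  record_name: "#{custom_hostname_prefix}.#{custom_dns_zone.name}.",
  type: "A",
  ttl: 3600,
  data: load_balancer_ip
)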

Maintenance Mode

Put an endpoint in maintenance to suppress alerts:
inference_endpoint.incr_maintenance
# Replica unavailability pages won't be created

inference_endpoint.decr_maintenance
# Normal monitoring resumes
