Overview
Ubicloud’s AI Inference Endpoints let you deploy machine learning models as scalable API endpoints. Each endpoint runs your chosen model with automatic load balancing, health monitoring, and usage tracking for billing.
Architecture
An inference endpoint consists of:
- Replicas: VM instances running your model with the inference engine
- Load Balancer: Distributes requests across healthy replicas with SSL termination
- Private Subnet: Isolated network environment for the endpoint
- Inference Gateway: Handles authentication, rate limiting, and usage tracking
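As a rough sketch of how these pieces relate on the endpoint record (the association names below are assumptions for illustration, not necessarily the exact model API):

endpoint = InferenceEndpoint[name: "my-llm-endpoint"]  # look up an endpoint (illustrative)
endpoint.replicas.count   # VM replicas running the inference engine
endpoint.load_balancer    # distributes requests and terminates SSL
endpoint.private_subnet   # isolated network environment for the endpoint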
Creating an Endpoint
Using Predefined Models
Create an endpoint from Ubicloud’s model catalog:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble_with_model(
project_id: project.id,
location_id: location.id,
name: "my-llm-endpoint",
model_id: "llama-3-8b",
replica_count: 2,
is_public: false
)
Custom Configuration
For custom models, specify all parameters:
endpoint = Prog::Ai::InferenceEndpointNexus.assemble(
project_id: project.id,
location_id: location.id,
name: "custom-model",
boot_image: "ai-ubuntu-2204-vllm",
vm_size: "standard-gpu-8",
model_name: "meta-llama/Meta-Llama-3-8B",
engine: "vllm",
engine_params: "--max-model-len 4096 --gpu-memory-utilization 0.9",
storage_volumes: [{size_gib: 100, encrypted: true}],
replica_count: 1,
is_public: false,
gpu_count: 1,
max_requests: 500,
max_project_rps: 100,
max_project_tps: 10000
)
Endpoints are created in the project identified by Config.inference_endpoint_service_project_id and use a dedicated firewall named inference-endpoint-firewall.
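As a sketch, the service project and its firewall can be located through the usual model lookups (assuming standard Project and Firewall queries):

# Sketch: locate the shared service project that hosts inference endpoints.
service_project = Project[Config.inference_endpoint_service_project_id]
firewall = Firewall.where(project_id: service_project.id, name: "inference-endpoint-firewall").first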
Inference Engines
vLLM
The default engine for serving LLMs with optimized inference:
engine: "vllm"
engine_params: "--max-model-len 4096 --tensor-parallel-size 2"
For CPU-only inference, vLLM automatically uses the vllm-cpu environment:
/opt/miniconda/envs/vllm-cpu/bin/vllm serve /ie/models/model \
--served-model-name meta-llama/Meta-Llama-3-8B \
--disable-log-requests --host 127.0.0.1 \
--max-model-len 4096
RunPod Integration
For external GPU providers:
engine: "runpod"
external_config: {
"data_center": "EU-RO-1",
"gpu_count": 1,
"gpu_type": "NVIDIA A100 80GB PCIe",
"disk_gib": 50,
"min_vcpu_count": 8,
"min_memory_gib": 32,
"image_name": "runpod/pytorch:2.1.0-py3.10-cuda12.1.0-devel",
"model_name_hf": "meta-llama/Meta-Llama-3-8B"
}
Making Requests
Send chat completion requests to your endpoint:
response = inference_endpoint.chat_completion_request(
"What is cloud computing?",
hostname,
api_key
)
The endpoint follows the OpenAI-compatible API format:
curl -X POST https://your-endpoint.ai.ubicloud.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
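The same request can be made from Ruby with only the standard library (a sketch; substitute your endpoint hostname, model name, and API key):

require "net/http"
require "json"

uri = URI("https://your-endpoint.ai.ubicloud.com/v1/chat/completions")
request = Net::HTTP::Post.new(uri)
request["Content-Type"] = "application/json"
request["Authorization"] = "Bearer YOUR_API_KEY"
request.body = {
  model: "meta-llama/Meta-Llama-3-8B",
  messages: [{role: "user", content: "Hello!"}]
}.to_json

# Send the request over TLS and print the assistant's reply.
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts JSON.parse(response.body).dig("choices", 0, "message", "content")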
Scaling and Replicas
Automatic Reconciliation
The endpoint automatically maintains the desired replica count:
actual_replica_count = replicas.count { !(it.destroy_set? || it.strand.label == "destroy") }
desired_replica_count = inference_endpoint.replica_count
if actual_replica_count < desired_replica_count
# Add new replicas
(desired_replica_count - actual_replica_count).times do
Prog::Ai::InferenceEndpointReplicaNexus.assemble(inference_endpoint.id)
end
elsif actual_replica_count > desired_replica_count
# Remove excess replicas (prioritize running ones)
victims.each(&:incr_destroy)
end
Updating Replica Count
inference_endpoint.update(replica_count: 3)
# Replicas will be automatically added or removed during the next reconciliation
| Step | Description |
|---|---|
| Update replica count | Set the desired replica_count on the inference endpoint |
| Reconciliation | The nexus program reconciles every 60 seconds in the wait state |
| Replica provisioning | New replicas are created or excess replicas are destroyed |
Health Monitoring
Each replica monitors its health through the load balancer:
- Health Check Protocol: HTTPS
- Health Check Endpoint: /health
- Down Threshold: 3 consecutive failures
- Up Threshold: 1 successful check
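Expressed as configuration, these are the health-check settings the endpoint's load balancer is set up with (a sketch; the parameter names are illustrative, the values are the defaults listed above):

# Illustrative parameter names; values match the health-check defaults above.
health_check_params = {
  health_check_protocol: "https",
  health_check_endpoint: "/health",
  health_check_down_threshold: 3,
  health_check_up_threshold: 1
}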
When a replica becomes unavailable, it transitions to the unavailable state and creates a page:
Prog::PageNexus.assemble(
"Replica #{replica.ubid} of inference endpoint #{endpoint.name} is unavailable",
["InferenceEndpointReplicaUnavailable", replica.ubid],
replica.ubid,
severity: "warning"
)
Rate Limiting and Quotas
Each endpoint supports per-project rate limits:
- max_requests: Maximum concurrent requests per replica (default: 500)
- max_project_rps: Requests per second per project (default: 100)
- max_project_tps: Tokens per second per project (default: 10,000)
These are enforced by the inference gateway on each replica.
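The limits are plain attributes on the endpoint and can be adjusted with an update, just like replica_count (a sketch; how quickly running gateways pick up the new values depends on how they reload configuration):

# Raise the per-project request and token rates for this endpoint.
inference_endpoint.update(max_project_rps: 200, max_project_tps: 20_000)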
Billing
Usage is tracked separately for input and output tokens:
prompt_billing_resource = "#{model_name}-input"
completion_billing_resource = "#{model_name}-output"
Billing records are updated daily based on token usage collected from the inference gateway:
BillingRecord.create(
project_id: project.id,
resource_id: inference_endpoint.id,
resource_name: "#{model_name}-input 2024-01-15",
billing_rate_id: rate_id,
span: begin_time...end_time,
amount: prompt_tokens
)
Public vs Private Endpoints
Private Endpoints
Default mode. Only accessible with valid API keys from authorized projects.
Public Endpoints
Accessible by any project with valid API keys and payment methods.
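For example, a public endpoint can be created by setting is_public on the same call shown earlier (a sketch based on the predefined-model example above):

endpoint = Prog::Ai::InferenceEndpointNexus.assemble_with_model(
  project_id: project.id,
  location_id: location.id,
  name: "my-model",
  model_id: "llama-3-8b",
  replica_count: 2,
  is_public: true
)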
Public endpoints use a simpler hostname without the UBID suffix:
custom_hostname_prefix = name + (is_public ? "" : "-#{ubid.to_s[-5...]}")
Endpoint States
The endpoint progresses through these states:
| State | Description |
|---|---|
| start | Initial provisioning, creating replicas |
| wait_replicas | Waiting for all replicas to reach wait state |
| wait | Running and serving requests, reconciling every 60 seconds |
| destroy | Cleaning up replicas and resources |
| self_destroy | Final cleanup before deletion |
Endpoints expose a user-facing status through display_state:
- “creating” - During initial setup
- “running” - When in the wait state
- “deleting” - When being destroyed
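A sketch of how that mapping from strand state to display_state might look (illustrative; the actual model method may differ):

# Illustrative mapping, not necessarily the exact model implementation.
def display_state
  return "deleting" if destroy_set? || strand.label == "destroy"
  return "running" if strand.label == "wait"
  "creating"
end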
Custom DNS
Endpoints can use custom DNS zones:
custom_dns_zone = DnsZone.where(
project_id: Config.inference_endpoint_service_project_id,
name: "ai.ubicloud.com"
).first
This creates hostnames like:
- Private: my-model-a1b2c.ai.ubicloud.com
- Public: my-model.ai.ubicloud.com
Maintenance Mode
Put an endpoint in maintenance to suppress alerts:
inference_endpoint.incr_maintenance
# Replica unavailability pages won't be created
inference_endpoint.decr_maintenance
# Normal monitoring resumes