Deploy Models with Azure Machine Learning
After training a machine learning model, deploy it to production for inference using Azure Machine Learning endpoints. Endpoints support both real-time predictions and batch processing at scale.
Azure ML provides managed endpoints with automatic scaling, monitoring, and security, with no infrastructure management required.
Inference and Endpoints
Inference is the process of applying new input data to a machine learning model to generate outputs (predictions, classifications, clusters, etc.).
An endpoint is a stable, durable URL that can be used to request predictions from your model.
Online Endpoints: Real-time inference with low latency
Batch Endpoints: Asynchronous processing of large datasets
Endpoint Anatomy
Endpoint
Provides:
Stable URL: e.g., https://my-endpoint.eastus.inference.ml.azure.com
Authentication: Key-based or Microsoft Entra ID
Authorization: Role-based access control
Deployment
Contains:
Model: Trained model files
Code: Scoring script (optional for MLflow models)
Environment: Software dependencies
Compute: Resources to run inference
One endpoint can contain multiple deployments, enabling A/B testing and safe rollouts.
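The traffic split across deployments can be pictured as weighted routing. A minimal sketch, simulating the idea with a weighted random choice (an illustration only, not Azure's actual router):

```python
import random

def route_request(traffic: dict, rng: random.Random) -> str:
    """Pick a deployment name with probability proportional to its traffic weight."""
    names = list(traffic)
    weights = [traffic[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
traffic = {"blue": 90, "green": 10}
counts = {"blue": 0, "green": 0}
for _ in range(10_000):
    counts[route_request(traffic, rng)] += 1
# Roughly 90% of requests land on "blue"
print(counts)
```

With a 90/10 split, about nine in ten requests hit the "blue" deployment, which is what makes gradual rollouts safe: most users stay on the proven version while the new one takes a small slice.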
Deployment Types
Managed Online Endpoints
Batch Endpoints
Kubernetes Endpoints
Best for: Real-time, low-latency inference
Features:
Fully managed compute and scaling
Built-in monitoring and logging
Traffic splitting for A/B testing
Zero-downtime updates
Cost tracking per deployment
Use when:
Response time is critical (<1 second)
Request-response pattern
Small payloads (fits in HTTP request)
Need to scale based on traffic
from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="my-endpoint",
    description="Production inference endpoint",
    auth_mode="key"
)
Best for: Long-running, large-scale inference
Features:
Process files from Azure Storage
Parallel processing across compute nodes
Scheduled or on-demand execution
No compute cost when idle
Deploy models or pipelines
Use when:
Processing large files or datasets
Can tolerate longer processing times
Data stored in cloud storage
Want cost optimization
from azure.ai.ml.entities import BatchEndpoint

endpoint = BatchEndpoint(
    name="my-batch-endpoint",
    description="Batch scoring endpoint"
)
Best for: On-premises or edge deployment
Features:
Deploy anywhere (cloud, edge, on-prem)
Full infrastructure control
Custom networking
GPU support
Use when:
Need on-premises deployment
Edge computing scenarios
Existing Kubernetes infrastructure
Custom networking requirements
Quick Start: Deploy a Model
1. Register Your Model
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace>"
)

# Register model
model = Model(
    path="./model",
    name="sklearn-classifier",
    description="Iris classification model",
    type="mlflow_model"  # or "custom_model"
)
registered_model = ml_client.models.create_or_update(model)
print(f"Registered: {registered_model.name} v{registered_model.version}")
2. Create Endpoint
from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="iris-classifier-endpoint",
    description="Iris species classification",
    auth_mode="key",
    tags={"environment": "production", "team": "ml-ops"}
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(f"Endpoint created: {endpoint.name}")
3. Deploy Model
from azure.ai.ml.entities import ManagedOnlineDeployment

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    model=registered_model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    request_settings={
        "request_timeout_ms": 90000,
        "max_concurrent_requests_per_instance": 1
    },
    liveness_probe={
        "initial_delay": 10,
        "period": 10,
        "timeout": 2,
        "failure_threshold": 3
    }
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route 100% of traffic to the deployment
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
Create score.py:

import json
import os

import joblib
import numpy as np

def init():
    global model
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"),
        "model.pkl"
    )
    model = joblib.load(model_path)

def run(raw_data):
    try:
        data = json.loads(raw_data)["data"]
        data = np.array(data)
        predictions = model.predict(data)
        return predictions.tolist()
    except Exception as e:
        return json.dumps({"error": str(e)})
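Before deploying, it helps to smoke-test the run() contract locally: JSON string in, list of predictions out. A sketch using a stub model in place of the real model.pkl (the StubModel class is a placeholder, not part of the deployed artifact):

```python
import json

class StubModel:
    """Stand-in for the trained classifier; predicts class 0 for every row."""
    def predict(self, data):
        return [0 for _ in data]

model = StubModel()

def run(raw_data):
    # Same contract as score.py's run(): JSON payload in, list of predictions out
    try:
        data = json.loads(raw_data)["data"]
        return model.predict(data)
    except Exception as e:
        return json.dumps({"error": str(e)})

payload = json.dumps({"data": [[5.1, 3.5, 1.4, 0.2]]})
print(run(payload))        # → [0]
print(run("not json"))     # error path returns a JSON error string
```

Catching exceptions and testing both paths locally catches malformed-payload bugs before they show up as opaque HTTP 500s from the endpoint.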
Deploy with the scoring script:

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    CodeConfiguration
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    model=registered_model,
    code_configuration=CodeConfiguration(
        code="./src",
        scoring_script="score.py"
    ),
    environment="azureml://registries/azureml/environments/sklearn-1.5/versions/1",
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
4. Test the Deployment
# Test with sample data
import json

sample_data = {
    "data": [
        [5.1, 3.5, 1.4, 0.2],
        [6.2, 2.9, 4.3, 1.3]
    ]
}

# Write the request body to a file, then invoke the endpoint
with open("sample_request.json", "w") as f:
    json.dump(sample_data, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="sample_request.json",
    deployment_name="blue"  # Optional: test a specific deployment
)
print(f"Predictions: {response}")
Deployment Patterns
Blue-Green Deployment
Switch traffic between two deployments instantly:
# Deploy the new version to "green"
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="iris-classifier-endpoint",
    model=new_model_version,
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

# Test the green deployment
response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="test.json",
    deployment_name="green"
)

# Switch all traffic to green
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Delete the old blue deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="iris-classifier-endpoint"
).result()
Canary Deployment
Gradually shift traffic to test new version:
# Start with 10% of traffic on the new version
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor metrics, then increase
endpoint.traffic = {"blue": 50, "green": 50}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Complete the rollout
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
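The gradual shift above can be expressed as a small helper that yields each traffic configuration in turn. A sketch (the 10/50/100 schedule is just this example's choice; pick steps that match your risk tolerance):

```python
def canary_steps(old: str, new: str, steps=(10, 50, 100)):
    """Yield traffic dicts that move traffic from `old` to `new` in stages."""
    for pct in steps:
        yield {old: 100 - pct, new: pct}

for traffic in canary_steps("blue", "green"):
    print(traffic)
    # In real use: endpoint.traffic = traffic, update the endpoint,
    # then monitor error rates and latency before taking the next step
```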
A/B Testing
Compare two model versions in production:
# Split traffic evenly
endpoint.traffic = {
    "model-v1": 50,
    "model-v2": 50
}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor business metrics to choose a winner,
# then route 100% of traffic to the better-performing model
Autoscaling
Configure automatic scaling based on metrics:
from azure.ai.ml.entities import OnlineScaleSettings

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    scale_settings=OnlineScaleSettings(
        scale_type="TargetUtilization",
        min_instances=1,
        max_instances=10,
        target_utilization_percentage=70,
        polling_interval=10
    )
)
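Target-utilization scaling sizes the deployment so average utilization approaches the target percentage. A back-of-envelope sketch of that rule (an assumed simplification for intuition, not the service's exact algorithm):

```python
import math

def desired_instances(current: int, utilization_pct: float,
                      target_pct: float, min_i: int, max_i: int) -> int:
    """Scale so utilization moves toward the target, clamped to [min_i, max_i]."""
    desired = math.ceil(current * utilization_pct / target_pct)
    return max(min_i, min(max_i, desired))

print(desired_instances(4, 90, 70, 1, 10))  # overloaded → scale out to 6
print(desired_instances(4, 20, 70, 1, 10))  # underused → scale in to 2
```

The clamp to min_instances/max_instances is what keeps a traffic spike from scaling you into an unexpected bill, and a quiet period from scaling below a safe floor.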
Resource Limits
Control compute resources:
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=2,
    request_settings={
        "request_timeout_ms": 90000,
        "max_concurrent_requests_per_instance": 5,
        "max_queue_wait_ms": 60000
    }
)
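These settings bound throughput: at most max_concurrent_requests_per_instance × instance_count requests run at once, and anything beyond that queues up to max_queue_wait_ms. A rough capacity estimate (a sketch; real throughput depends on your model's actual latency):

```python
def max_throughput_rps(instance_count: int, concurrent_per_instance: int,
                       avg_latency_s: float) -> float:
    """Upper bound on requests/second: slots divided by time each request holds a slot."""
    return instance_count * concurrent_per_instance / avg_latency_s

# 2 instances x 5 concurrent requests, 200 ms average latency
print(max_throughput_rps(2, 5, 0.2))
```

If sustained traffic exceeds this bound, requests queue and eventually time out; that is the signal to raise instance_count or concurrency.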
Monitoring Deployments
View Metrics in Azure Portal
Key metrics to monitor:
Request latency (P50, P95, P99)
Requests per second
HTTP status codes
CPU/GPU utilization
Memory usage
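P50/P95/P99 summarize the latency distribution; the tail percentiles expose slow requests that an average hides. A quick way to compute them from raw samples, using a simple nearest-rank method (the sample values below are illustrative):

```python
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 200, 16, 14, 15, 13, 500]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Here P50 stays low while P95/P99 reveal the outliers, which is why alerting on tail latency catches problems that mean latency misses.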
Query Logs
# Get deployment logs
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    lines=100
)
print(logs)
Application Insights Integration
from applicationinsights import TelemetryClient

tc = TelemetryClient('<instrumentation-key>')

# Log custom events
tc.track_event('PredictionMade', {
    'model_version': '1.2.0',
    'latency_ms': 45
})
tc.flush()
Security
Authentication
Key-Based
Microsoft Entra ID
# Get endpoint keys
keys = ml_client.online_endpoints.get_keys(
    name="iris-classifier-endpoint"
)

# Make an authenticated request
import requests

headers = {
    "Authorization": f"Bearer {keys.primary_key}",
    "Content-Type": "application/json"
}
response = requests.post(
    endpoint.scoring_uri,
    headers=headers,
    json=sample_data
)
from azure.identity import DefaultAzureCredential
import requests

credential = DefaultAzureCredential()
token = credential.get_token(
    "https://ml.azure.com/.default"
)
headers = {
    "Authorization": f"Bearer {token.token}",
    "Content-Type": "application/json"
}
response = requests.post(
    endpoint.scoring_uri,
    headers=headers,
    json=sample_data
)
Network Security
Deploy with private networking:
endpoint = ManagedOnlineEndpoint(
    name="secure-endpoint",
    public_network_access="disabled",
    identity={
        "type": "SystemAssigned"
    }
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
Cost Optimization
Start with smaller instances and scale up:

Instance Type      vCPUs   RAM     Cost (Relative)
Standard_DS2_v2    2       7 GB    1x
Standard_DS3_v2    4       14 GB   2x
Standard_DS4_v2    8       28 GB   4x
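The relative multipliers make sizing trade-offs easy to compare: for example, four DS2_v2 instances and two DS3_v2 instances cost the same and provide the same total vCPUs. A sketch using the relative multipliers from the table (not real prices):

```python
# Relative cost multipliers from the instance table above (not real prices)
RELATIVE_COST = {"Standard_DS2_v2": 1, "Standard_DS3_v2": 2, "Standard_DS4_v2": 4}

def relative_cost(instance_type: str, instance_count: int) -> int:
    """Total relative cost of a deployment configuration."""
    return RELATIVE_COST[instance_type] * instance_count

print(relative_cost("Standard_DS2_v2", 4))  # → 4
print(relative_cost("Standard_DS3_v2", 2))  # → 4
```

More small instances spread load and tolerate single-node failures better; fewer large instances suit models that need more RAM per request.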
Scale to zero during low-traffic periods:

scale_settings = OnlineScaleSettings(
    min_instances=0,  # Scale to zero when idle
    max_instances=10
)
Batch for Bulk Processing
Use batch endpoints for large datasets; you only pay during job execution:

from azure.ai.ml.entities import BatchDeployment

batch_deployment = BatchDeployment(
    name="default",
    endpoint_name="batch-endpoint",
    model=model,
    compute="batch-cluster",
    instance_count=5  # Parallel processing
)
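Batch scoring fans input files out across the instances. A sketch of round-robin partitioning, roughly how instance_count = 5 parallelizes a file list (the service's actual scheduling and mini-batching may differ):

```python
def partition_files(files: list, instance_count: int) -> list:
    """Assign files round-robin across instances for parallel scoring."""
    buckets = [[] for _ in range(instance_count)]
    for i, f in enumerate(files):
        buckets[i % instance_count].append(f)
    return buckets

files = [f"data/part-{i:03d}.csv" for i in range(7)]
for node, bucket in enumerate(partition_files(files, 3)):
    print(f"node {node}: {bucket}")
```

Because work divides across nodes, doubling instance_count roughly halves wall-clock time for large jobs while total compute cost stays about the same.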
Monitor Cost per Deployment
Track spending in Azure Cost Management:
Filter by deployment tags
Set budget alerts
Analyze cost trends
Troubleshooting
Deployment fails to start? Check:
Model files are valid
Scoring script has no syntax errors
Environment dependencies are correct
Sufficient quota for the instance type
View deployment logs:

logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="my-endpoint",
    lines=500
)
High latency? Solutions:
Use GPU instances for deep learning models
Optimize the model (quantization, pruning)
Increase concurrent requests per instance
Enable request batching
Use model caching

Out-of-memory errors? Solutions:
Switch to a larger instance type
Reduce batch size in the scoring script
Optimize model memory usage
Use model compression techniques
Next Steps
Online Endpoints: Learn more about real-time inference
Batch Scoring: Deploy models for batch processing
Monitor Deployments: Track performance and costs
MLOps: Automate deployment pipelines