Request
Endpoint
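A minimal sketch of constructing a request, assuming the standard Ollama endpoint `POST /api/chat` on the default local address `http://localhost:11434` (both inferred from context, not stated in this section):

```python
import json
import urllib.request

# Assumed default for a local Ollama server; adjust host/port as needed.
OLLAMA_URL = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"}
    ],
    "stream": False,  # wait for the complete response
}

# Build (but do not yet send) the POST request.
req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# To send it: response = urllib.request.urlopen(req)
```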
Request Body
- `model`: The model name to use for chat completion (e.g., `llama3.2`, `mistral`)
- `messages`: Array of message objects representing the conversation history
- `stream`: Enable streaming of response chunks. Set to `false` to wait for the complete response.
- `format`: Format to return the response in. Use `"json"` for JSON mode, or provide a JSON schema object.
- `options`: Model-specific options to customize inference behavior
- `keep_alive`: Duration to keep the model loaded in memory (e.g., `"5m"`, `"1h"`, or `-1` for indefinite)
- `tools`: List of tools the model can use for function calling
- `think`: Enable thinking mode for reasoning models. Can be `true`/`false` or `"high"`, `"medium"`, `"low"`
- `truncate`: Truncate chat history if the prompt exceeds the context length
- `shift`: Shift chat history when hitting the context length instead of erroring
- `logprobs`: Return log probabilities for output tokens
- `top_logprobs`: Number of most likely tokens to return at each position (0-20). Requires `logprobs: true`.

Response
Response Fields
- `model`: The model name used for generation
- `created_at`: Timestamp of when the response was created (ISO 8601 format)
- `message`: The generated message
- `done`: Whether the response is complete
- `done_reason`: Reason for completion: `stop`, `length`, `load`, or `unload`
- `total_duration`: Total time spent generating the response (nanoseconds)
- `load_duration`: Time spent loading the model (nanoseconds)
- `prompt_eval_count`: Number of tokens in the prompt
- `prompt_eval_duration`: Time spent evaluating the prompt (nanoseconds)
- `eval_count`: Number of tokens generated
- `eval_duration`: Time spent generating the response (nanoseconds)
- `logprobs`: Log probability information for each token (when `logprobs: true`)

Examples
Example Response
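A representative non-streaming response body using the fields listed above (all values are illustrative, not from a real run):

```json
{
  "model": "llama3.2",
  "created_at": "2024-07-22T20:33:28.123456Z",
  "message": {
    "role": "assistant",
    "content": "The sky appears blue because of Rayleigh scattering."
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 4935886791,
  "load_duration": 534986708,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 107345000,
  "eval_count": 13,
  "eval_duration": 4289432000
}
```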
Streaming Example
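With streaming enabled, each chunk arrives as one JSON object per line: intermediate chunks carry a partial `message` with `done: false`, and the final chunk sets `done: true` along with `done_reason` and the timing fields (illustrative values):

```json
{"model":"llama3.2","created_at":"2024-07-22T20:33:28.5Z","message":{"role":"assistant","content":"The"},"done":false}
{"model":"llama3.2","created_at":"2024-07-22T20:33:28.6Z","message":{"role":"assistant","content":" sky"},"done":false}
{"model":"llama3.2","created_at":"2024-07-22T20:33:29.0Z","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","eval_count":13}
```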
Error Responses
- `error`: Error message describing what went wrong
Common Errors
- 400 Bad Request: Invalid request body or model name
- 404 Not Found: Model not found (need to pull it first)
- 500 Internal Server Error: Server error during generation
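A small client-side helper, as a sketch, for surfacing these errors: the status-to-meaning mapping follows the list above, and the `error` body field comes from the error-response description. The function name and shape are illustrative, not part of any API:

```python
import json

def describe_error(status: int, body: str) -> str:
    """Turn an HTTP error status plus an Ollama-style error body into a message."""
    hints = {
        400: "Invalid request body or model name",
        404: "Model not found (pull it first, e.g. `ollama pull <model>`)",
        500: "Server error during generation",
    }
    try:
        # Error responses carry a JSON body with an "error" field.
        detail = json.loads(body).get("error", body)
    except json.JSONDecodeError:
        detail = body  # fall back to the raw body if it is not JSON
    hint = hints.get(status, "Unexpected error")
    return f"{status}: {hint} - {detail}"
```

For example, `describe_error(404, '{"error": "model not found"}')` folds the status hint and the server's message into one string.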
When streaming is disabled (`"stream": false`), the complete response is returned as a single JSON object. With streaming enabled (the default), responses are sent as newline-delimited JSON (NDJSON).
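Assembling a complete reply from the NDJSON stream can be sketched as follows; the chunk shape (partial `message.content` per line, `done` flag on the last) follows the streaming format described above, and the sample lines are illustrative:

```python
import json

def collect_stream(lines):
    """Concatenate message.content across NDJSON chunks until done is true."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip any blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative chunks mirroring the streaming wire format:
sample = [
    '{"message":{"role":"assistant","content":"Hello"},"done":false}',
    '{"message":{"role":"assistant","content":", world"},"done":false}',
    '{"message":{"role":"assistant","content":"!"},"done":true}',
]
```

In a real client, `lines` would be the response body iterated line by line rather than a hardcoded list.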