Generate the next message in a chat with a provided model. The endpoint supports streaming responses, and conversation history is carried in the messages array of the request.

Request

Endpoint

POST /api/chat

Request Body

model
string
required
The model name to use for chat completion (e.g., llama3.2, mistral)
messages
array
required
Array of message objects representing the conversation history. Each message has a role (system, user, assistant, or tool) and content
stream
boolean
default:"true"
Enable streaming of response chunks. Set to false to wait for the complete response.
format
string | object
Format to return response in. Use "json" for JSON mode or provide a JSON schema object.
options
object
Model-specific options to customize inference behavior
keep_alive
string | number
default:"5m"
Duration to keep the model loaded in memory (e.g., "5m", "1h", or -1 for indefinite)
tools
array
List of tools the model can use for function calling
think
boolean | string
Enable thinking mode for reasoning models. Accepts true/false, or an effort level of "high", "medium", or "low"
truncate
boolean
default:"true"
Truncate chat history if prompt exceeds context length
shift
boolean
default:"true"
Shift chat history when hitting context length instead of erroring
logprobs
boolean
default:"false"
Return log probabilities for output tokens
top_logprobs
integer
default:"0"
Number of most likely tokens to return at each position (0-20). Requires logprobs: true
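
As a sketch of how these request fields fit together, the following Python builds a non-streaming request body; the model name and option values are illustrative, not recommendations:

```python
import json

# Illustrative body for POST /api/chat. Only "model" and "messages" are
# required; everything else shown here is optional.
payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    "stream": False,                  # wait for one complete JSON response
    "keep_alive": "5m",               # keep the model loaded for 5 minutes
    "options": {"temperature": 0.7},  # model-specific inference options
}

body = json.dumps(payload)
```

The serialized body is what curl sends with -d in the examples below.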

Response

Response Fields

model
string
The model name used for generation
created_at
string
Timestamp of when the response was created (ISO 8601 format)
message
object
The generated message
done
boolean
Whether the response is complete
done_reason
string
Reason for completion: stop, length, load, unload
total_duration
integer
Total time spent generating response (nanoseconds)
load_duration
integer
Time spent loading the model (nanoseconds)
prompt_eval_count
integer
Number of tokens in the prompt
prompt_eval_duration
integer
Time spent evaluating the prompt (nanoseconds)
eval_count
integer
Number of tokens generated
eval_duration
integer
Time spent generating response (nanoseconds)
logprobs
array
Log probability information for each token (when logprobs: true)

Examples

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": false
}'

Example Response

{
  "model": "llama3.2",
  "created_at": "2024-02-24T12:34:56.789Z",
  "message": {
    "role": "assistant",
    "content": "The sky appears blue due to Rayleigh scattering..."
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 5432109876,
  "load_duration": 123456789,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 987654321,
  "eval_count": 298,
  "eval_duration": 4321098765
}
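
Since the duration fields are reported in nanoseconds, generation speed can be derived directly from the counters in the example response above:

```python
# Token throughput from the example response: eval_count tokens generated
# over eval_duration nanoseconds.
eval_count = 298
eval_duration_ns = 4321098765

tokens_per_second = eval_count / (eval_duration_ns / 1e9)
print(round(tokens_per_second, 1))  # ~69.0 tokens/second
```

The same arithmetic with prompt_eval_count and prompt_eval_duration gives the prompt-processing rate.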

Streaming Example

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Tell me a story"
    }
  ]
}'
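
With streaming enabled, each chunk is a JSON object on its own line. Intermediate chunks carry a fragment of the message with done: false; the final chunk sets done: true and includes done_reason and the timing statistics. An abbreviated sketch of the stream (the content values are illustrative):

```json
{"model":"llama3.2","created_at":"2024-02-24T12:34:56.789Z","message":{"role":"assistant","content":"Once"},"done":false}
{"model":"llama3.2","created_at":"2024-02-24T12:34:56.901Z","message":{"role":"assistant","content":" upon"},"done":false}
{"model":"llama3.2","created_at":"2024-02-24T12:35:01.123Z","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":5432109876,"eval_count":298,"eval_duration":4321098765}
```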

Error Responses

error
string
Error message describing what went wrong

Common Errors

  • 400 Bad Request: Invalid request body or model name
  • 404 Not Found: Model not found (pull it first, e.g., with ollama pull llama3.2)
  • 500 Internal Server Error: Server error during generation
When streaming is disabled ("stream": false), the complete response is returned as a single JSON object. With streaming enabled (default), responses are sent as newline-delimited JSON (NDJSON).
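
A minimal consumer for that NDJSON stream can be sketched in Python; the sample string below stands in for lines read from the HTTP response body:

```python
import json

# Stand-in for the newline-delimited JSON lines read from the response body.
ndjson_stream = (
    '{"message":{"role":"assistant","content":"Hello"},"done":false}\n'
    '{"message":{"role":"assistant","content":" world"},"done":false}\n'
    '{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}\n'
)

content = []
done_reason = None
for line in ndjson_stream.splitlines():
    chunk = json.loads(line)
    content.append(chunk["message"]["content"])
    if chunk["done"]:  # final chunk carries done: true plus completion stats
        done_reason = chunk["done_reason"]
        break

print("".join(content))  # Hello world
print(done_reason)       # stop
```

In a real client, the same loop runs over the response body as lines arrive, printing each content fragment as it is decoded.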