
Quickstart Guide

Get started with Ollama and run your first large language model locally in just a few minutes.
This guide assumes you have already installed Ollama. If not, see the Installation Guide first.

Your First Model

Step 1: Run Ollama

The simplest way to get started is to run a model directly:
ollama run gemma3
This command will:
  • Download the Gemma 3 model (if not already downloaded)
  • Start the Ollama server (if not already running)
  • Begin an interactive chat session
The first run downloads the model, which may take a few minutes depending on your internet connection.
Step 2: Chat with the Model

Once the model is loaded, you’ll see a prompt:
>>> Send a message (/? for help)
Try asking a question:
>>> Why is the sky blue?
The model will stream its response in real-time.
Type /? to see all available chat commands, or use /bye (or Ctrl+D) to exit.
Step 3: Try Other Models

Explore different models from the library:
# Run Llama 3.2 (Meta's model)
ollama run llama3.2

# Run Mistral (efficient and fast)
ollama run mistral

# Run Phi-3 (Microsoft's compact model)
ollama run phi3
Browse all available models in the Ollama model library.

Using the REST API

Ollama provides a REST API server that runs on http://localhost:11434 by default.

Generate a Response

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Why is the sky blue?"
}'
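By default, /api/generate streams its reply as newline-delimited JSON objects, each carrying a partial "response" field, with a final object marked "done": true. A minimal sketch of reassembling the streamed text in Python (the helper name join_stream is our own, not part of any Ollama client):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the partial "response" fields of a streamed reply.

    Each line is one JSON object as emitted by /api/generate;
    the final object has "done": true.
    """
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Example with synthetic chunks shaped like the API's streamed output:
chunks = [
    '{"model": "gemma3", "response": "The sky ", "done": false}',
    '{"model": "gemma3", "response": "is blue.", "done": true}',
]
print(join_stream(chunks))  # The sky is blue.
```

Set "stream": false in the request body if you would rather receive one complete JSON object.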

Chat with Context

Maintain conversation history:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": false
}'
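The chat endpoint is stateless: to keep context across turns, resend the full messages array, appending the assistant's previous reply before each new user message. A rough sketch of that bookkeeping (next_messages is our own helper name):

```python
def next_messages(history, assistant_reply, user_msg):
    """Extend a chat history with the last assistant reply and a new user turn."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_msg},
    ]

history = [{"role": "user", "content": "Why is the sky blue?"}]
# After POSTing history to /api/chat, suppose the model answered
# "Rayleigh scattering." -- fold it in before the next question:
history = next_messages(history, "Rayleigh scattering.",
                        "Does the same effect apply on Mars?")
print(len(history))  # 3
```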

Essential Commands

List Downloaded Models

ollama list
Output:
NAME              ID            SIZE      MODIFIED
gemma3:latest     a80c4f17acd5  2.0 GB    2 hours ago
llama3.2:latest   0a8c26691023  4.7 GB    1 day ago
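The NAME column is handy for scripting. For example, to print just the model names, skipping the header row (a sketch using standard awk):

```shell
# Print just the model names, skipping the header row
ollama list | awk 'NR > 1 { print $1 }'

# e.g. remove every downloaded model (destructive!)
# ollama list | awk 'NR > 1 { print $1 }' | xargs -n1 ollama rm
```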

Pull a Model

Download a model without running it:
ollama pull mistral

Remove a Model

Free up disk space:
ollama rm mistral

Check Running Models

See which models are currently loaded in memory:
ollama ps
Output:
NAME              ID            SIZE      PROCESSOR    CONTEXT    UNTIL
gemma3:latest     a80c4f17acd5  2.0 GB    100% GPU     4096       4 minutes from now

Multimodal Models

Some models can process images along with text:
ollama run llava "What's in this image? /path/to/image.jpg"
Or via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_data"]
}'
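Note that the images field takes base64-encoded image bytes, not a file path. A minimal sketch of building such a request body in Python (the helper names encode_image and vision_payload are our own):

```python
import base64
import json

def encode_image(path):
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def vision_payload(prompt, image_path, model="llava"):
    """Assemble a /api/generate request body with one attached image."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [encode_image(image_path)],
    })
```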

Non-Interactive Mode

Run one-off prompts without entering chat mode:
# Single prompt
ollama run gemma3 "Explain quantum computing in simple terms"

# Pipe input
echo "Write a haiku about programming" | ollama run gemma3

# Save output
ollama run gemma3 "Generate a Python function" > output.py

Model Options

Customize model behavior with options:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Tell me a story",
  "options": {
    "temperature": 0.8,
    "top_p": 0.9,
    "seed": 42
  }
}'
  • temperature (0.0-2.0): Controls randomness (default: 0.8)
  • top_p (0.0-1.0): Nucleus sampling threshold (default: 0.9)
  • seed: Set for reproducible outputs
  • num_predict: Maximum tokens to generate
  • stop: Custom stop sequences
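If you assemble options programmatically, it can help to clamp values to the documented ranges before sending the request. A small sketch (sanitize_options is our own helper, not part of any Ollama client):

```python
def sanitize_options(options):
    """Clamp sampling options to their documented ranges."""
    cleaned = dict(options)
    if "temperature" in cleaned:
        cleaned["temperature"] = min(2.0, max(0.0, cleaned["temperature"]))
    if "top_p" in cleaned:
        cleaned["top_p"] = min(1.0, max(0.0, cleaned["top_p"]))
    return cleaned

print(sanitize_options({"temperature": 3.5, "top_p": 0.9, "seed": 42}))
# {'temperature': 2.0, 'top_p': 0.9, 'seed': 42}
```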

Keep Alive

Control how long models stay in memory:
# Keep loaded for 10 minutes
ollama run gemma3 --keepalive 10m

# Unload immediately after use
ollama run gemma3 --keepalive 0

# Keep loaded indefinitely
ollama run gemma3 --keepalive -1
Default is 5 minutes after last use.

Server Management

Start the Server

The server starts automatically when you run a model, but you can start it manually:
ollama serve
The server runs on http://localhost:11434 by default.

Environment Variables

  • OLLAMA_HOST: Change the bind address (default: 127.0.0.1:11434)
  • OLLAMA_MODELS: Custom model storage location
  • OLLAMA_NUM_PARALLEL: Number of parallel requests (default: 1)
  • OLLAMA_MAX_LOADED_MODELS: Max models in memory (default: 1)
  • OLLAMA_DEBUG: Enable debug logging
Example:
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Troubleshooting

Slow or interrupted downloads

Model sizes range from 2 GB to more than 70 GB, so use a wired connection for faster downloads. If a download is interrupted, run the same command again to resume it.

Out of memory

Try a smaller or more heavily quantized model:
  • Use gemma3:1b instead of gemma3:12b
  • Use Q4 quantization: llama3.2:3b-q4_0
  • Reduce the context size with the num_ctx option

Connection refused

The server may not be running:
ollama serve
Also check whether another process is already using port 11434.

GPU not detected

On Linux, ensure NVIDIA drivers are installed. On macOS, Metal should work automatically. Check which processor a run used with:
ollama run gemma3 --verbose

What’s Next?

Create Custom Models

Customize models with system prompts and parameters

API Reference

Complete REST API documentation

Import Models

Import models from PyTorch or Safetensors

CLI Reference

Complete command-line documentation
Join the Discord community for help and to share your projects!
