ggml is under active development. Some development happens directly in the
llama.cpp and
whisper.cpp repositories before
being merged back here.
## Key features

- **Zero dependencies**: the core library has no third-party dependencies, making it straightforward to embed in any project.
- **Integer quantization**: supports integer quantization (Q2–Q8) to reduce memory usage and accelerate inference on constrained hardware.
- **Automatic differentiation**: supports forward and backward passes for training, with AdamW and SGD optimizers built in via `ggml-opt.h`.
- **Multi-backend**: computation can be dispatched to CPU, CUDA, Metal, Vulkan, HIP, SYCL, and more via a unified backend scheduler.
- **Zero runtime allocations**: memory is allocated once at initialization. No allocations occur during the compute loop, giving predictable latency.
- **GGUF format**: first-class support for the GGUF file format used to distribute quantized model weights.
## Architecture overview

ggml computation follows a four-stage pipeline:

1. **Contexts**: a `ggml_context` owns a fixed-size memory buffer. All tensors and graph metadata are allocated from this buffer via `ggml_init()`.
2. **Tensors**: tensors are allocated inside a context with functions like `ggml_new_tensor_2d()`. They hold shape, data type, stride, and a pointer into the context's buffer. Up to 4 dimensions are supported.
3. **Computation graphs**: tensor operations (e.g. `ggml_mul_mat`, `ggml_add`) do not execute immediately. Instead, each call appends a node to a `ggml_cgraph` that records the operation and its inputs. The graph is finalized with `ggml_build_forward_expand()`.
4. **Backends**: execution is triggered by dispatching the graph to a backend. The legacy path calls `ggml_graph_compute_with_ctx()` directly on the CPU. The modern path uses `ggml_backend_sched` to route operations across multiple devices automatically.
## Next steps

- **Quickstart**: build ggml from source and run your first computation.
