ggml is under active development. Some development happens directly in the
llama.cpp and
whisper.cpp repositories before
being merged back here.
## Key features

- **Zero dependencies**: the core library has no third-party dependencies, making it straightforward to embed in any project.
- **Integer quantization**: supports integer quantization (Q2–Q8) to reduce memory usage and accelerate inference on constrained hardware.
- **Automatic differentiation**: supports forward and backward passes for training, with AdamW and SGD optimizers built in via `ggml-opt.h`.
- **Multi-backend**: computation can be dispatched to CPU, CUDA, Metal, Vulkan, HIP, SYCL, and more via a unified backend scheduler.
- **Zero runtime allocations**: memory is allocated once at initialization. No allocations occur during the compute loop, giving predictable latency.
- **GGUF format**: first-class support for the GGUF file format used to distribute quantized model weights.
## Architecture overview

ggml computation follows a four-stage pipeline:

1. **Contexts**: a `ggml_context` owns a fixed-size memory buffer. All tensors and graph metadata are allocated from this buffer via `ggml_init()`.
2. **Tensors**: tensors are allocated inside a context with functions like `ggml_new_tensor_2d()`. They hold shape, data type, stride, and a pointer into the context's buffer. Up to 4 dimensions are supported.
3. **Computation graphs**: tensor operations (e.g. `ggml_mul_mat`, `ggml_add`) do not execute immediately. Instead, each call appends a node to a `ggml_cgraph` that records the operation and its inputs. The graph is finalized with `ggml_build_forward_expand()`.
4. **Backends**: execution is triggered by dispatching the graph to a backend. The legacy path calls `ggml_graph_compute_with_ctx()` directly on the CPU. The modern path uses `ggml_backend_sched` to route operations across multiple devices automatically.
## Next steps

- **Quickstart**: build ggml from source and run your first computation.
