Llama 2 is a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters.

Model Sizes

Llama 2 is available in three parameter sizes:
Model Size   Model Parallel (MP)   Use Case
7B           1                     Lightweight deployment, single-GPU inference
13B          2                     Balanced performance and resource usage
70B          8                     Maximum performance, enterprise applications
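The MP value tells you how many GPUs the checkpoint is sharded across. As a rough back-of-the-envelope check of why those shard counts work, the sketch below estimates fp16 weight memory per rank; it assumes 2 bytes per parameter and an even split across ranks, and ignores activation and KV-cache overhead.

```python
# Rough sketch: estimate fp16 weight memory per model-parallel (MP) rank.
# Assumes 2 bytes per parameter (fp16/bf16) and an even split across ranks;
# activations and the KV cache add further overhead not counted here.

def weight_gib_per_rank(params_billion: float, mp: int) -> float:
    """Approximate fp16 weight footprint per GPU, in GiB."""
    total_bytes = params_billion * 1e9 * 2  # 2 bytes per fp16 parameter
    return total_bytes / mp / (1024 ** 3)

for params, mp in [(7, 1), (13, 2), (70, 8)]:
    print(f"{params}B at MP={mp}: ~{weight_gib_per_rank(params, mp):.1f} GiB per GPU")
```

At these MP settings, every configuration lands in the 12-17 GiB range of weights per GPU, which is why each size targets a different number of devices.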

Context Length

All Llama 2 models support a maximum sequence length of 4096 tokens. When running inference, set max_seq_len no higher than your hardware allows: the key/value cache is pre-allocated up front based on max_seq_len and max_batch_size, so both values directly determine memory usage.
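To see how those two parameters drive memory, here is a rough estimate of the pre-allocated KV-cache footprint. The layer count (32) and model dimension (4096) below are the 7B model's shape, and fp16 (2-byte) cache entries are assumed.

```python
# Rough sketch: the inference KV cache is pre-allocated from max_batch_size
# and max_seq_len, so both directly bound memory. Shape constants below
# assume the 7B model (32 layers, model dim 4096) and fp16 entries.

def kv_cache_gib(max_batch_size: int, max_seq_len: int,
                 n_layers: int = 32, dim: int = 4096) -> float:
    """Approximate KV-cache footprint in GiB (keys + values, fp16)."""
    bytes_total = 2 * n_layers * max_batch_size * max_seq_len * dim * 2
    return bytes_total / (1024 ** 3)

print(kv_cache_gib(max_batch_size=1, max_seq_len=4096))  # 2.0 (GiB)
```

Halving max_seq_len halves the cache, which is why shortening it is the usual first lever on memory-constrained hardware.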

Model Types

Llama 2 comes in two variants:

Pretrained Models

Base models trained on 2 trillion tokens from publicly available online data:
  • Training: Pretrained only, no fine-tuning
  • Use Case: Natural language generation tasks where you need text completion
  • Prompting: Requires prompts where the expected answer is the natural continuation
  • Models: llama-2-7b, llama-2-13b, llama-2-70b
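Because the base models have no chat template, a prompt must read as text whose natural continuation is the desired answer. One common pattern is a few-shot prompt; the `=>` separator and the translation task below are illustrative choices, not part of any Llama 2 API.

```python
# Pretrained models complete text, so the prompt itself must set up the task.
# A few-shot prompt (hypothetical format using "=>" as the separator) steers
# the model to continue with the answer for the final line.

def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a completion-style prompt; the model continues after the last '=>'."""
    lines = [f"{src} => {dst}" for src, dst in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

The model sees a consistent pattern and its most likely continuation is the translation of the last item.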

Chat Models

Fine-tuned versions optimized for dialogue applications:
  • Training: Pretrained + Supervised Fine-Tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)
  • Use Case: Conversational AI, chat assistants, Q&A systems
  • Prompting: Requires specific dialog formatting with special tags
  • Models: llama-2-7b-chat, llama-2-13b-chat, llama-2-70b-chat

Architecture

Llama 2 uses an optimized transformer architecture:
  • Type: Auto-regressive language model
  • Training Data: 2.0T tokens for all sizes
  • Special Features: The 70B model uses Grouped-Query Attention (GQA) for improved inference scalability
  • Training Period: January 2023 to July 2023
  • Data Cutoff: September 2022 (pretraining), up to July 2023 (fine-tuning data)
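The GQA feature noted above reduces KV-cache size by letting many query heads share a smaller set of key/value heads. The sketch below shows only that head-to-head mapping; the head counts (64 query heads, 8 KV heads) reflect the 70B configuration.

```python
# Sketch of the Grouped-Query Attention (GQA) idea used by the 70B model:
# query heads are divided into groups, and each group shares one key/value
# head, shrinking the KV cache by a factor of N_HEADS / N_KV_HEADS.

N_HEADS, N_KV_HEADS = 64, 8           # 70B configuration
GROUP_SIZE = N_HEADS // N_KV_HEADS    # query heads per shared KV head

def kv_head_for(query_head: int) -> int:
    """Return the KV head that a given query head attends with."""
    return query_head // GROUP_SIZE

# Query heads 0..7 share KV head 0, heads 8..15 share KV head 1, and so on.
print([kv_head_for(h) for h in (0, 7, 8, 63)])  # [0, 0, 1, 7]
```

With 8 KV heads instead of 64, the cache from the Context Length section shrinks eightfold, which is what makes large-batch 70B inference tractable.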

Performance Benchmarks

Performance on academic benchmarks (Llama 2 pretrained models):
Model   Code   Commonsense Reasoning   World Knowledge   Reading Comprehension   Math   MMLU
7B      16.8   63.9                    48.9              61.3                    14.6   45.3
13B     24.5   66.9                    55.4              65.8                    28.7   54.8
70B     37.5   71.9                    63.6              69.4                    35.2   68.9

Choosing a Model

Use Pretrained Models when:
  • You need natural text continuation or completion
  • You’re building custom applications with specific prompting patterns
  • You plan to fine-tune for your specific domain
Use Chat Models when:
  • Building conversational interfaces
  • Implementing Q&A systems
  • Creating assistant-like applications
  • You need safety-aligned responses
