What is simpE?

simpE is a lightweight benchmarking tool designed to evaluate small language models on fundamental cognitive tasks. Whether you're testing models locally with LM-Studio or probing a model's reasoning capabilities, simpE provides quick, reliable metrics on model performance.

Quick Start

Get up and running with simpE in minutes

Installation Guide

Detailed setup instructions and configuration

Benchmark Types

Learn about the three core benchmark areas

Analyzing Results

Understand your benchmark data

Benchmark Types

simpE evaluates language models across three fundamental capability areas:

1. String Reversal

Evaluates basic pattern manipulation by asking the model to reverse strings of varying lengths (2-30 characters). This tests:
  • Character-level attention
  • Sequential processing
  • Ability to follow simple transformations
# Example from the source code (main.py:147-152)
stringlenth = random.randint(2, 30)
text = ''.join(random.choice(string.ascii_uppercase + string.digits + string.ascii_lowercase) for _ in range(stringlenth))

prompt = f"Provide the following text in reverse order. Don't output anything else. Only output the reversed string without anything additional, not even quotes: \"{text}\""
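A response to this task can be graded by comparing it against the slice-reversed string. The sketch below is illustrative (the `grade_reversal` helper is an assumption, not simpE's actual grading code):

```python
import random
import string

def grade_reversal(model_output: str, original: str) -> bool:
    # Python's slice reversal gives the reference answer; stray whitespace is stripped
    return model_output.strip() == original[::-1]

# Mirror the generation shown above (string.ascii_letters covers upper and lower case)
length = random.randint(2, 30)
text = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(length))
```

Because the check is an exact string comparison, any extra commentary or quoting by the model counts as a failure, which is exactly what the prompt warns against.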

2. Big Integer Addition

Challenges the model with arithmetic operations on large integers (2-30 digits each). This benchmark reveals:
  • Mathematical reasoning capabilities
  • Handling of large numbers
  • Ability to perform calculations without explanation
# Example from the source code (main.py:214-223)
int1_length = random.randint(2, 30)
int2_length = random.randint(2, 30)

int1 = int(''.join(random.choice(string.digits) for _ in range(int1_length)))
int2 = int(''.join(random.choice(string.digits) for _ in range(int2_length)))

prompt = f"Provide the sum of the two numbers. Don't output anything else. Only output the sum of the two numbers without anything additional. Only output the final number, no calculation, no explanation, just the final number without any text.: \"{int1}\" \"{int2}\""
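Since Python integers have arbitrary precision, the reference sum for 30-digit operands is exact. A hedged sketch of how such a response could be checked (the parsing shown here is illustrative, not simpE's actual implementation):

```python
def grade_addition(model_output: str, a: int, b: int) -> bool:
    """Return True if the model's output parses to the exact sum."""
    try:
        # Tolerate surrounding whitespace and thousands separators
        return int(model_output.strip().replace(",", "")) == a + b
    except ValueError:
        # Non-numeric output (explanations, refusals) counts as a failure
        return False
```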

3. String Rehearsal

Tests the model’s ability to reproduce longer strings (10-500 characters) exactly as provided. This measures:
  • Context retention
  • Exact replication capabilities
  • Attention to detail
# Example from the source code (main.py:292-297)
stringlenth = random.randint(10, 500)
text = ''.join(random.choice(string.ascii_uppercase + string.digits + string.ascii_lowercase) for _ in range(stringlenth))

prompt = f"Repeat the following string exactly without modifying it. Don't output anything else. Only output the string without anything additional, not even quotes: \"{text}\""
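Grading here reduces to an exact, character-for-character comparison. A minimal sketch (the `grade_rehearsal` helper and its trailing-newline allowance are assumptions):

```python
def grade_rehearsal(model_output: str, original: str) -> bool:
    # Exact character-for-character match; only a trailing newline is forgiven,
    # since most chat APIs terminate completions with one
    return model_output.rstrip("\n") == original
```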

Key Features

Real-time Progress Tracking

simpE provides live console output showing:
  • Current benchmark progress (e.g., “String Reversal 45/100”)
  • Success rate percentage updated in real-time
  • Thinking time for reasoning models
  • Completion status for each benchmark
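A progress line of this shape can be produced with simple string formatting; writing it with a carriage return (`end="\r"`) rewrites the same console line in place. This is a sketch of the idea, not simpE's actual output code:

```python
def format_progress(name: str, done: int, total: int, passed: int) -> str:
    # Success rate over the attempts completed so far, guarded against division by zero
    rate = 100.0 * passed / done if done else 0.0
    return f"{name} {done}/{total}  success: {rate:.1f}%"

# print(format_progress("String Reversal", 45, 100, 36), end="\r")
```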

Reasoning Model Support

Built-in support for models with reasoning capabilities:
  • Configurable reasoning effort levels (low, medium, high)
  • Automatic capture of reasoning traces
  • Detailed reasoning statistics in analysis

Comprehensive Logging

All benchmark runs generate detailed logs:
  • Timestamped execution logs in logs/ directory
  • JSON results with full response data in results/ directory
  • Recent log file for quick access to latest run

Flexible Configuration

Easily adjust benchmark parameters in main.py:
tries = 100  # Number of tests per benchmark
timeout_time = 400  # Timeout in seconds
max_tokens = 512  # Maximum output tokens
reasoning_effort = "low"  # Reasoning level: low, medium, high
baseurl = "http://127.0.0.1:1234/v1"  # LM-Studio API endpoint

Analyzing Results

After running benchmarks, use the built-in analysis tool:
uv run analyze
The analyzer provides:
  • Accuracy metrics - Success percentage for each benchmark
  • Reasoning pattern analysis - Frequency of key phrases like “wait”, “actually”, “hold on”
  • Statistical insights - Average/median/min/max for reasoning trace lengths and word counts
Results are saved as JSON files in the results/ directory with timestamps and model information for easy comparison across runs.
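The phrase-frequency part of the analysis can be sketched as a case-insensitive whole-phrase count over each reasoning trace (the counting logic below is an assumption about how such metrics are typically computed, not the analyzer's actual code):

```python
import re

HEDGE_PHRASES = ("wait", "actually", "hold on")

def count_hedges(trace: str) -> dict:
    # Case-insensitive, whole-word counts of self-correction phrases in a trace
    lowered = trace.lower()
    return {p: len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
            for p in HEDGE_PHRASES}
```

High counts of these phrases tend to indicate a model that backtracks during reasoning, which is why the analyzer surfaces them.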

Why simpE?

Simple Setup

No complex configuration - just install and run

Local-First

Works with LM-Studio for complete privacy

Fast Iteration

Quick benchmarks help you iterate on model selection

Detailed Insights

Rich logging and analysis for deep dives

Next Steps

1. Install simpE: Follow the installation guide to set up simpE and configure your API endpoint.

2. Run Your First Benchmark: Check out the quick start guide to run your first benchmark suite.

3. Analyze Results: Learn how to interpret and compare benchmark results.
