Overview
screenpipe is built as a local-first, event-driven data capture system. All data is stored locally in SQLite, with an optional REST API for programmatic access. The core architecture consists of four main components.
Event-Driven Capture
Unlike traditional screen recorders that poll at a fixed frame rate, screenpipe uses event-driven capture: it only takes a screenshot when something meaningful happens.
Capture Triggers
screenpipe listens for these OS-level events:

| Trigger | Debounce | Description |
|---|---|---|
| App switch | 300ms | User changed applications (highest-value event) |
| Window focus change | 300ms | New tab, document, or conversation opened |
| Mouse click | 200ms | User interacted—screen likely changed |
| Typing pause | 500ms after last key | Captures the result of typing, not every character |
| Scroll stop | 400ms after last scroll | New content scrolled into view |
| Clipboard copy | 200ms | User grabbed something—capture context |
| Idle fallback | Every 5s | Catches passive changes (notifications, incoming messages) |
Each trigger has a debounce period to prevent capture storms. For example, rapid clicking won’t create 20 captures—it’ll create at most 5 (200ms minimum interval).
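The debounce logic can be pictured as a per-trigger timestamp map. Below is a minimal sketch with illustrative names, not screenpipe's actual implementation:

```typescript
// Minimal per-trigger debounce sketch (illustrative, not screenpipe's real code).
// Each trigger type keeps its own "last fired" timestamp; an event is dropped
// if it arrives within that trigger's debounce window.
const DEBOUNCE_MS: Record<string, number> = {
  "app-switch": 300,
  "window-focus": 300,
  "mouse-click": 200,
  "typing-pause": 500,
  "scroll-stop": 400,
  "clipboard-copy": 200,
};

const lastFired = new Map<string, number>();

function shouldCapture(trigger: string, nowMs: number): boolean {
  const windowMs = DEBOUNCE_MS[trigger] ?? 200;
  const last = lastFired.get(trigger) ?? -Infinity;
  if (nowMs - last < windowMs) return false; // still inside the debounce window
  lastFired.set(trigger, nowMs);
  return true;
}
```

With this shape, ten clicks in one second pass through at most five times (one per 200 ms), matching the behavior described above.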
Capture Flow
When an event triggers a capture:

1. Event detected. The OS event listener (CGEventTap on macOS, SetWindowsHookEx on Windows) detects a meaningful event like an app switch or click.
2. Debounce and dedup. The event is debounced (200-500ms depending on type) and deduplicated per monitor to prevent storms.
3. Paired capture. A paired capture runs:
- Screenshot: Captures the monitor image (~5ms)
- Accessibility tree walk: Extracts structured text from the focused window (~10-50ms on macOS, 200-500ms on Windows)
- OCR fallback (if accessibility is empty): Runs OCR on the screenshot (~100-500ms, rare)
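The paired-capture step can be sketched as follows; the function names and shapes here are illustrative assumptions, not screenpipe's real API:

```typescript
// Sketch of the paired capture step (names are illustrative assumptions).
// Accessibility text is preferred; OCR only runs when the tree yields nothing.
interface Capture {
  imagePath: string;
  text: string;
  source: "accessibility" | "ocr";
}

async function pairedCapture(
  screenshot: () => Promise<string>,              // path of saved JPEG (~5ms)
  walkAccessibilityTree: () => Promise<string>,   // structured text (~10-50ms on macOS)
  runOcr: (imagePath: string) => Promise<string>, // fallback (~100-500ms, rare)
): Promise<Capture> {
  const imagePath = await screenshot();
  const axText = await walkAccessibilityTree();
  if (axText.trim().length > 0) {
    return { imagePath, text: axText, source: "accessibility" };
  }
  // Accessibility tree was empty: fall back to OCR on the screenshot.
  return { imagePath, text: await runOcr(imagePath), source: "ocr" };
}
```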
Why Event-Driven?
Compared to continuous recording at 0.5-1 FPS:

| Metric | Continuous (1 FPS) | Event-Driven |
|---|---|---|
| CPU usage (static screen) | 3-5% | < 0.5% |
| CPU usage (active use) | 8-15% | < 5% |
| Frames captured (8 hours) | 28,800 | ~3,840 |
| Storage (8 hours) | 800 MB - 1.6 GB | ~300 MB |
| Capture latency | 1-5 seconds | < 500ms |
Event-driven capture is the default and only mode. There is no FPS slider—capture happens when events occur.
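The table's storage figures are consistent with simple arithmetic. This is a back-of-envelope check, not measured data:

```typescript
// Back-of-envelope check of the comparison table's numbers.
const hours = 8;
const continuousFrames = hours * 3600 * 1;  // 1 FPS over 8 hours = 28,800 frames
const eventFrames = 3840;                   // ~8 events/minute on average (from the table)
const kbPerJpeg = 80;                       // ~80 KB per 1080p JPEG (see Storage Layer)
const eventStorageMb = (eventFrames * kbPerJpeg) / 1024; // = 300 MB
```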
Text Extraction
screenpipe extracts text from your screen using two methods:
1. Accessibility Tree (Primary)
The accessibility tree is the structured representation of UI that screen readers use. It contains:
- Button labels
- Text field content
- Menu items
- Window titles
- Structured roles (button, text, list, etc.)
Its advantages:
- Fast: 10-50ms per capture (macOS), 200-500ms (Windows)
- Accurate: text is already parsed by the OS
- Structured: knows what’s a button vs. body text

It works well with:
- Native OS apps (Finder, System Settings, etc.)
- Browsers (Chrome, Safari, Firefox)
- Electron apps (VS Code, Slack, Discord)
- Most modern apps with accessibility support
2. OCR (Fallback)
When accessibility data is unavailable or empty, screenpipe falls back to Optical Character Recognition (OCR):
- macOS: Apple Vision framework (fast, accurate)
- Windows: Windows native OCR
- Linux: Tesseract

OCR covers cases such as:
- Image-heavy apps (Figma, Photoshop)
- PDF viewers rendering as canvas
- Video players showing text
- Games and remote desktop sessions
- Apps with broken or missing accessibility support
Text Storage
Extracted text is stored in two places:
- `frames` table: the `accessibility_text` or `ocr_text` column stores the text directly on the frame row
- Full-text search index: a SQLite FTS5 index for fast keyword search
Audio Pipeline
screenpipe captures and transcribes audio in real time:
Audio capture
- System audio: What you hear (Zoom, Spotify, YouTube)
- Microphone: What you say
Speech-to-text
Each 30-second chunk is transcribed using OpenAI Whisper (running locally):
- Model: `base` or `small` (configurable)
- Speed: ~2-5x real-time (30s of audio transcribed in 6-15s)
- Languages: 50+ languages supported
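The fixed 30-second chunking can be sketched as a pure function over time offsets; this is illustrative, and real chunk boundaries may differ (e.g., aligning to silence):

```typescript
// Split an audio stream into fixed 30-second chunks for transcription
// (illustrative sketch; the real pipeline may align boundaries to silence).
function chunkBounds(totalSeconds: number, chunkSeconds = 30): Array<[number, number]> {
  const chunks: Array<[number, number]> = [];
  for (let start = 0; start < totalSeconds; start += chunkSeconds) {
    chunks.push([start, Math.min(start + chunkSeconds, totalSeconds)]);
  }
  return chunks;
}
// At ~2-5x real-time, each full 30s chunk takes roughly 6-15s to transcribe.
```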
Speaker diarization
A diarization step separates different speakers in the audio:
- Labels speakers as Speaker 1, Speaker 2, etc.
- Works best with clear audio and distinct voices
Audio transcription is optional. You can disable it in settings to save CPU and storage.
Storage Layer
All data is stored locally in two places:
1. SQLite Database
Location: `~/.screenpipe/db.sqlite`
Key tables:
| Table | Purpose |
|---|---|
| `frames` | Screenshot metadata (timestamp, app, window, trigger, accessibility text) |
| `ocr_text` | OCR results (when the accessibility fallback is used) |
| `audio_transcriptions` | Audio transcription segments with speaker labels |
| `ui_events` | User input events (clicks, keystrokes, clipboard) |
| `meetings` | Detected meetings with duration and attendees |
2. JPEG Snapshots
Location: `~/.screenpipe/data/YYYY-MM-DD/`
Each capture writes a JPEG directly to disk:
- Quality: JPEG quality 80 (configurable)
- Size: ~80 KB per frame (1080p)
- Retention: Configurable auto-delete (e.g., delete frames older than 30 days)
Older versions of screenpipe stored frames in H.265 video chunks. New captures use JPEG snapshots, but old video-based frames are still readable via FFmpeg extraction.
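Retention cleanup over the day-stamped folders can be pictured as a pure filter; this is an illustrative sketch, not screenpipe's actual cleanup logic:

```typescript
// Pick the YYYY-MM-DD folders under ~/.screenpipe/data/ that fall outside the
// retention window (illustrative sketch; the real cleanup logic may differ).
function expiredDirs(dirs: string[], todayIso: string, retentionDays: number): string[] {
  const cutoff = new Date(Date.parse(todayIso) - retentionDays * 86_400_000)
    .toISOString()
    .slice(0, 10);
  // YYYY-MM-DD sorts lexicographically, so plain string comparison is safe here.
  return dirs.filter((d) => d < cutoff);
}
```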
REST API
screenpipe exposes a local REST API on `localhost:3030` for programmatic access:
Core Endpoints
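As an illustration of calling the local API, the sketch below assumes a `/search` endpoint with `q`, `content_type`, and `limit` query parameters; treat the path and parameter names as assumptions rather than a verified spec:

```typescript
// Hedged sketch of querying the local REST API. The /search endpoint and its
// parameter names (q, content_type, limit) are assumptions, not a verified spec.
function searchUrl(base: string, q: string, contentType: string, limit: number): string {
  const params = new URLSearchParams({ q, content_type: contentType, limit: String(limit) });
  return `${base}/search?${params}`;
}

async function search(q: string): Promise<unknown> {
  const res = await fetch(searchUrl("http://localhost:3030", q, "ocr", 10));
  if (!res.ok) throw new Error(`screenpipe API error: ${res.status}`);
  return res.json();
}
```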
JavaScript SDK
screenpipe provides a TypeScript SDK for easy API access.
Plugin System (Pipes)
Pipes are scheduled AI agents defined as markdown files. Each pipe runs on a schedule and can:
- Query screenpipe data via the API
- Call external APIs
- Write files
- Send notifications
Pipe Structure
A pipe is a `pipe.md` file with YAML frontmatter:
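A hedged sketch of what such a file might look like; the `name` and `schedule` fields are illustrative assumptions, while the permission fields follow the Data Permissions table in this page:

```md
---
name: meeting-summary           # illustrative field, not a verified schema
schedule: "0 18 * * *"          # run daily at 18:00 (cron syntax; illustrative)
allow-apps:
  - "zoom.us"
  - "Google Chrome"
deny-windows:
  - "*password*"
allow-content-types: [audio, accessibility]
---

Summarize today's meetings and write the result to my notes folder.
```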
Data Permissions
Each pipe supports deterministic access control via YAML frontmatter:

| Field | Description |
|---|---|
| `allow-apps` | Whitelist of apps the pipe can access (glob patterns) |
| `deny-apps` | Blacklist of apps |
| `deny-windows` | Blacklist of window titles (e.g., `*password*`) |
| `allow-content-types` | Restrict to `ocr`, `audio`, `input`, or `accessibility` |
| `time-range` | Time range the pipe can access (e.g., `09:00-18:00`) |
| `days` | Days of the week (e.g., `Mon,Tue,Wed,Thu,Fri`) |
| `allow-raw-sql` | Allow raw SQL queries (default: `false`) |
| `allow-frames` | Allow access to raw frame images (default: `false`) |
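Enforcement of these rules can be pictured as a filter over capture events; this is an illustrative TypeScript sketch, and the real checks live in screenpipe's core and may differ:

```typescript
// Illustrative permission filter (not screenpipe's actual implementation).
interface PipePolicy {
  allowApps?: string[];   // glob patterns; undefined = allow all apps
  denyWindows?: string[]; // glob patterns matched against window titles
}

// Tiny glob matcher supporting only "*" wildcards, case-insensitive.
function globMatch(pattern: string, value: string): boolean {
  const re = new RegExp(
    "^" +
      pattern
        .split("*")
        .map((s) => s.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join(".*") +
      "$",
    "i",
  );
  return re.test(value);
}

function canAccess(policy: PipePolicy, app: string, windowTitle: string): boolean {
  // Deny rules win first, then the app whitelist (if any) is checked.
  if (policy.denyWindows?.some((p) => globMatch(p, windowTitle))) return false;
  if (policy.allowApps && !policy.allowApps.some((p) => globMatch(p, app))) return false;
  return true;
}
```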
Built-in Pipes
- obsidian-sync: Sync activity to Obsidian vault
- reminders: Scan activity for TODOs and create Apple Reminders
- meeting-summary: Auto-generate meeting summaries
- time-breakdown: Generate time tracking reports by app
- idea-tracker: Surface startup ideas from browsing + market trends
Platform-Specific Implementation
screenpipe is cross-platform but uses platform-specific APIs for optimal performance:

| Component | macOS | Windows | Linux |
|---|---|---|---|
| Event detection | CGEventTap | SetWindowsHookEx | X11/Wayland hooks |
| Screenshot | ScreenCaptureKit | DXGI/GDI | X11/PipeWire |
| Accessibility | AX API | UI Automation | AT-SPI |
| Audio capture | Core Audio | WASAPI | PipeWire |
| OCR | Apple Vision | Windows OCR | Tesseract |
~90% of screenpipe’s code is platform-agnostic Rust. Only event detection and capture APIs are platform-specific.
Security & Privacy
Local-First Architecture
- All data stays on your device by default
- No external servers or cloud dependencies
- SQLite database is not encrypted by default (stored in your home directory with OS-level permissions)
- Optional encrypted cloud sync uses zero-knowledge encryption
Network Isolation
- API only listens on `localhost:3030` (not exposed to the network)
- No telemetry or analytics sent to external servers
- All AI models can run locally via Ollama
Data Access Control
- Per-pipe permissions enforced at OS level
- Ignored windows list (skip sensitive apps like password managers)
- Optional data retention limits (auto-delete old frames)
screenpipe is open source (MIT license). You can audit the entire codebase at github.com/screenpipe/screenpipe.
Performance Characteristics
Typical resource usage on a modern machine (M1 MacBook Pro, 16 GB RAM):

| Scenario | CPU | RAM | Disk I/O |
|---|---|---|---|
| Idle (static screen) | < 0.5% | 500 MB | Minimal |
| Active use (browsing, coding) | 3-7% | 1-2 GB | ~1 MB/s |
| Audio transcription | +5-10% | +500 MB | +500 KB/s |
| Initial indexing | 15-25% | 2-3 GB | 5-10 MB/s |
Performance degrades gracefully on lower-end hardware. Event-driven capture automatically reduces frequency if CPU usage exceeds thresholds.
Next Steps
- API Reference: explore all API endpoints
- Pipes: build AI agent plugins
- MCP Server: connect AI assistants
- Contributing: contribute to screenpipe