Overview
The vision.py module handles all interactions with OpenAI’s GPT-4V (Vision) API. It encodes screenshots, constructs prompts, parses JSON responses, and implements fallback error handling for malformed outputs.
Image preparation
Encoding and resizing
Screenshots must be base64-encoded before being sent to the API. vimGPT also resizes images to control token usage:
vision.py
Why 1080px? This resolution provides enough detail for GPT-4V to read Vimium hint characters while keeping token costs manageable. Lower resolutions cause detection failures.
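As a sketch of the idea (the function and constant names here are illustrative, not necessarily the ones in vision.py), encoding and resizing with Pillow might look like:

```python
import base64
import io

from PIL import Image

TARGET_WIDTH = 1080  # illustrative constant; the text above motivates 1080px


def encode_and_resize(image: Image.Image) -> str:
    """Downscale wide screenshots to 1080px and return a base64-encoded PNG."""
    width, height = image.size
    if width > TARGET_WIDTH:
        # Preserve the aspect ratio so Vimium hint overlays stay legible
        image = image.resize((TARGET_WIDTH, int(height * TARGET_WIDTH / width)))
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

The resulting base64 string is what gets embedded in the image payload of the API request.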
Prompt engineering
The core of vimGPT’s intelligence comes from a carefully crafted prompt that instructs GPT-4V on:
- Available actions (navigate, type, click, done)
- How to format responses (JSON only)
- How to interpret Vimium overlays (yellow character sequences)
- When to signal completion
The full prompt
vision.py
Prompt breakdown
Task context
"You need to choose which action to take to help a user do this task: {objective}"
Grounds the model in the user’s goal (e.g., “search for machine learning papers”)
Action definitions
- navigate: Go to a URL
- type: Enter text and press Enter
- click: Type the Vimium hint characters
- done: Task complete
Vimium instructions
"return the string with the yellow character sequence you want to click on"
Teaches the model to read Vimium overlays and return hint characters like “AB” or “F”
Format enforcement
"You must respond in JSON only with no other fluff or bad things will happen. Do not return the JSON inside a code block."
Attempts to force pure JSON output (though this doesn’t always work)
Combined actions
"For typing, please return a click to click on the box along with a type with the message to write"
Handles cases where the model needs to click an input field before typing
JSON mode unavailable: At the time of development, GPT-4V didn’t support JSON mode or function calling, requiring prompt-based enforcement.
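Assembled from the fragments quoted above, the prompt might be built roughly like this (a hedged sketch: the function name, exact wording, and ordering are illustrative, not the verbatim contents of vision.py):

```python
def build_prompt(objective: str) -> str:
    """Illustrative assembly of the instruction prompt from the quoted fragments."""
    return (
        "You need to choose which action to take to help a user do this task: "
        f"{objective}. "
        "Your options are navigate (go to a URL), type (enter text and press "
        "Enter), click, and done (task complete). "
        "To click, return the string with the yellow character sequence you "
        "want to click on. "
        "For typing, please return a click to click on the box along with a "
        "type with the message to write. "
        "You must respond in JSON only with no other fluff or bad things will "
        "happen. Do not return the JSON inside a code block."
    )
```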
API configuration
vision.py
The OpenAI API key is loaded from a .env file for security rather than hard-coded in the source.
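A minimal sketch of that setup, using the openai and python-dotenv packages listed under Dependencies:

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Pull OPENAI_API_KEY out of a local .env file so the key never lives in source
load_dotenv()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```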
Token limits
vision.py
Since responses are short JSON objects like {"click": "AB"}, 100 tokens is sufficient.
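A sketch of the call shape (the message layout follows the OpenAI chat-completions vision format; the function name is illustrative, and the model identifier is assumed from the GPT-4V era):

```python
def query_gpt4v(client, prompt: str, encoded_image: str) -> str:
    """Send the prompt plus a base64 screenshot; cap the reply at 100 tokens."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # the GPT-4V model at time of development
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encoded_image}"},
                    },
                ],
            }
        ],
        max_tokens=100,  # replies like {"click": "AB"} fit well under this cap
    )
    return response.choices[0].message.content
```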
Response parsing
Handling valid JSON
vision.py
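When the model behaves, parsing is a single json.loads plus a sanity check on the action keys (a sketch; the names are illustrative):

```python
import json

KNOWN_ACTIONS = {"navigate", "type", "click", "done"}  # actions from the prompt


def parse_response(content: str) -> dict:
    """Parse the raw model reply; fail if it is not JSON or uses unknown actions."""
    action = json.loads(content)  # raises json.JSONDecodeError on malformed output
    unknown = set(action) - KNOWN_ACTIONS
    if unknown:
        raise ValueError(f"unexpected action keys: {unknown}")
    return action
```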
Example valid responses
Error handling and repair
Despite prompt engineering, GPT-4V sometimes returns malformed JSON (e.g., wrapped in code blocks, containing comments, etc.). vimGPT implements a fallback repair mechanism:
vision.py
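A simplified local sketch of the repair path, stripping the code fences the model sometimes adds before retrying; the actual module escalates to a second model call (the "repair call" noted under Latency below):

```python
import json
import re


def parse_with_repair(content: str) -> dict:
    """Try strict JSON first; on failure, strip a code-fence wrapper and retry."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Remove a ```json ... ``` wrapper despite the prompt forbidding it
        stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())
        return json.loads(stripped)
```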
Common response patterns
Based on the prompt and Vimium integration, GPT-4V typically returns:
Simple click
Search query
Navigation
Completion
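Illustrative shapes for the four patterns (the hint characters, query text, and URL are made-up values):

```python
simple_click = {"click": "F"}  # press Vimium hint "F"

search_query = {  # click the search box, then type the query
    "click": "AB",
    "type": "machine learning papers",
}

navigation = {"navigate": "https://arxiv.org"}  # go straight to a URL

completion = {"done": None}  # signal that the task is finished
```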
Testing the vision module
The module includes a standalone test:
vision.py
Performance considerations
Latency
- Encoding: ~10-50ms for image processing
- API call: ~2-5 seconds for GPT-4V inference
- Repair call: Additional 1-3 seconds if JSON parsing fails
Token usage
Vision API tokens are calculated based on:
- Image resolution (higher = more tokens)
- Prompt length
- Response length (minimal due to max_tokens=100)
Cost estimation
With GPT-4V pricing:
- Input: ~$0.01 per image + prompt
- Output: ~$0.03 per 1K tokens (minimal due to short responses)
- Total: ~$0.01-0.02 per action, depending on image size
Completing a multi-step task (5-10 actions) costs approximately $0.05-0.20 in API fees.
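The arithmetic behind that range, using the per-action figures above (the function is illustrative, not part of vision.py):

```python
COST_PER_ACTION_USD = (0.01, 0.02)  # low/high per-action estimate from above


def task_cost_range(num_actions: int) -> tuple[float, float]:
    """Estimated USD cost band for a multi-step task."""
    low, high = COST_PER_ACTION_USD
    return (num_actions * low, num_actions * high)
```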
Limitations and future work
Current limitations
Potential improvements
From the GitHub README:
- Use Assistant API: Once it supports vision, maintain conversation history for context
- Fine-tune open-source models: Use LLaVa, CogVLM, or Fuyu-8B for faster/cheaper inference
- Higher resolution: Better element detection, but requires more tokens
- Hybrid approach: Have GPT-4V return natural language instructions, then use JSON mode GPT-4 to formalize them
- Add accessibility tree: Provide DOM structure alongside screenshots for additional context
- Visual question answering: Return information to the user instead of just executing actions
Dependencies
requirements.txt
- openai: Official OpenAI Python client
- Pillow: Image processing and encoding
- python-dotenv: Environment variable management