# Get Running in 5 Minutes

This guide gets the IMDb Scraper up and running with Docker Compose. The entire stack (scraper, PostgreSQL, TOR, VPN) is orchestrated automatically.

## Prerequisites
Ensure you have the following installed:
- Docker (20.10+)
- Docker Compose (1.29+)
- Git (for cloning the repository)
## Verify Docker Installation
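A quick way to confirm the prerequisites are on your PATH (the exact version numbers in the output will vary):

```shell
docker --version          # should report 20.10 or newer
docker-compose --version  # should report 1.29 or newer
git --version
```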
## Configure Environment Variables
Create a `.env` file in the project root with the required configuration:
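A sketch of what such a file might contain; every variable name and value below is an assumption, so align them with whatever your `docker-compose.yml` actually reads:

```
# PostgreSQL (placeholder values)
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
POSTGRES_DB=imdb

# Premium proxy (optional; drop these lines to fall back to TOR)
PROXY_URL=http://user:pass@proxy.example.com:8080

# ProtonVPN credentials used by Gluetun
OPENVPN_USER=your_protonvpn_user
OPENVPN_PASSWORD=your_protonvpn_password
```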
The proxy and VPN credentials above are examples. Replace them with your actual credentials, or remove the proxy configuration to use TOR as the default fallback.
## Build the Docker Images
Build the scraper image with all of its dependencies. The build will:

- Install Python dependencies from `requirements.txt`
- Set up the PostgreSQL client
- Configure the application structure
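The build step above is typically a single command (assuming the scraper is defined as a service in `docker-compose.yml`; the service name `scraper` is an assumption):

```shell
docker-compose build          # build every service image
docker-compose build scraper  # or rebuild only the scraper
```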
### What's in requirements.txt?
## Start All Services
Launch the entire stack (PostgreSQL, TOR, VPN, and scraper). Use `docker-compose up -d` to run in detached mode (background). The orchestration will:

- Start PostgreSQL on port 5432
- Initialize the TOR proxy on ports 9050 (SOCKS) and 9051 (Control)
- Connect the VPN via Gluetun (ProtonVPN, Argentina)
- Wait for dependencies using health checks
- Execute the scraper automatically
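Once launched, you can watch the stack come together (the service name `scraper` is an assumption):

```shell
docker-compose up -d             # start everything in the background
docker-compose ps                # check that all services are up/healthy
docker-compose logs -f scraper   # follow the scraper's progress
```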
## Verify Data Output

Once scraping completes, check the generated output in two places: the CSV files written to the `/data` directory, and the relational data stored in PostgreSQL.
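A quick sketch for checking both outputs; the database name, user, and table name in the `psql` query are assumptions, so substitute your own:

```shell
ls -lh data/    # list the CSV files generated by the scraper

# Row count in PostgreSQL (database/user/table names are assumptions):
docker-compose exec postgres psql -U postgres -d imdb \
  -c "SELECT COUNT(*) FROM movies;"
```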
## What Just Happened?

The Docker Compose orchestration performed the following:

- **Database Initialization**: PostgreSQL container started with schema creation (`01_schema.sql`), stored procedures (`02_procedures.sql`), and analytical views (`03_views.sql`)
- **Network Stack**: TOR proxy initialized for IP rotation, VPN connected to an Argentina server, premium proxies configured with fallback logic
- **Data Extraction**: 250 movie IDs scraped from the Top 250 chart, detail pages fetched concurrently (50 threads), data validated and parsed
- **Dual Persistence**: CSV files written to the `/data` directory, relational data inserted into PostgreSQL with N:M actor-movie relationships

## Architecture Flow
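The flow described above, sketched as a rough text diagram (service names are assumptions):

```
docker-compose up -d
  ├── postgres (5432) ── schema, procedures, views
  ├── tor (9050/9051) ──┐
  ├── gluetun VPN ──────┤  IP rotation with fallback
  └── scraper ──────────┘
        │  Top 250 IDs → detail pages (50 threads) → validate/parse
        ├──→ CSV files in /data
        └──→ PostgreSQL (N:M actor-movie relations)
```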
## Next Steps

- **Installation Details**: Learn about manual Python setup, system requirements, and troubleshooting
- **Environment Variables**: Customize scraping behavior, network settings, and persistence options
## Common Issues
### Port 5432 already in use
If you have PostgreSQL running locally, change the published port in `.env`, then update `docker-compose.yml` to match.
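One way to remap the host port, assuming the service is named `postgres` in `docker-compose.yml` (5433 is an arbitrary free port):

```yaml
services:
  postgres:
    ports:
      - "5433:5432"   # host port 5433 avoids the locally running PostgreSQL
```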
### VPN connection fails
Check your ProtonVPN credentials in `.env`. If you don't have VPN access, comment out the VPN service in `docker-compose.yml` and rely on TOR/proxies.
### Scraper exits immediately
Check the logs with `docker-compose logs scraper`. Common causes:

- Database not ready (wait longer for the healthcheck)
- Invalid proxy credentials
- Network connectivity issues
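To dig into the causes above and retry after a fix (the service name `scraper` is an assumption):

```shell
docker-compose logs -f scraper   # follow the scraper logs live
docker-compose up -d scraper     # restart only the scraper
```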
### No data in CSV files
Ensure the scraper completed successfully. Check logs for validation errors or network failures. The scraper skips invalid data automatically.
## Clean Up
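Teardown is a single command; note that `-v` also removes the named volumes, so the PostgreSQL data is deleted:

```shell
docker-compose down -v   # stop containers, remove networks and volumes
```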
Stop all containers and remove the volumes when you're finished.

Ready to dive deeper? Check out the Installation Guide for manual setup or the Architecture Documentation to understand the design patterns.