
Get Running in 5 Minutes

This guide will get the IMDb Scraper up and running using Docker Compose. The entire stack (scraper, PostgreSQL, TOR, VPN) will be orchestrated automatically.
1. Prerequisites

Ensure you have the following installed:
  • Docker (20.10+)
  • Docker Compose (1.29+)
  • Git (for cloning the repository)
Verify your installed versions:
docker --version
docker-compose --version
Expected output:
Docker version 20.10.x
docker-compose version 1.29.x
2. Clone the Repository

Clone the IMDb Scraper repository to your local machine:
git clone https://github.com/frankdevg/imdb-scraper.git
cd imdb-scraper
3. Configure Environment Variables

Create a .env file in the project root with the required configuration:
.env
# PostgreSQL Configuration
POSTGRES_DB=imdb_scraper
POSTGRES_USER=aruiz
POSTGRES_PASSWORD=@ndresruiz@123
POSTGRES_PORT=5432
POSTGRES_HOST=postgres

# Premium Proxy Configuration (DataImpulse)
PROXY_HOST=gw.dataimpulse.com
PROXY_PORT=823
PROXY_USER=your_proxy_user
PROXY_PASS=your_proxy_password

# VPN Configuration (ProtonVPN)
VPN_PROVIDER=protonvpn
VPN_USERNAME=your_vpn_username
VPN_PASSWORD=your_vpn_password
VPN_COUNTRY=Argentina
The proxy and VPN credentials above are examples. Replace them with your actual credentials, or remove the proxy configuration to use TOR as the default fallback.
Never commit the .env file to version control. It’s already included in .gitignore for security.
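Before starting the stack, it can help to sanity-check the file. A minimal stdlib sketch (the helper functions are hypothetical, not part of the project; the required keys match the .env above):

```python
# Hypothetical .env sanity check (stdlib only).
REQUIRED = [
    "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD",
    "POSTGRES_PORT", "POSTGRES_HOST",
]

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]
```

`missing_keys(parse_env(open(".env").read()))` should come back empty for a complete file.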
4. Build the Docker Images

Build the scraper image with all dependencies:
docker-compose build --no-cache
This will:
  • Install Python dependencies from requirements.txt
  • Set up the PostgreSQL client
  • Configure the application structure
requirements.txt:
requests
bs4
psycopg2-binary
python-dotenv
stem
requests[socks]
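The requests[socks] extra is what lets the scraper route traffic through TOR's SOCKS port. A sketch of the proxy mapping it would pass to requests (the helper name and defaults are assumptions, not the project's actual code):

```python
# Hypothetical helper: build a requests-style proxy mapping for TOR.
# socks5h:// (note the "h") resolves DNS through the proxy as well,
# so hostname lookups don't leak outside TOR.
def tor_proxies(host="127.0.0.1", port=9050):
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```

With requests installed, `requests.get(url, proxies=tor_proxies())` would go out through TOR's SOCKS port 9050, while stem talks to the control port 9051 to request fresh circuits.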
5. Start All Services

Launch the entire stack (PostgreSQL, TOR, VPN, and scraper):
docker-compose up
Use docker-compose up -d to run in detached mode (background).
The orchestration will:
  1. Start PostgreSQL on port 5432
  2. Initialize TOR proxy on ports 9050 (SOCKS) and 9051 (Control)
  3. Connect VPN via Gluetun (ProtonVPN to Argentina)
  4. Wait for dependencies using health checks
  5. Execute the scraper automatically
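Step 4 boils down to waiting until a service's TCP port accepts connections. A stdlib sketch of the idea (illustrative only, not the project's actual healthcheck):

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until host:port accepts TCP connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Something like `wait_for_port("postgres", 5432)` is the programmatic equivalent of the Compose healthcheck gating the scraper's start.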
6. Monitor Progress

Watch the scraper logs in real-time:
docker-compose logs -f scraper
You’ll see output like:
INFO - Initializing dependency container...
INFO - Building scraper...
INFO - Starting scraping process...
INFO - [HTML] IDs obtained: 250
INFO - [GraphQL] IDs obtained: 250
INFO - Fetching movie detail 1/250: tt0111161
INFO - Movie saved: The Shawshank Redemption (1994)
INFO - Total traffic used: 15.42 MB
INFO - Scraping process finished successfully.
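If you want to pull the traffic figure out of the log stream programmatically, a small sketch (the pattern keys on the `NN.NN MB` value rather than the surrounding message text, which is an assumption about the log format):

```python
import re

def traffic_mb(log_line):
    """Extract an 'NN.NN MB' figure from a log line, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?) MB", log_line)
    return float(match.group(1)) if match else None
```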
7. Verify Data Output

Once scraping completes, check the generated files.
CSV files (in /data):
ls -lh data/
Output:
-rw-r--r-- 1 user user  45K movies.csv
-rw-r--r-- 1 user user  12K actors.csv
-rw-r--r-- 1 user user  18K movie_actor.csv
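You can also verify the CSV output programmatically; a stdlib sketch (the helper is illustrative, not part of the project):

```python
import csv

def count_rows(path):
    """Count data rows in a CSV file, excluding the header line."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f)) - 1
```

After a full run, `count_rows("data/movies.csv")` should report 250.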
PostgreSQL Data:
docker exec -it imdb_postgres psql -U aruiz -d imdb_scraper -c "SELECT COUNT(*) FROM movies;"
Expected output:
 count 
-------
   250
8. Run Analytical Queries

The scraper automatically executes analytical SQL queries from sql/queries.sql. View results:
docker exec -it imdb_postgres psql -U aruiz -d imdb_scraper -f /docker-entrypoint-initdb.d/queries.sql
Sample query output:
               title                | year | rating | metascore 
------------------------------------+------+--------+-----------
 The Shawshank Redemption           | 1994 |    9.3 |        82
 The Godfather                      | 1972 |    9.2 |       100
 The Dark Knight                    | 2008 |    9.0 |        84

What Just Happened?

The Docker Compose orchestration performed the following:

Database Initialization

The PostgreSQL container started and ran schema creation (01_schema.sql), stored procedures (02_procedures.sql), and analytical views (03_views.sql).

Network Stack

The TOR proxy was initialized for IP rotation, the VPN connected to an Argentina server, and premium proxies were configured with fallback logic.

Data Extraction

250 movie IDs were scraped from the Top 250 chart, detail pages were fetched concurrently (50 threads), and the data was validated and parsed.

Dual Persistence

CSV files were written to the /data directory, and relational data was inserted into PostgreSQL with N:M actor-movie relationships.
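The N:M actor-movie relationship is a standard junction table. A self-contained sketch of the shape (sqlite3 is used here purely for illustration; the real stack uses PostgreSQL, and the column names and sample IDs are assumptions):

```python
import sqlite3

# In-memory stand-in for the real PostgreSQL schema (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE actors (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE movie_actor (              -- junction table for the N:M link
        movie_id TEXT REFERENCES movies(id),
        actor_id TEXT REFERENCES actors(id),
        PRIMARY KEY (movie_id, actor_id)
    );
""")
conn.execute("INSERT INTO movies VALUES ('tt0111161', 'The Shawshank Redemption')")
conn.executemany("INSERT INTO actors VALUES (?, ?)",
                 [("nm0000209", "Tim Robbins"), ("nm0000151", "Morgan Freeman")])
conn.executemany("INSERT INTO movie_actor VALUES (?, ?)",
                 [("tt0111161", "nm0000209"), ("tt0111161", "nm0000151")])

# Resolve the cast of one movie through the junction table.
rows = conn.execute("""
    SELECT a.name FROM actors a
    JOIN movie_actor ma ON ma.actor_id = a.id
    WHERE ma.movie_id = 'tt0111161'
    ORDER BY a.name
""").fetchall()
```

One movie row joins to many actor rows (and vice versa) without duplicating either side.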

Architecture Flow

(Diagram of the container orchestration flow; not reproduced in this text version.)

Next Steps

Installation Details

Learn about manual Python setup, system requirements, and troubleshooting

Environment Variables

Customize scraping behavior, network settings, and persistence options

Common Issues

Port 5432 already in use: If you have PostgreSQL running locally, change the port in .env:
POSTGRES_PORT=5433
Then update docker-compose.yml:
ports:
  - "5433:5432"

VPN connection fails: Check your ProtonVPN credentials in .env. If you don't have VPN access, comment out the VPN service in docker-compose.yml and rely on TOR/proxies.

Scraper exits early: Check logs with docker-compose logs scraper. Common causes:
  • Database not ready (wait longer for the healthcheck)
  • Invalid proxy credentials
  • Network connectivity issues

Empty output files: Ensure the scraper completed successfully. Check logs for validation errors or network failures. The scraper skips invalid data automatically.

Scraping fails repeatedly: IMDb may have updated their HTML structure. Check the SELECTORS configuration in shared/config/config.py and update the CSS selectors accordingly.

Clean Up

Stop all containers and remove volumes:
# Stop containers
docker-compose down

# Remove volumes (deletes database data)
docker-compose down -v

Ready to dive deeper? Check out the Installation Guide for manual setup or the Architecture Documentation to understand the design patterns.
