The IMDb Scraper uses a dual-source extraction strategy combining HTML parsing and GraphQL API calls to maximize data coverage and reliability.

Architecture Overview

The scraping engine is built around the ImdbScraper class, which implements the ScraperInterface and follows Clean Architecture principles:
class ImdbScraper(ScraperInterface):
    def __init__(
        self,
        use_case: UseCaseInterface,
        proxy_provider: ProxyProviderInterface,
        tor_rotator: TorInterface,
        engine: str,
        base_url: str = config.BASE_URL
    ):
        self.use_case = use_case
        self.proxy_provider = proxy_provider
        self.tor_rotator = tor_rotator
        self.engine = engine
        self.base_url = base_url
Location: infrastructure/scraper/imdb_scraper.py:21
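The collaborator interfaces the constructor receives can be sketched as minimal abstract base classes. The method names and signatures below are assumptions for illustration only; the real definitions live in the project's domain and application layers and may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

# Hypothetical minimal versions of the interfaces the scraper depends on;
# method names and signatures are illustrative, not the project's actual code.
class UseCaseInterface(ABC):
    @abstractmethod
    def execute(self, movie: Any) -> None:
        """Persist or process a scraped movie entity."""

class ProxyProviderInterface(ABC):
    @abstractmethod
    def get_proxy(self) -> Optional[Dict[str, str]]:
        """Return a requests-style proxy mapping, or None."""

class TorInterface(ABC):
    @abstractmethod
    def rotate_ip(self) -> None:
        """Request a new TOR circuit / exit IP."""

class ScraperInterface(ABC):
    @abstractmethod
    def run(self) -> None:
        """Execute the full scraping workflow."""
```

Because the scraper only depends on these abstractions, each collaborator can be swapped or faked in tests without touching the scraping logic.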

Dual-Source Data Collection

HTML Parsing Strategy

The scraper extracts movie IDs from the IMDb Top 250 chart using CSS selectors:
# Extract movie IDs from HTML
html_ids = [
    a["href"].split("/")[2]
    for a in soup.select("td.titleColumn a")
    if "/title/" in a["href"]
]
Location: infrastructure/scraper/imdb_scraper.py:146
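The ID-extraction snippet above can be exercised against a hand-written HTML fragment (not a live IMDb page) to show how the `/title/` filter and the path split behave:

```python
from bs4 import BeautifulSoup

# Demonstration of the ID extraction on a minimal, hand-written fragment
# that mimics the chart's markup; not a real IMDb page.
html = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td></tr>
  <tr><td class="titleColumn"><a href="/search/">not a title link</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

html_ids = [
    a["href"].split("/")[2]      # "/title/tt0111161/" -> ["", "title", "tt0111161", ""]
    for a in soup.select("td.titleColumn a")
    if "/title/" in a["href"]    # drops the /search/ link
]
print(html_ids)  # ['tt0111161', 'tt0068646']
```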

GraphQL API Integration

To supplement HTML data, the scraper queries IMDb’s GraphQL endpoint:
def _fetch_graphql_ids(self, cookies: Optional[requests.cookies.RequestsCookieJar]) -> List[str]:
    payload = {
        "operationName": config.GRAPHQL_OPERATION,
        "variables": { 
            "first": config.NUM_MOVIES, 
            "isInPace": False, 
            "locale": config.GRAPHQL_LOCALE 
        },
        "extensions": { 
            "persistedQuery": { 
                "sha256Hash": config.GRAPHQL_HASH, 
                "version": config.GRAPHQL_VERSION 
            } 
        }
    }

    response = make_request(
        url=config.GRAPHQL_URL,
        proxy_provider=self.proxy_provider,
        tor_rotator=self.tor_rotator,
        method="POST",
        json_payload=payload
    )
Location: infrastructure/scraper/imdb_scraper.py:158

GraphQL Configuration:
  • Endpoint: https://caching.graphql.imdb.com/
  • Operation: Top250MoviesPagination
  • Hash: 2db1d515844c69836ea8dc532d5bff27684fdce990c465ebf52d36d185a187b3
  • Locale: en-US
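Once the POST succeeds, the movie IDs must be pulled out of the JSON body. The response shape sketched below (`data` → `chartTitles` → `edges` → `node` → `id`) is an assumption based on typical IMDb persisted-query responses and may not match the live API exactly:

```python
from typing import List

# Assumed response shape for the Top250 persisted query; verify against the
# live endpoint before relying on these key names.
sample_response = {
    "data": {
        "chartTitles": {
            "edges": [
                {"node": {"id": "tt0111161"}},
                {"node": {"id": "tt0068646"}},
            ]
        }
    }
}

def extract_graphql_ids(data: dict) -> List[str]:
    """Walk the assumed edges/node structure, skipping malformed entries."""
    edges = data.get("data", {}).get("chartTitles", {}).get("edges", [])
    return [e["node"]["id"] for e in edges if e.get("node", {}).get("id")]

print(extract_graphql_ids(sample_response))  # ['tt0111161', 'tt0068646']
```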

BeautifulSoup Selectors

The engine uses CSS selectors configured in shared/config/config.py:
SELECTORS = {
    "title": '[data-testid="hero__primary-text"]',
    "year": 'ul.ipc-inline-list li a[href*="releaseinfo"]',
    "rating": '[data-testid="hero-rating-bar__aggregate-rating__score"] span',
    "duration_container": 'ul.ipc-inline-list--show-dividers',
    "metascore": "span.metacritic-score-box",
    "actors": "a[data-testid='title-cast-item__actor']"
}

Data Extraction Logic

# Title extraction
title_tag = soup.select_one(config.SELECTORS.get("title", ""))
title = title_tag.text.strip() if title_tag else ""

# Year extraction with regex validation
year_tag = soup.select_one(config.SELECTORS.get("year", ""))
year_str = year_tag.text.strip("()") if year_tag else "0"
year_match = re.search(r'\d{4}', year_str)
year = int(year_match.group()) if year_match else 0

# Rating extraction
rating_tag = soup.select_one(config.SELECTORS.get("rating", ""))
rating = float(rating_tag.text.strip()) if rating_tag else 0.0

# Metascore extraction (optional field)
metascore_tag = soup.select_one(config.SELECTORS.get("metascore", ""))
metascore = int(metascore_tag.text.strip()) if metascore_tag else None
Location: infrastructure/scraper/imdb_scraper.py:85
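The extraction logic can be run end to end against a small hand-written fragment that mimics the configured selectors (again, not a real IMDb page):

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment matching the SELECTORS entries for title, year,
# and rating; used only to demonstrate the extraction flow.
html = """
<span data-testid="hero__primary-text">The Godfather</span>
<ul class="ipc-inline-list"><li><a href="/title/tt0068646/releaseinfo">1972</a></li></ul>
<div data-testid="hero-rating-bar__aggregate-rating__score"><span>9.2</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.select_one('[data-testid="hero__primary-text"]')
title = title_tag.text.strip() if title_tag else ""

year_tag = soup.select_one('ul.ipc-inline-list li a[href*="releaseinfo"]')
year_match = re.search(r"\d{4}", year_tag.text if year_tag else "")
year = int(year_match.group()) if year_match else 0

rating_tag = soup.select_one('[data-testid="hero-rating-bar__aggregate-rating__score"] span')
rating = float(rating_tag.text.strip()) if rating_tag else 0.0

print(title, year, rating)  # The Godfather 1972 9.2
```

Note the defensive pattern throughout: every lookup tolerates a missing tag and falls back to a neutral default (`""`, `0`, `0.0`, or `None`), so one broken selector never aborts the whole page.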

Duration Parsing

The scraper handles IMDb’s varied duration formats (e.g., “2h 30m”, “1h 45m”, “90m”):
duration = None
ul_list = soup.select(config.SELECTORS.get("duration_container", ""))
for ul in ul_list:
    for li in ul.find_all("li"):
        text = li.get_text(strip=True).lower()
        if re.search(r"(\d+h|\d+m)", text):
            hours_match = re.search(r"(\d+)h", text)
            minutes_match = re.search(r"(\d+)m", text)
            h = int(hours_match.group(1)) if hours_match else 0
            m = int(minutes_match.group(1)) if minutes_match else 0
            duration = (h * 60) + m
            break
    if duration is not None:
        break
Location: infrastructure/scraper/imdb_scraper.py:100
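Pulled out of the page-walking loop, the same regex logic becomes a small function that can be tested directly on the formats listed above:

```python
import re
from typing import Optional

# Standalone version of the duration-parsing logic above, so the format
# handling can be exercised on bare strings.
def parse_duration(text: str) -> Optional[int]:
    """Convert 'Xh Ym', 'Xh', or 'Ym' to total minutes; None if no match."""
    text = text.strip().lower()
    if not re.search(r"(\d+h|\d+m)", text):
        return None
    hours_match = re.search(r"(\d+)h", text)
    minutes_match = re.search(r"(\d+)m", text)
    h = int(hours_match.group(1)) if hours_match else 0
    m = int(minutes_match.group(1)) if minutes_match else 0
    return (h * 60) + m

print(parse_duration("2h 30m"))  # 150
print(parse_duration("1h 45m"))  # 105
print(parse_duration("90m"))     # 90
print(parse_duration("N/A"))     # None
```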

Actor Extraction

The scraper extracts the top 3 actors from each movie:
cast_tags = soup.select(config.SELECTORS.get("actors", ""))[:3]
actors = [
    Actor(id=None, name=cast.text.strip())
    for cast in cast_tags if cast.text.strip()
]
Location: infrastructure/scraper/imdb_scraper.py:115
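The slice-then-filter pattern above can be demonstrated with a hypothetical minimal `Actor` entity (the real domain model may carry more fields):

```python
from dataclasses import dataclass
from typing import Optional
from bs4 import BeautifulSoup

# Hypothetical minimal Actor entity for illustration; the project's real
# domain model may differ.
@dataclass
class Actor:
    id: Optional[int]
    name: str

html = """
<a data-testid="title-cast-item__actor">Marlon Brando</a>
<a data-testid="title-cast-item__actor">Al Pacino</a>
<a data-testid="title-cast-item__actor">James Caan</a>
<a data-testid="title-cast-item__actor">Robert Duvall</a>
"""
soup = BeautifulSoup(html, "html.parser")

# [:3] caps the result at the top three billed actors before filtering blanks.
cast_tags = soup.select("a[data-testid='title-cast-item__actor']")[:3]
actors = [Actor(id=None, name=c.text.strip()) for c in cast_tags if c.text.strip()]
print([a.name for a in actors])  # ['Marlon Brando', 'Al Pacino', 'James Caan']
```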

Error Handling & Retry Logic

Robust Request Handling

All HTTP requests go through the make_request utility, which retries failed requests with increasing delays:
response = make_request(
    url=detail_url,
    proxy_provider=self.proxy_provider,
    tor_rotator=self.tor_rotator
)

if not response:
    logger.warning(f"Could not get a response for URL: {detail_url}")
    return None
Location: infrastructure/scraper/imdb_scraper.py:71

Retry Configuration

MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]  # Staggered retry delays in seconds
REQUEST_TIMEOUT = 10
BLOCK_CODES = [202, 403, 404, 429, 500]
Location: shared/config/config.py:50
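An illustrative sketch of how these settings could drive a retry loop; the project's actual make_request in utils.py may be structured differently:

```python
import time
from typing import Callable, Optional

MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]            # one delay per retry attempt
BLOCK_CODES = [202, 403, 404, 429, 500]

# Hypothetical retry wrapper; `send` is any callable that performs one
# request and returns an object with a .status_code (or None on failure).
def request_with_retries(send: Callable[[], Optional[object]]):
    for attempt in range(MAX_RETRIES):
        response = send()
        if response is not None and response.status_code not in BLOCK_CODES:
            return response                      # success: stop retrying
        if attempt < MAX_RETRIES - 1:
            time.sleep(RETRY_DELAYS[attempt])    # back off before next try
    return None                                  # all attempts exhausted
```

Note that 202 is treated as a block code: IMDb can return 202 with an anti-bot interstitial instead of real content, so it is retried like an outright rejection.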

Fallback Strategy

The request utility implements a multi-layer fallback:
  1. Primary: Premium proxy (DataImpulse)
  2. Fallback: TOR network with IP rotation
  3. Final: Direct connection through VPN
Location: infrastructure/scraper/utils.py:34
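The layered fallback amounts to a chain-of-responsibility over transport strategies. A minimal sketch, with hypothetical stand-ins for the proxy, TOR, and direct-connection paths (the real utils.py logic may interleave retries and rotation differently):

```python
from typing import Callable, Optional

# Hypothetical sketch of the three-layer fallback: try each transport in
# order, moving on when one fails or returns nothing.
def fetch_with_fallback(
    url: str,
    via_proxy: Callable[[str], Optional[object]],
    via_tor: Callable[[str], Optional[object]],
    direct: Callable[[str], Optional[object]],
):
    for layer in (via_proxy, via_tor, direct):
        try:
            response = layer(url)
            if response is not None:
                return response
        except Exception:
            continue  # fall through to the next layer
    return None
```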

Data Validation

Before persisting, the scraper validates extracted data:
try:
    movie = self._scrape_movie_detail(indexed_id)
    if movie:
        self.use_case.execute(movie)
except ValueError as e:
    logger.warning(f"Invalid data for {imdb_id}: {e}. Skipping save.")
except Exception as e:
    logger.error(f"Unexpected error while processing and saving {imdb_id}: {e}", exc_info=True)
Location: infrastructure/scraper/imdb_scraper.py:58

Traffic Monitoring

The scraper tracks bandwidth usage:
self.total_bytes_used += len(response.content)

# At completion:
logger.info(f"Total traffic used: {self.total_bytes_used / (1024 ** 2):.2f} MB")
Location: infrastructure/scraper/imdb_scraper.py:81
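The accounting is simple byte accumulation, reported in MB (MiB) at the end. The response sizes below are made-up stand-ins for `len(response.content)`:

```python
# Illustration of the bandwidth accounting: sum bytes per response, then
# convert to MB (MiB) once at the end. Sizes here are fabricated examples.
total_bytes_used = 0
for content_length in (1_500_000, 2_500_000, 250_000):
    total_bytes_used += content_length

print(f"Total traffic used: {total_bytes_used / (1024 ** 2):.2f} MB")  # 4.05 MB
```

Tracking this per run makes it easy to see how much paid proxy bandwidth (DataImpulse bills by traffic) a full Top 250 scrape consumes.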

Configuration Options

Key configuration options in shared/config/config.py:
# Scraping parameters
BASE_URL = "https://www.imdb.com"
CHART_TOP_PATH = "/chart/top/"
TITLE_DETAIL_PATH = "/title/{id}/"
NUM_MOVIES = 250

# Request settings
REQUEST_TIMEOUT = 10
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]

# Concurrency
MAX_THREADS = 50

# User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5) AppleWebKit/537.36 Chrome/90.0.4430.91 Mobile Safari/537.36"
]
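A sketch of how the User-Agent list could feed per-request header rotation; this assumes headers are rebuilt for every request, which may differ from the project's actual request helper:

```python
import random

# Abbreviated copy of the configured list (two entries shown).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
]

# Hypothetical helper: pick a random User-Agent for each outgoing request
# so consecutive requests do not share a fingerprint.
def build_headers() -> dict:
    return {"User-Agent": random.choice(USER_AGENTS)}
```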

Next Steps

Network Evasion

Learn about the multi-layer proxy and TOR setup

Concurrency

Explore parallel processing with ThreadPoolExecutor
