Architecture Overview
The scraping engine is built around the ImdbScraper class, which implements the ScraperInterface and follows Clean Architecture principles:
infrastructure/scraper/imdb_scraper.py:21
Dual-Source Data Collection
HTML Parsing Strategy
The scraper extracts movie IDs from the IMDb Top 250 chart using CSS selectors:
infrastructure/scraper/imdb_scraper.py:146
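As a rough illustration of the ID extraction step, here is a minimal sketch. The project itself uses BeautifulSoup with configured CSS selectors; to stay dependency-free, this sketch swaps in the standard library's html.parser. The sample markup, the `/title/tt…/` link pattern, and the names `TitleIdCollector` and `extract_movie_ids` are assumptions for illustration, not code from imdb_scraper.py.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: title links look like /title/tt0111161/...
TITLE_ID_RE = re.compile(r"/title/(tt\d+)/")


class TitleIdCollector(HTMLParser):
    """Collects unique title IDs from anchor hrefs, in page order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attribute values can be None (e.g. a bare "href"), hence the "or".
        href = dict(attrs).get("href") or ""
        match = TITLE_ID_RE.search(href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))


def extract_movie_ids(html):
    collector = TitleIdCollector()
    collector.feed(html)
    return collector.ids


# Hypothetical chart fragment; the third link repeats the first movie.
sample = """
<ul>
  <li><a href="/title/tt0111161/?ref_=chttp_t_1">The Shawshank Redemption</a></li>
  <li><a href="/title/tt0068646/?ref_=chttp_t_2">The Godfather</a></li>
  <li><a href="/title/tt0111161/?ref_=chttp_i_1">poster link</a></li>
</ul>
"""
print(extract_movie_ids(sample))
```

Deduplicating while preserving order matters here because chart rows typically link to the same title more than once (poster and title text), and the chart rank is implied by position.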
GraphQL API Integration
To supplement HTML data, the scraper queries IMDb’s GraphQL endpoint:
infrastructure/scraper/imdb_scraper.py:158
GraphQL Configuration:
- Endpoint: https://caching.graphql.imdb.com/
- Operation: Top250MoviesPagination
- Hash: 2db1d515844c69836ea8dc532d5bff27684fdce990c465ebf52d36d185a187b3
- Locale: en-US
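The configuration values above could be combined into a request URL roughly as follows. This is a sketch under assumptions: it presumes the hash is sent in the Apollo-style persisted-query format (`extensions.persistedQuery.sha256Hash`), and the variable names `first` and `locale` as well as the helper `build_graphql_url` are hypothetical. The block only builds the URL and does not contact the endpoint.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://caching.graphql.imdb.com/"
OPERATION = "Top250MoviesPagination"
QUERY_HASH = "2db1d515844c69836ea8dc532d5bff27684fdce990c465ebf52d36d185a187b3"


def build_graphql_url(first=250, locale="en-US"):
    # Variable names are assumptions; the real operation may expect others.
    variables = {"first": first, "locale": locale}
    # Assumed persisted-query envelope: the server looks the query up by hash.
    extensions = {"persistedQuery": {"sha256Hash": QUERY_HASH, "version": 1}}
    params = {
        "operationName": OPERATION,
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_graphql_url()
query = parse_qs(urlsplit(url).query)
print(query["operationName"][0])
```

Because the query text is referenced by hash rather than sent in full, a hash change on IMDb's side would invalidate the request, so keeping the hash in configuration rather than in code is a sensible choice.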
BeautifulSoup Selectors
The engine uses CSS selectors configured in shared/config/config.py.
Data Extraction Logic
infrastructure/scraper/imdb_scraper.py:85
Duration Parsing
The scraper handles IMDb’s varied duration formats (e.g., “2h 30m”, “1h 45m”, “90m”):
infrastructure/scraper/imdb_scraper.py:100
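One way to normalize these formats to total minutes is sketched below. The function name `parse_duration_minutes` and the choice to return None for unparseable text are assumptions, not necessarily what imdb_scraper.py does.

```python
import re

# Optional hour part, optional minute part, nothing else allowed.
DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$", re.IGNORECASE)


def parse_duration_minutes(text):
    """Convert strings such as '2h 30m', '2h' or '90m' to total minutes.

    Returns None when the text has neither an hour nor a minute part.
    """
    match = DURATION_RE.match(text or "")
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


for raw in ["2h 30m", "1h 45m", "90m"]:
    print(raw, "->", parse_duration_minutes(raw))
```

Returning None instead of raising keeps a single malformed runtime from aborting the whole chart scrape; the caller can decide whether to skip or flag the record.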
Actor Extraction
The scraper extracts the top 3 actors from each movie:
infrastructure/scraper/imdb_scraper.py:115
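A minimal sketch of the "top 3" selection, assuming the cast names have already been pulled from the page as a list of strings in billing order. The helper `top_actors` and its whitespace and duplicate handling are illustrative assumptions.

```python
def top_actors(names, limit=3):
    """Return the first `limit` distinct, non-empty names in billing order."""
    selected = []
    for raw in names:
        if len(selected) >= limit:
            break
        name = " ".join(raw.split())  # collapse stray whitespace from markup
        if name and name not in selected:
            selected.append(name)
    return selected


cast = ["Tim Robbins", " Morgan  Freeman ", "", "Tim Robbins", "Bob Gunton", "William Sadler"]
print(top_actors(cast))
```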
Error Handling & Retry Logic
Robust Request Handling
All HTTP requests use the make_request utility with exponential backoff:
infrastructure/scraper/imdb_scraper.py:71
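The backoff pattern can be sketched as below. This is not the project's make_request; `retry_with_backoff`, its parameters, and the doubling schedule (base_delay × 2^attempt) are assumptions chosen to illustrate the technique. The `sleep` argument is injectable so the example runs instantly.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry.

    Re-raises the last error once max_retries retries are exhausted.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt == max_retries:
                break
            sleep(base_delay * (2 ** attempt))
    raise last_error


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
delays = []


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, max_retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Doubling the wait between attempts gives a rate-limited or briefly unavailable server progressively more room to recover, rather than hammering it at a fixed interval.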
Retry Configuration
shared/config/config.py:50
Fallback Strategy
The request utility implements a multi-layer fallback:
- Primary: Premium proxy (DataImpulse)
- Fallback: TOR network with IP rotation
- Final: Direct connection through VPN
infrastructure/scraper/utils.py:34
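The ordering above can be expressed as a simple chain that tries each transport in turn. The sketch below uses stub functions in place of real proxy, TOR, and VPN clients; `fetch_with_fallback` and the stub names are hypothetical and do not reflect the actual code in utils.py.

```python
def fetch_with_fallback(url, transports):
    """Try each (name, fetch) pair in order; return (name, body) on first success."""
    errors = []
    for name, fetch in transports:
        try:
            return name, fetch(url)
        except Exception as exc:  # collect the reason and move to the next layer
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all transports failed: " + "; ".join(errors))


# Stub transports standing in for the three layers.
def via_proxy(url):
    raise ConnectionError("proxy quota exceeded")


def via_tor(url):
    raise TimeoutError("circuit timed out")


def via_vpn(url):
    return f"<html>fetched {url}</html>"


layers = [("proxy", via_proxy), ("tor", via_tor), ("vpn", via_vpn)]
used, body = fetch_with_fallback("https://www.imdb.com/chart/top/", layers)
print(used)
```

Keeping the layers as an ordered list makes the priority explicit and lets a deployment drop or reorder a layer without touching the retry logic.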
Data Validation
Before persisting, the scraper validates extracted data:
infrastructure/scraper/imdb_scraper.py:58
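The exact checks are not listed here, so the following is only a sketch of what pre-persistence validation might look like. The field names (`title`, `year`, `rating`), the accepted ranges, and the helper `validate_movie` are all assumptions.

```python
def validate_movie(movie):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    if not str(movie.get("title") or "").strip():
        problems.append("missing title")
    year = movie.get("year")
    # 1888 is the year of the earliest surviving film; the upper bound is arbitrary.
    if not isinstance(year, int) or not 1888 <= year <= 2100:
        problems.append("year out of range")
    rating = movie.get("rating")
    if not isinstance(rating, (int, float)) or not 0 <= rating <= 10:
        problems.append("rating out of range")
    return problems


good = {"title": "The Godfather", "year": 1972, "rating": 9.2}
bad = {"title": "  ", "year": "1972", "rating": 11}
print(validate_movie(good))
print(validate_movie(bad))
```

Returning every problem at once, rather than failing on the first, makes scrape logs more useful when a page layout change breaks several fields together.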
Traffic Monitoring
The scraper tracks bandwidth usage:
infrastructure/scraper/imdb_scraper.py:81
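Bandwidth tracking can be as simple as summing response body sizes. The `TrafficMonitor` class below is a hypothetical sketch, assuming usage is measured in bytes received per response; the real scraper may count differently (for example, including headers or request bytes).

```python
class TrafficMonitor:
    """Accumulates request counts and response sizes."""

    def __init__(self):
        self.requests = 0
        self.bytes_received = 0

    def record(self, body):
        """Register one response body (bytes)."""
        self.requests += 1
        self.bytes_received += len(body)

    def summary(self):
        return f"{self.requests} requests, {self.bytes_received / 1024:.1f} KiB"


monitor = TrafficMonitor()
monitor.record(b"x" * 2048)
monitor.record(b"y" * 512)
print(monitor.summary())
```

Tracking this matters mainly when the primary route is a metered proxy, where each byte transferred has a cost.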
Configuration Options
Key configuration options are defined in shared/config/config.py.
Next Steps
- Network Evasion: Learn about the multi-layer proxy and TOR setup
- Concurrency: Explore parallel processing with ThreadPoolExecutor