Yellow Taxi NYC Data Analytics
A Python-based analytics tool for processing, cleaning, and analyzing NYC Yellow Taxi trip data. This tool downloads parquet files directly from the NYC TLC Trip Record Data repository, processes millions of records, and generates comprehensive metrics reports.
What Does It Do?
The Yellow Taxi Data Analytics tool transforms raw NYC taxi trip data into actionable insights through an automated pipeline:
- Downloads monthly parquet files from NYC’s official data source
- Cleans data by removing duplicates, invalid trips, and outliers
- Processes millions of trip records with optimized pandas operations
- Generates weekly and monthly metrics across multiple dimensions
- Exports results to CSV and Excel formats for easy analysis
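As a rough illustration of the download step, the sketch below builds one URL per monthly file. The base URL and the file-name pattern are assumptions based on the publicly listed TLC trip-record files, and `monthly_urls` is a hypothetical helper, not necessarily how this tool does it.

```python
# Assumption: CDN base URL for the TLC trip-record files; verify on the TLC site.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_urls(year, first_month, last_month):
    """Return one yellow-taxi parquet URL per month in the inclusive range."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in range(first_month, last_month + 1)
    ]


# January to March 2022, the period used in the performance figures.
urls = monthly_urls(2022, 1, 3)
```

Each URL can then be passed to `pandas.read_parquet(url, columns=[...])` so that only the needed columns are loaded.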
Key Capabilities
Data Import & Cleaning
- Automatic download from NYC TLC Trip Record Data CDN
- Intelligent filtering of essential columns (datetime, distance, fare, passenger count)
- Data validation and quality checks:
  - Removes trips with invalid timestamps
  - Filters trips shorter than 60 seconds
  - Excludes trips exceeding 100 mph average speed
  - Validates fare amounts (5000 range)
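The validation rules above could be expressed in pandas roughly as follows. This is a minimal sketch, not the tool's actual code: the column names follow the public TLC yellow-taxi schema, `clean_trips`, `trip_time_s` and `avg_speed_mph` are hypothetical names, and the lower fare bound (strictly positive) is an assumption since only the 5000 limit is stated.

```python
import pandas as pd

# Assumed column names from the public TLC yellow-taxi schema.
PICKUP = "tpep_pickup_datetime"
DROPOFF = "tpep_dropoff_datetime"


def clean_trips(df, max_fare=5000.0):
    """Apply the quality checks: duplicates, timestamps, duration, speed, fare."""
    df = df.drop_duplicates().reset_index(drop=True)
    # Invalid timestamps: drop rows where either timestamp is missing.
    df = df.dropna(subset=[PICKUP, DROPOFF]).copy()
    df["trip_time_s"] = (df[DROPOFF] - df[PICKUP]).dt.total_seconds()
    # Keep trips of at least 60 seconds (this also removes negative durations).
    df = df[df["trip_time_s"] >= 60].copy()
    # Average speed in miles per hour; keep trips at or below 100 mph.
    df["avg_speed_mph"] = df["trip_distance"] / (df["trip_time_s"] / 3600)
    df = df[df["avg_speed_mph"] <= 100]
    # Assumption: fares must be positive and no greater than max_fare.
    df = df[(df["fare_amount"] > 0) & (df["fare_amount"] <= max_fare)]
    return df.reset_index(drop=True)
```

Computing the duration once and reusing it for the speed check avoids a second pass over the timestamp columns.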
Metrics Generation
Weekly Metrics:
- Trip time statistics (min, max, mean)
- Trip distance statistics
- Fare amount statistics
- Total service counts
- Week-over-week percentage variations
Monthly Metrics:
- Regular trips (RateCodeID: 1)
- JFK Airport trips (RateCodeID: 2)
- Other rate types
- Segmented by weekday vs. weekend
- Service counts, total distances, passenger counts
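The weekly statistics and the rate-type breakdown listed above might be computed along these lines. This is a sketch under stated assumptions: `trip_time_s` is a hypothetical precomputed duration column, the rate column is spelled `RatecodeID` as in the public TLC schema, and the function and output column names are illustrative only.

```python
import pandas as pd

PICKUP = "tpep_pickup_datetime"  # assumed TLC column name
RATE_LABELS = {1: "Regular", 2: "JFK"}  # every other code is grouped as "Other"


def weekly_metrics(df):
    """Per-week statistics; with freq="W" each week ends on Sunday."""
    weekly = df.groupby(pd.Grouper(key=PICKUP, freq="W")).agg(
        trip_time_min=("trip_time_s", "min"),
        trip_time_max=("trip_time_s", "max"),
        trip_time_mean=("trip_time_s", "mean"),
        distance_mean=("trip_distance", "mean"),
        fare_mean=("fare_amount", "mean"),
        services=("fare_amount", "size"),
    )
    # Week-over-week variation of the service count, in percent.
    weekly["services_pct_change"] = weekly["services"].pct_change() * 100
    return weekly


def monthly_by_rate_type(df):
    """Monthly totals per rate type, split into weekday and weekend trips."""
    out = df.copy()
    out["rate_type"] = out["RatecodeID"].map(RATE_LABELS).fillna("Other")
    is_weekend = out[PICKUP].dt.dayofweek >= 5  # Saturday=5, Sunday=6
    out["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})
    out["month"] = out[PICKUP].dt.to_period("M")
    return (
        out.groupby(["rate_type", "month", "day_type"])
        .agg(
            services=("trip_distance", "size"),
            total_distance=("trip_distance", "sum"),
            total_passengers=("passenger_count", "sum"),
        )
        .reset_index()
    )
```

The first week in any series has no predecessor, so its percentage variation is `NaN`.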
Export Formats
CSV Export (processed_data.csv):
- Pipe-delimited weekly metrics
- Complete time series with percentage variations
Excel Export (processed_data.xlsx):
- Multi-sheet workbook
- Separate sheets for JFK, Regular, and Other rate types
- Ready for pivot tables and further analysis
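Both export formats map onto standard pandas writers. The sketch below assumes a weekly DataFrame and a monthly DataFrame with a `rate_type` column; the function names are hypothetical, and writing `.xlsx` files requires an Excel engine such as `openpyxl` to be installed.

```python
import pandas as pd


def export_weekly_csv(weekly, path="processed_data.csv"):
    """Write weekly metrics as pipe-delimited CSV.

    When path is None, pandas returns the CSV text instead of writing a file.
    """
    return weekly.to_csv(path, sep="|")


def export_monthly_excel(monthly, path="processed_data.xlsx"):
    """Write one worksheet per rate type (for example JFK, Regular, Other)."""
    with pd.ExcelWriter(path) as writer:
        for rate_type, group in monthly.groupby("rate_type"):
            group.to_excel(writer, sheet_name=str(rate_type), index=False)
```

Keeping one sheet per rate type means each sheet can feed a pivot table without further filtering.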
Who Should Use This?
- Data Analysts studying NYC transportation patterns
- Researchers analyzing urban mobility trends
- Business Analysts evaluating taxi service metrics
- Students learning pandas and data processing techniques
- Developers building transportation analytics applications
Architecture Overview
The tool follows a sequential processing pipeline: import, cleaning, column generation, metrics calculation, and export.
Performance
Processing 3 months of data (January-March 2022) with millions of trip records:
- Total execution time: ~53 seconds
- Import: ~7.5 seconds
- Cleaning: ~5.6 seconds
- Column generation: ~30.6 seconds
- Metrics calculation: ~9.2 seconds
- Export: <0.1 seconds
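Per-stage timings like those above can be collected with a small context manager. This is only a sketch of one way to measure them; `timed` and `timings` are hypothetical names, and the stage body here is a placeholder rather than the tool's real import step.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


with timed("import"):
    sum(range(1000))  # placeholder for the real pipeline stage

total = sum(timings.values())
```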
Next Steps
- Quickstart: Get your first analysis running in under 60 seconds
- Installation: Set up your environment and install dependencies