Overview
The pipeline generates detailed profiling artifacts that break down performance at the operator level, enabling fine-grained optimization and bottleneck identification.
Generated Profiling Artifacts
After running run_all(), the pipeline creates three main profiling outputs:
operator_profile.csv
Per-chunk operator-level timing breakdown located in output_dir/profiles/:
- preprocess_s: Time spent in data cleaning and preprocessing
- feature_engineering_s: Time spent building derived features
- feature_selection_s: Time spent in multicollinearity detection and feature filtering
- encode_scale_s: Time spent in one-hot encoding and scaling
- estimated_input_bandwidth_mb_s: Calculated input data bandwidth (MB/s)
- input_bytes: Raw input data size for the chunk
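As a quick sanity check, the per-stage columns can be summarized with pandas. This is a minimal sketch assuming the column layout above; the helper name summarize_operator_profile is illustrative, not part of the pipeline.

```python
import pandas as pd

STAGE_COLS = ["preprocess_s", "feature_engineering_s",
              "feature_selection_s", "encode_scale_s"]

def summarize_operator_profile(df: pd.DataFrame) -> pd.Series:
    """Mean seconds per stage across all chunks, slowest first."""
    return df[STAGE_COLS].mean().sort_values(ascending=False)

# Typical usage (adjust output_dir to your run's output directory):
#   profile = pd.read_csv("output_dir/profiles/operator_profile.csv")
#   print(summarize_operator_profile(profile))
```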
streaming_chunks.csv
Chunk-level latency, throughput, and memory observations in output_dir/benchmarks/:
pipeline_report.json
Aggregate telemetry and operator profile summary in output_dir/reports/:
Operator-Level Profiling
Implementation
The _profile_stream_chunk() method in engine.py:84-106 measures wall-clock time for each pipeline stage:
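The shape of such a method can be sketched with time.perf_counter(). This is an illustrative reconstruction, not the actual engine.py implementation; the stage callables and the `stages` mapping are placeholders.

```python
import time

def profile_stream_chunk(chunk, stages):
    """Run each named stage on `chunk`, recording wall-clock seconds.

    `stages` is an ordered mapping of timing-column name -> callable, e.g.
    {"preprocess_s": preprocess, "encode_scale_s": encode_scale}.
    Returns (result, timings), where timings maps names to elapsed seconds.
    """
    timings = {}
    result = chunk
    for name, fn in stages.items():
        start = time.perf_counter()
        result = fn(result)
        timings[name] = time.perf_counter() - start
    return result, timings
```

Because perf_counter() is a monotonic high-resolution clock, the per-stage deltas are robust to system clock adjustments, but they remain user-space wall-clock measurements (see Limitations below).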
Identifying Bottlenecks
Use the operator profile to identify dominant stages.
Hardware Telemetry
HardwareMonitor Class
The HardwareMonitor class (monitor.py:16-76) provides fallback-safe hardware telemetry:
TelemetrySnapshot Structure
From monitor.py:8-13:
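A snapshot of this kind typically pairs a timestamp with point-in-time readings. The dataclass below is an illustrative assumption of what monitor.py:8-13 might look like; every field name here is hypothetical and should be checked against the actual source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TelemetrySnapshot:
    # Hypothetical fields -- the real monitor.py definition may differ.
    timestamp_s: float        # time the snapshot was taken
    cpu_percent: float        # instantaneous CPU utilization
    memory_mb: float          # resident memory at snapshot time
    energy_j: Optional[float] # None when RAPL counters are unavailable
```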
RAPL Energy Measurement
On Linux systems with Intel RAPL support, the monitor reads hardware energy counters directly (monitor.py:30-45). When RAPL is unavailable, it falls back to fixed average-power estimates:
- Batch mode: 45W average power
- Streaming mode: 30W average power
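A minimal sketch of this pattern: read the standard RAPL sysfs counter when present, otherwise estimate energy from the fixed average-power figures above. The sysfs path is the conventional Intel RAPL package-domain location; the actual path and fallback logic in monitor.py may differ.

```python
RAPL_ENERGY_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_rapl_energy_j():
    """Cumulative package energy in joules, or None if RAPL is absent."""
    try:
        with open(RAPL_ENERGY_PATH) as f:
            return int(f.read().strip()) / 1e6  # microjoules -> joules
    except (FileNotFoundError, PermissionError, OSError):
        return None

def estimate_energy_j(duration_s, streaming=False):
    """Fallback: energy = average power x duration, using the
    45 W (batch) / 30 W (streaming) assumptions listed above."""
    avg_power_w = 30.0 if streaming else 45.0
    return avg_power_w * duration_s
```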
Memory Hierarchy and Cache Effects
The project does not capture PMU (Performance Monitoring Unit) counters directly, but practical signals help identify memory bottlenecks.
Cache Pressure Indicators
- Increased encode_scale_s with larger chunks
  - One-hot encoding creates sparse matrices that stress cache
  - Try reducing chunk size if this stage dominates
- Rising end-to-end latency with stable throughput
  - May indicate memory copy overhead
  - Check for unnecessary DataFrame copies
- Divergence between bandwidth estimate and throughput
  - Could indicate storage or serialization bottlenecks
  - Monitor estimated_input_bandwidth_mb_s vs. actual I/O speed
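The third indicator can be checked mechanically. The sketch below flags chunks whose bandwidth estimate diverges from a measured I/O rate by more than a tolerance; the function name and the idea of an externally measured rate are illustrative, since streaming_chunks.csv's exact columns are not shown here.

```python
def bandwidth_diverges(est_bandwidth_mb_s, measured_io_mb_s, tol=0.5):
    """True if the estimate and the measured I/O rate disagree by more
    than `tol` as a fraction (e.g. 0.5 = 50%)."""
    if measured_io_mb_s <= 0:
        return True  # no usable measurement -> treat as suspect
    ratio = est_bandwidth_mb_s / measured_io_mb_s
    return abs(ratio - 1.0) > tol
```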
Bandwidth Estimation
From engine.py:282-283:
bandwidth = input_bytes_mb / latency_s
Interpretation:
- High bandwidth (> 1000 MB/s): Good cache utilization
- Low bandwidth (< 100 MB/s): Potential I/O or memory bottleneck
- Decreasing bandwidth over time: Growing memory pressure
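The formula and thresholds above translate directly into code. A small sketch, assuming decimal megabytes (1 MB = 10^6 bytes); the function names are illustrative:

```python
def bandwidth_mb_s(input_bytes, latency_s):
    """Estimated bandwidth: input megabytes over chunk latency."""
    return (input_bytes / 1e6) / latency_s

def classify_bandwidth(mb_s):
    """Apply the interpretation thresholds listed above."""
    if mb_s > 1000:
        return "good cache utilization"
    if mb_s < 100:
        return "potential I/O or memory bottleneck"
    return "nominal"
```

For example, a 50 MB chunk processed in 0.05 s yields 1000 MB/s, right at the upper threshold. A decreasing trend across chunks, rather than any single value, is the signal for growing memory pressure.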
Reproducible Profiling Command
To generate profiling artifacts with controlled parameters:
- --chunk-size: Size of streaming chunks (affects cache behavior)
- --max-memory-mb: Memory limit for adaptive sizing
- --max-compute-units: CPU constraint (0.0-1.0)
- --benchmark-runs: Number of profiling iterations
- --random-seed: Ensures reproducibility
Using Profiling Data for Optimization
Step 1: Identify Dominant Operator
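One way to carry out this step, sketched against the operator_profile.csv columns documented above (the helper name is illustrative):

```python
import pandas as pd

def dominant_operator(profile: pd.DataFrame) -> str:
    """Name of the stage with the largest total time across chunks."""
    stage_cols = ["preprocess_s", "feature_engineering_s",
                  "feature_selection_s", "encode_scale_s"]
    return profile[stage_cols].sum().idxmax()
```

Totals (rather than means) weight large chunks appropriately, since the stage that dominates overall runtime is the one worth optimizing first.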
Step 2: Analyze Chunk Size Impact
Compare operator times across different chunk sizes.
Step 3: Investigate Memory Patterns
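One way to investigate memory patterns is to fit a trend to per-chunk memory readings: a persistently positive slope suggests growing memory pressure (for example, accumulating DataFrame copies). The sketch below assumes streaming_chunks.csv exposes a per-chunk memory column; the column name "memory_mb" is an assumption to verify against the actual file.

```python
import pandas as pd

def memory_trend_mb_per_chunk(chunks: pd.DataFrame,
                              mem_col: str = "memory_mb") -> float:
    """Least-squares slope of memory usage over chunk index.

    `mem_col` is an assumed column name -- check streaming_chunks.csv
    for the real one.
    """
    y = chunks[mem_col].to_numpy(dtype=float)
    x = pd.RangeIndex(len(y)).to_numpy(dtype=float)
    # slope = cov(x, y) / var(x)
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / (x_c * x_c).sum())
```

A slope near zero indicates steady-state memory use; a large positive slope, combined with the rising-latency indicator above, points at copy overhead.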
Limitations
From the source documentation:
- No GPU profiling: GPU kernels are not currently measured
- User-space timing: All timing is wall-clock time in Python user space
- No quantization path: Direct quantization is not implemented
- Platform-dependent energy: RAPL counters only available on Intel Linux systems
- No PMU counters: Cache misses, branch mispredictions not captured
Next Steps
- Benchmarking - Full benchmark methodology
- Optimization Strategies - Tuning based on profiling data
- Constraint Experiments - Testing under resource limits