What is Lead Scoring?
Lead scoring is a predictive model that assigns a numerical value to potential customers (leads) based on their likelihood to convert into paying customers. This project develops a machine learning-based lead scoring system for an Event Management SaaS application. The model predicts the probability of conversion by analyzing historical data from the client’s sales process, helping prioritize high-value leads and optimize sales efforts.
Business Problem
The client operates an Event Management SaaS platform with a two-phase sales process:
- Lead Generation Phase - Capturing potential clients through various acquisition channels
- Offer Phase - Qualified leads who reach the demo meeting stage
The Sales Pipeline
Lead Acquisition
Potential clients enter the system through various sources (Inbound, Outbound) and are tracked in the leads.csv dataset with information about their use case, source, and geographic location.
Lead Qualification
Leads are evaluated and may be discarded, nurtured, or advanced to the demo meeting stage based on qualification criteria.
Demo & Offer
Qualified leads receive product demos and formal offers, tracked in the offers.csv dataset with pricing, pain points, and eventual outcomes.
Machine Learning Pipeline
The lead scoring system follows a comprehensive ML pipeline:
1. Data Integration
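The source does not describe how the two datasets are combined. As a minimal sketch, assuming leads.csv and offers.csv share a key column (here a hypothetical `Id`), the integration step could be a pandas merge. The small in-memory frames stand in for the two files, and every column name is an assumption made for illustration.

```python
import pandas as pd

# Tiny in-memory stand-ins for leads.csv and offers.csv.
# All column names (Id, Use Case, Source, Price, Status) are assumptions.
leads = pd.DataFrame({
    "Id": ["a1", "a2", "a3"],
    "Use Case": ["Conferences", "Corporate Events", "Conferences"],
    "Source": ["Inbound", "Outbound", "Inbound"],
})
offers = pd.DataFrame({
    "Id": ["a1", "a3"],
    "Price": [1200.0, 800.0],
    "Status": ["Closed Won", "Closed Lost"],
})

# Inner join keeps only leads that reached the offer phase,
# since only those rows carry a Status label to learn from.
# validate raises if the key is duplicated on either side.
merged = leads.merge(offers, on="Id", how="inner", validate="one_to_one")
print(merged)
```

An inner join is only one option. A left join would keep leads that never received an offer, which may be useful if those should be scored as well.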
2. Data Preprocessing
Key preprocessing steps include:
- Missing value handling - Imputation strategies for categorical and numerical features
- Feature engineering - Extracting temporal features from dates (year, month)
- Column selection - Removing irrelevant or redundant features
- Target mapping - Consolidating minority classes into “Other” category
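The four preprocessing steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the project's actual code: the column names, the sample values, and the specific imputation choices (a constant for categoricals, the median for numerics) are all hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
df = pd.DataFrame({
    "Created Date": ["2023-01-15", "2023-03-02", None, "2024-07-21"],
    "Source": ["Inbound", None, "Outbound", "Inbound"],
    "Price": [1200.0, None, 800.0, 1000.0],
    "Notes": ["x", "y", "z", "w"],  # treated as irrelevant here
    "Status": ["Closed Won", "Closed Lost", "Sent", "Negotiation"],
})

# Missing value handling: constant placeholder for categoricals,
# median for numerics (both choices are assumptions).
df["Source"] = df["Source"].fillna("Unknown")
df["Price"] = df["Price"].fillna(df["Price"].median())

# Feature engineering: extract year and month from the date column.
# Rows with a missing date keep NaN in both derived columns.
created = pd.to_datetime(df["Created Date"])
df["Created Year"] = created.dt.year
df["Created Month"] = created.dt.month

# Column selection: drop the raw date and the irrelevant column.
df = df.drop(columns=["Created Date", "Notes"])

# Target mapping: keep the two main classes, fold the rest into "Other".
main_classes = {"Closed Won", "Closed Lost"}
df["Status"] = df["Status"].where(df["Status"].isin(main_classes), "Other")
print(df)
```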
3. Feature Transformation
- Categorical Encoding
- Numerical Scaling
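The source names the two transformations but not the specific encoders. One common way to apply both is scikit-learn's `ColumnTransformer`, sketched here with one-hot encoding and standard scaling. Both choices, and the column names, are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature frame with one categorical and one numerical column.
X = pd.DataFrame({
    "Source": ["Inbound", "Outbound", "Inbound", "Outbound"],
    "Price": [1200.0, 800.0, 1000.0, 1000.0],
})

preprocessor = ColumnTransformer(
    [
        # Categorical encoding: one binary column per category;
        # categories unseen at fit time encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Source"]),
        # Numerical scaling: zero mean, unit variance.
        ("num", StandardScaler(), ["Price"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_t = preprocessor.fit_transform(X)
print(X_t.shape)  # two one-hot columns plus one scaled column
```

Fitting the transformer on training data only, then reusing it at prediction time, keeps the encoding consistent and avoids leaking test-set statistics into the scaler.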
4. Model Selection & Training
Multiple classification algorithms are evaluated using cross-validation:
- Random Forest
- AdaBoost
- Extra Trees
- Bagging Classifier
- Gradient Boosting (selected as best performer)
- Decision Tree
- Naive Bayes
- K-Nearest Neighbors
- Logistic Regression
- SGD Classifier
- MLP Classifier
- Support Vector Machine
The Gradient Boosting model achieved the highest cross-validation score of 0.91 and was selected as the final model.
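A comparison of this kind can be sketched with scikit-learn's `cross_val_score`. The example uses synthetic three-class data and only three of the listed algorithms to stay short. The fold count, the scoring metric, and the hyperparameters are assumptions, and the scores it prints say nothing about the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the prepared lead features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# A subset of the candidate models; hyperparameters are illustrative.
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest mean cross-validation score.
best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Selected:", best_name)
```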
5. Model Evaluation
The trained model produces:
- Accuracy: 90.4%
- Detailed precision, recall, and F1-scores for each class
- Probability distributions for lead prioritization
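These three outputs map onto standard scikit-learn calls. The sketch below trains a Gradient Boosting classifier on synthetic data and computes each output. The data, the split ratio, and the hyperparameters are assumptions, so the numbers will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the prepared lead features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
# Stratified split keeps class proportions equal in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy on the held-out set.
accuracy = accuracy_score(y_test, y_pred)

# Per-class precision, recall, and F1-score as a text table.
report = classification_report(y_test, y_pred)

# One probability per class for every lead; each row sums to 1.
proba = model.predict_proba(X_test)

print(f"Accuracy: {accuracy:.3f}")
print(report)
```

The columns of `proba` follow the order of `model.classes_`, which is the lookup to use when picking out the probability of a specific outcome.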
Target Variable: Status
The model predicts one of three outcomes:
Closed Won
The lead successfully converted into a paying customer. This is the primary positive outcome the model aims to predict.
Closed Lost
The lead did not convert and the opportunity was lost. Understanding these cases helps identify risk factors.
Other
Minority status categories grouped together to address class imbalance. These represent edge cases in the sales process.
Key Use Cases
The lead scoring model enables several business applications:
- Lead Prioritization - Rank leads by conversion probability to focus sales efforts
- Resource Allocation - Assign appropriate resources based on lead quality
- Campaign Optimization - Identify which acquisition channels produce high-quality leads
- Risk Assessment - Detect early warning signs of potential losses
- Sales Forecasting - Predict pipeline conversion rates more accurately
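Several of these applications reduce to simple operations on the predicted "Closed Won" probability. The sketch below uses a hypothetical table of already-scored leads. The column names and values are invented for illustration, and in practice `p_won` would come from the model's probability output.

```python
import pandas as pd

# Hypothetical scored leads; p_won is the predicted Closed Won probability.
scored = pd.DataFrame({
    "lead_id": ["a1", "a2", "a3", "a4"],
    "source": ["Inbound", "Outbound", "Inbound", "Outbound"],
    "p_won": [0.82, 0.35, 0.64, 0.10],
})

# Lead prioritization: highest conversion probability first.
ranked = scored.sort_values("p_won", ascending=False).reset_index(drop=True)

# Campaign optimization: mean predicted conversion probability per channel.
by_source = scored.groupby("source")["p_won"].mean()

# Sales forecasting: expected number of wins is the sum of probabilities.
# This is only meaningful if the probabilities are well calibrated.
expected_wins = scored["p_won"].sum()

print(ranked)
print(by_source)
print(f"Expected wins: {expected_wins:.2f}")
```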
Data Flow Architecture
Next Steps
Learn about the dataset structure and field definitions