What is Lead Scoring?

Lead scoring is a predictive model that assigns a numerical value to potential customers (leads) based on their likelihood to convert into paying customers. This project develops a machine learning-based lead scoring system for an Event Management SaaS application.
The model predicts the probability of conversion by analyzing historical data from the client’s sales process, helping prioritize high-value leads and optimize sales efforts.

Business Problem

The client operates an Event Management SaaS platform with a two-phase sales process:
  1. Lead Generation Phase - Capturing potential clients through various acquisition channels
  2. Offer Phase - Presenting offers to qualified leads who reach the demo meeting stage
The goal is to predict which leads are most likely to convert to paying customers, enabling the sales team to focus resources effectively.

The Sales Pipeline

  1. Lead Acquisition - Potential clients enter the system through various sources (Inbound, Outbound) and are tracked in the leads.csv dataset with information about their use case, source, and geographic location.
  2. Lead Qualification - Leads are evaluated and may be discarded, nurtured, or advanced to the demo meeting stage based on qualification criteria.
  3. Demo & Offer - Qualified leads receive product demos and formal offers, tracked in the offers.csv dataset with pricing, pain points, and eventual outcomes.
  4. Conversion Decision - The final status is determined: Closed Won (converted), Closed Lost (not converted), or Other (minority statuses).

Machine Learning Pipeline

The lead scoring system follows a comprehensive ML pipeline:

1. Data Integration

```python
# Merge the leads and offers datasets on their shared identifier
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')
```
The two datasets are merged using unique identifiers to create a unified view of each lead’s journey from initial contact to final outcome.
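A minimal, self-contained sketch of this join, using toy stand-ins for the two datasets (the `Source` and `Status` columns come from the datasets described above; the row values are illustrative). Passing `indicator=True` is an optional pandas feature that flags offers with no matching lead, which is worth auditing after any left join:

```python
import pandas as pd

# Toy stand-ins for leads.csv and offers.csv (row values are illustrative)
leads_data_cleaned = pd.DataFrame({
    'Id': [1, 2, 3],
    'Source': ['Inbound', 'Outbound', 'Inbound'],
})
offers_data = pd.DataFrame({
    'Id': [1, 3],
    'Status': ['Closed Won', 'Closed Lost'],
})

# Left-join on the shared identifier; indicator=True adds a '_merge'
# column so unmatched offers can be counted and investigated
full_dataset = pd.merge(offers_data, leads_data_cleaned,
                        on='Id', how='left', indicator=True)
unmatched = (full_dataset['_merge'] == 'left_only').sum()
```

Every offer row is preserved, and leads that never reached the offer phase simply do not appear in the merged view.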

2. Data Preprocessing

Key preprocessing steps include:
  • Missing value handling - Imputation strategies for categorical and numerical features
  • Feature engineering - Extracting temporal features from dates (year, month)
  • Column selection - Removing irrelevant or redundant features
  • Target mapping - Consolidating minority classes into “Other” category
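The preprocessing steps above can be sketched as follows. Column names such as `Created Date` are hypothetical here; only `City` and `Status` appear in the document's own feature list:

```python
import pandas as pd

df = pd.DataFrame({
    'Created Date': ['2023-01-15', '2023-06-02', None],
    'City': ['Berlin', None, 'Madrid'],
    'Status': ['Closed Won', 'Closed Lost', 'Nurturing'],
})

# Missing value handling: fill missing categoricals with an explicit level
df['City'] = df['City'].fillna('Unknown')

# Feature engineering: extract year and month from the date column
dates = pd.to_datetime(df['Created Date'])
df['Created Year'] = dates.dt.year
df['Created Month'] = dates.dt.month

# Target mapping: consolidate minority statuses into an 'Other' class
major = {'Closed Won', 'Closed Lost'}
df['Status'] = df['Status'].where(df['Status'].isin(major), 'Other')
```

Each step is a one-liner in pandas, which keeps the preprocessing auditable before any encoding happens.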

3. Feature Transformation

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer codes
categorical_columns = ['Source', 'City', 'Loss Reason',
                       'Pain', 'Discount code', 'Status', 'Use Case']

for column in categorical_columns:
    label_encoder = LabelEncoder()  # fresh encoder per column
    full_dataset_preprocessed[column] = label_encoder.fit_transform(
        full_dataset_preprocessed[column]
    )
```
Categorical features are encoded into integer values for model compatibility. Using a fresh encoder per column keeps each column's label mapping available for decoding predictions later.

4. Model Selection & Training

Multiple classification algorithms are evaluated using cross-validation:
  • Random Forest
  • AdaBoost
  • Extra Trees
  • Bagging Classifier
  • Gradient Boosting (selected as best performer)
  • Decision Tree
  • Naive Bayes
  • K-Nearest Neighbors
  • Logistic Regression
  • SGD Classifier
  • MLP Classifier
  • Support Vector Machine
The Gradient Boosting model achieved the highest cross-validation score of 0.91 and was selected as the final model.
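A condensed sketch of this model-comparison step, on synthetic three-class data and with three of the listed candidates (the full evaluation covers all twelve). Hyperparameters and data here are illustrative, not the project's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed lead dataset (3 target classes)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

candidates = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validation accuracy per candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
```

Selecting by mean cross-validation score, as done here, is how Gradient Boosting emerged as the best performer in the project.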

5. Model Evaluation

The trained model outputs per-class probabilities, from which hard predictions are derived:
```python
import numpy as np

# Class probabilities for each test lead; argmax picks the most likely class
y_probabilities = best_model.predict_proba(X_test)
y_predicted = np.argmax(y_probabilities, axis=1)
```
Performance Metrics:
  • Accuracy: 90.4%
  • Detailed precision, recall, and F1-scores for each class
  • Probability distributions for lead prioritization
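An end-to-end sketch of this evaluation step on synthetic three-class data (the real pipeline uses the preprocessed lead dataset; the 90.4% figure comes from that data, not this toy example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed lead dataset (3 target classes)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Probabilities per class; argmax recovers hard predictions
y_probabilities = best_model.predict_proba(X_test)
y_predicted = np.argmax(y_probabilities, axis=1)

acc = accuracy_score(y_test, y_predicted)
report = classification_report(y_test, y_predicted)
```

`classification_report` provides the per-class precision, recall, and F1-scores listed above, while the raw probabilities feed directly into lead prioritization.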

Target Variable: Status

The model predicts one of three outcomes:
  • Closed Won - The lead successfully converted into a paying customer. This is the primary positive outcome the model aims to predict.
  • Closed Lost - The lead did not convert and the opportunity was lost. Understanding these cases helps identify risk factors.
  • Other - Minority status categories grouped together to address class imbalance. These represent edge cases in the sales process.

Key Use Cases

The lead scoring model enables several business applications:
  1. Lead Prioritization - Rank leads by conversion probability to focus sales efforts
  2. Resource Allocation - Assign appropriate resources based on lead quality
  3. Campaign Optimization - Identify which acquisition channels produce high-quality leads
  4. Risk Assessment - Detect early warning signs of potential losses
  5. Sales Forecasting - Predict pipeline conversion rates more accurately
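The first two use cases can be sketched with a few lines of pandas. The lead IDs, probabilities, and tier thresholds below are all hypothetical, purely to illustrate ranking leads by predicted conversion probability and bucketing them for resource allocation:

```python
import numpy as np
import pandas as pd

# Hypothetical scored leads: model probability of 'Closed Won' per lead
lead_ids = ['L-101', 'L-102', 'L-103', 'L-104']
p_won = np.array([0.12, 0.87, 0.55, 0.91])

# Lead prioritization: rank by conversion probability, highest first
scored = (pd.DataFrame({'Id': lead_ids, 'p_won': p_won})
          .sort_values('p_won', ascending=False)
          .reset_index(drop=True))

# Resource allocation: bucket leads into tiers (thresholds are illustrative)
scored['tier'] = pd.cut(scored['p_won'], bins=[0, 0.3, 0.7, 1.0],
                        labels=['low', 'medium', 'high'])
```

The sales team works the `high` tier first, while `low`-tier leads can be routed to automated nurturing.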

Data Flow Architecture

Next Steps

Learn about the dataset structure and field definitions
