The Metrics module provides comprehensive evaluation metrics for assessing machine learning model performance. It includes metrics for classification, regression, and clustering tasks.
Overview
The metrics module offers evaluation functions for:
Classification: accuracy, precision, recall, F1-score, ROC-AUC
Regression: MSE, MAE, R², RMSE, MAPE
Clustering: silhouette score, adjusted Rand index, mutual information
Visualization: confusion matrix, ROC curve, precision-recall curve
Key Features
Comprehensive Metrics: All standard ML evaluation metrics in one place.
Scikit-learn Compatible: Familiar API matching scikit-learn conventions.
Multi-class Support: Handles binary, multi-class, and multi-label problems.
Detailed Reports: Generate comprehensive classification reports.
Classification Metrics
Accuracy
```typescript
import { accuracy } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 1, 1, 0, 1, 0]);
const yPred = tensor([0, 1, 0, 0, 1, 1]);

const acc = accuracy(yTrue, yPred);
console.log(acc); // 0.6667 (4/6 correct)
```
Precision, Recall, F1-Score
```typescript
import { precision, recall, f1Score } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 1, 1, 0, 1, 0]);
const yPred = tensor([0, 1, 0, 0, 1, 1]);

// Binary classification
const prec = precision(yTrue, yPred);
const rec = recall(yTrue, yPred);
const f1 = f1Score(yTrue, yPred);

console.log(`Precision: ${prec.toFixed(3)}`);
console.log(`Recall: ${rec.toFixed(3)}`);
console.log(`F1-Score: ${f1.toFixed(3)}`);

// Multi-class with averaging
const yTrueMulti = tensor([0, 1, 2, 0, 1, 2]);
const yPredMulti = tensor([0, 2, 1, 0, 1, 2]);

const precMacro = precision(yTrueMulti, yPredMulti, { average: 'macro' });
const precWeighted = precision(yTrueMulti, yPredMulti, { average: 'weighted' });
```
Confusion Matrix
```typescript
import { confusionMatrix } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 1, 2, 0, 1, 2]);
const yPred = tensor([0, 2, 1, 0, 0, 2]);

const cm = confusionMatrix(yTrue, yPred);
console.log(cm);
// [[2, 0, 0],
//  [1, 0, 1],
//  [0, 1, 1]]

// Visualize confusion matrix
import { plotConfusionMatrix } from 'deepbox/plot';
plotConfusionMatrix(cm, ['Class 0', 'Class 1', 'Class 2']);
```
Classification Report
```typescript
import { classificationReport } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 1, 2, 0, 1, 2]);
const yPred = tensor([0, 2, 1, 0, 0, 2]);

const report = classificationReport(yTrue, yPred, {
  labels: [0, 1, 2],
  targetNames: ['Class 0', 'Class 1', 'Class 2']
});
console.log(report);
// Prints precision, recall, f1-score, and support for each class
```
ROC Curve and AUC
```typescript
import { rocCurve, rocAucScore } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 0, 1, 1]);
const yScore = tensor([0.1, 0.4, 0.35, 0.8]);

// ROC curve
const { fpr, tpr, thresholds } = rocCurve(yTrue, yScore);

// AUC score
const auc = rocAucScore(yTrue, yScore);
console.log(`AUC: ${auc.toFixed(3)}`);

// Visualize ROC curve
import { plotRocCurve } from 'deepbox/plot';
plotRocCurve(fpr, tpr, auc);
```
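ROC-AUC has a useful interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties counting half). As an illustration, here is a from-scratch sketch of that pairwise computation in plain TypeScript (independent of deepbox), using the same toy data as above:

```typescript
// AUC as the probability that a positive example outranks a negative one.
// Ties count as half a win. Plain arrays, no library calls.
function aucByPairs(yTrue: number[], yScore: number[]): number {
  const pos = yScore.filter((_, i) => yTrue[i] === 1);
  const neg = yScore.filter((_, i) => yTrue[i] === 0);
  let wins = 0;
  for (const p of pos) {
    for (const n of neg) {
      if (p > n) wins += 1;
      else if (p === n) wins += 0.5;
    }
  }
  return wins / (pos.length * neg.length);
}

console.log(aucByPairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]));
// 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```

The only mis-ranked pair here is (0.4, 0.35), where the negative outscores a positive.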
Precision-Recall Curve
```typescript
import { precisionRecallCurve, averagePrecisionScore } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 0, 1, 1]);
const yScore = tensor([0.1, 0.4, 0.35, 0.8]);

const { precision, recall, thresholds } = precisionRecallCurve(yTrue, yScore);
const ap = averagePrecisionScore(yTrue, yScore);
console.log(`Average Precision: ${ap.toFixed(3)}`);

// Visualize
import { plotPrecisionRecallCurve } from 'deepbox/plot';
plotPrecisionRecallCurve(precision, recall, ap);
```
Additional Classification Metrics
```typescript
import {
  balancedAccuracyScore,
  cohenKappaScore,
  matthewsCorrcoef,
  hammingLoss,
  jaccardScore,
  logLoss
} from 'deepbox/metrics';

// yTrue and yPred as in the examples above

// Balanced accuracy (good for imbalanced datasets)
const balAcc = balancedAccuracyScore(yTrue, yPred);

// Cohen's kappa (inter-rater agreement)
const kappa = cohenKappaScore(yTrue, yPred);

// Matthews correlation coefficient
const mcc = matthewsCorrcoef(yTrue, yPred);

// Hamming loss (fraction of wrong labels)
const hLoss = hammingLoss(yTrue, yPred);

// Jaccard similarity
const jaccard = jaccardScore(yTrue, yPred);

// Log loss (requires probability predictions)
const yProba = tensor([[0.9, 0.1], [0.2, 0.8], /* ... */]);
const logloss = logLoss(yTrue, yProba);
```
Regression Metrics
Mean Squared Error (MSE)
```typescript
import { mse } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([3, -0.5, 2, 7]);
const yPred = tensor([2.5, 0.0, 2, 8]);

const mseValue = mse(yTrue, yPred);
console.log(`MSE: ${mseValue.toFixed(4)}`);
```
Root Mean Squared Error (RMSE)
```typescript
import { rmse } from 'deepbox/metrics';

// yTrue and yPred as above
const rmseValue = rmse(yTrue, yPred);
console.log(`RMSE: ${rmseValue.toFixed(4)}`);
```
Mean Absolute Error (MAE)
```typescript
import { mae } from 'deepbox/metrics';

// yTrue and yPred as above
const maeValue = mae(yTrue, yPred);
console.log(`MAE: ${maeValue.toFixed(4)}`);
```
R² Score (Coefficient of Determination)
```typescript
import { r2Score } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([3, -0.5, 2, 7]);
const yPred = tensor([2.5, 0.0, 2, 8]);

const r2 = r2Score(yTrue, yPred);
console.log(`R²: ${r2.toFixed(4)}`);
// Perfect prediction: r2 = 1.0
// Model as good as predicting the mean: r2 = 0.0
// Worse than the mean: r2 < 0.0
```
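Those interpretation rules follow directly from the definition R² = 1 − SSres/SStot, where SSres is the sum of squared residuals and SStot the total sum of squares around the mean of the true values. A minimal from-scratch sketch in plain TypeScript (independent of deepbox) makes the formula explicit:

```typescript
// R² = 1 - (sum of squared residuals) / (total sum of squares around the mean)
function r2(yTrue: number[], yPred: number[]): number {
  const mean = yTrue.reduce((a, b) => a + b, 0) / yTrue.length;
  const ssRes = yTrue.reduce((s, y, i) => s + (y - yPred[i]) ** 2, 0);
  const ssTot = yTrue.reduce((s, y) => s + (y - mean) ** 2, 0);
  return 1 - ssRes / ssTot;
}

console.log(r2([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]).toFixed(4)); // 0.9486
```

When the model predicts the mean everywhere, SSres equals SStot and R² is 0; any model doing worse than that goes negative.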
Additional Regression Metrics
```typescript
import {
  adjustedR2Score,
  explainedVarianceScore,
  maxError,
  medianAbsoluteError,
  mape
} from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([3, -0.5, 2, 7]);
const yPred = tensor([2.5, 0.0, 2, 8]);

// Adjusted R² (accounts for number of features)
const adjR2 = adjustedR2Score(yTrue, yPred, 5); // 5 features

// Explained variance
const explVar = explainedVarianceScore(yTrue, yPred);

// Maximum absolute error
const maxErr = maxError(yTrue, yPred);

// Median absolute error (robust to outliers)
const medAE = medianAbsoluteError(yTrue, yPred);

// Mean absolute percentage error
const mapeValue = mape(yTrue, yPred);
```
Clustering Metrics
Silhouette Score
```typescript
import { silhouetteScore, silhouetteSamples } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const X = tensor([
  [1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]
]);
const labels = tensor([0, 0, 1, 1, 0, 1]);

// Overall silhouette score (-1 to 1, higher is better)
const score = silhouetteScore(X, labels);
console.log(`Silhouette Score: ${score.toFixed(3)}`);

// Per-sample silhouette scores
const sampleScores = silhouetteSamples(X, labels);
```
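For intuition, the per-sample silhouette is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the rest of its own cluster and b(i) is the lowest mean distance to any other cluster. Below is a from-scratch sketch in plain TypeScript (independent of deepbox) that assumes Euclidean distance and clusters with at least two points:

```typescript
// Mean silhouette over all samples:
// a(i) = mean distance to points in the same cluster (excluding i),
// b(i) = lowest mean distance to the points of any other cluster.
function silhouette(X: number[][], labels: number[]): number {
  const dist = (p: number[], q: number[]) =>
    Math.hypot(...p.map((v, d) => v - q[d]));
  const clusters = [...new Set(labels)];
  let total = 0;
  for (let i = 0; i < X.length; i++) {
    const meanDistTo = (c: number) => {
      const idx = labels
        .map((l, j) => (l === c && j !== i ? j : -1))
        .filter(j => j >= 0);
      return idx.reduce((s, j) => s + dist(X[i], X[j]), 0) / idx.length;
    };
    const a = meanDistTo(labels[i]);
    const b = Math.min(...clusters.filter(c => c !== labels[i]).map(meanDistTo));
    total += (b - a) / Math.max(a, b);
  }
  return total / X.length; // in [-1, 1], higher is better
}

const X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]];
console.log(silhouette(X, [0, 0, 1, 1, 0, 1])); // well above 0: separated clusters
```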
Adjusted Rand Index
```typescript
import { adjustedRandScore } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const labelsTrue = tensor([0, 0, 1, 1, 2, 2]);
const labelsPred = tensor([0, 0, 1, 1, 1, 2]);

// Measure similarity between two clusterings (1.0 = perfect match)
const ari = adjustedRandScore(labelsTrue, labelsPred);
```
Mutual Information
```typescript
import {
  adjustedMutualInfoScore,
  normalizedMutualInfoScore
} from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const labelsTrue = tensor([0, 0, 1, 1, 2, 2]);
const labelsPred = tensor([0, 0, 1, 1, 1, 2]);

// Adjusted mutual information
const ami = adjustedMutualInfoScore(labelsTrue, labelsPred);

// Normalized mutual information
const nmi = normalizedMutualInfoScore(labelsTrue, labelsPred);
```
Additional Clustering Metrics
```typescript
import {
  homogeneityScore,
  completenessScore,
  vMeasureScore,
  fowlkesMallowsScore,
  calinskiHarabaszScore,
  daviesBouldinScore
} from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const labelsTrue = tensor([0, 0, 1, 1, 2, 2]);
const labelsPred = tensor([0, 0, 1, 1, 1, 2]);

// Homogeneity: each cluster contains only members of a single class
const homogeneity = homogeneityScore(labelsTrue, labelsPred);

// Completeness: all members of a class are in the same cluster
const completeness = completenessScore(labelsTrue, labelsPred);

// V-measure: harmonic mean of homogeneity and completeness
const vMeasure = vMeasureScore(labelsTrue, labelsPred);

// Fowlkes-Mallows score
const fmi = fowlkesMallowsScore(labelsTrue, labelsPred);

// Calinski-Harabasz index (requires the data matrix X from the silhouette example)
const ch = calinskiHarabaszScore(X, labelsPred);

// Davies-Bouldin index (lower is better)
const db = daviesBouldinScore(X, labelsPred);
```
Use Cases
Compare multiple models using metrics:
```typescript
import { accuracy, f1Score } from 'deepbox/metrics';
import { LogisticRegression, RandomForestClassifier } from 'deepbox/ml';

const models = [
  new LogisticRegression(),
  new RandomForestClassifier({ nEstimators: 100 })
];

for (const model of models) {
  model.fit(XTrain, yTrain);
  const yPred = model.predict(XTest);
  const acc = accuracy(yTest, yPred);
  const f1 = f1Score(yTest, yPred);
  console.log(`${model.constructor.name}:`);
  console.log(`  Accuracy: ${(acc * 100).toFixed(2)}%`);
  console.log(`  F1-Score: ${f1.toFixed(3)}`);
}
```
Find the optimal classification threshold:
```typescript
import { precisionRecallCurve, f1Score } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const yTrue = tensor([0, 0, 1, 1, 1]);
const yScore = tensor([0.1, 0.4, 0.35, 0.8, 0.9]);

const { precision, recall, thresholds } = precisionRecallCurve(yTrue, yScore);

// Find the threshold that maximizes F1-score
let bestF1 = 0;
let bestThreshold = 0.5;
for (let i = 0; i < thresholds.size; i++) {
  const yPred = yScore.greater(thresholds.at(i));
  const f1 = f1Score(yTrue, yPred);
  if (f1 > bestF1) {
    bestF1 = f1;
    bestThreshold = thresholds.at(i);
  }
}
console.log(`Best threshold: ${bestThreshold}`);
```
Evaluate clustering quality:
```typescript
import { KMeans } from 'deepbox/ml';
import { silhouetteScore } from 'deepbox/metrics';
import { tensor } from 'deepbox/ndarray';

const X = tensor([/* ... */]); // Your data

// Try different numbers of clusters
const scores = [];
for (let k = 2; k <= 10; k++) {
  const kmeans = new KMeans({ nClusters: k });
  kmeans.fit(X);
  const labels = kmeans.labels();
  const score = silhouetteScore(X, labels);
  scores.push({ k, score });
}

// Find the optimal k
const best = scores.reduce((a, b) => (a.score > b.score ? a : b));
console.log(`Optimal k: ${best.k}`);
```
Complete Evaluation Example
```typescript
import {
  accuracy,
  precision,
  recall,
  f1Score,
  confusionMatrix,
  classificationReport,
  rocAucScore
} from 'deepbox/metrics';
import { RandomForestClassifier } from 'deepbox/ml';
import { trainTestSplit } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

// Load data
const X = tensor([/* ... */]);
const y = tensor([/* ... */]);

// Split data
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(X, y, {
  testSize: 0.2,
  randomState: 42
});

// Train model
const model = new RandomForestClassifier({ nEstimators: 100 });
model.fit(XTrain, yTrain);

// Predictions
const yPred = model.predict(XTest);
const yProba = model.predictProba(XTest);

// Compute metrics
const acc = accuracy(yTest, yPred);
const prec = precision(yTest, yPred, { average: 'weighted' });
const rec = recall(yTest, yPred, { average: 'weighted' });
const f1 = f1Score(yTest, yPred, { average: 'weighted' });
const auc = rocAucScore(yTest, yProba.slice([null, 1]));

console.log('=== Model Evaluation ===');
console.log(`Accuracy: ${(acc * 100).toFixed(2)}%`);
console.log(`Precision: ${prec.toFixed(3)}`);
console.log(`Recall: ${rec.toFixed(3)}`);
console.log(`F1-Score: ${f1.toFixed(3)}`);
console.log(`ROC-AUC: ${auc.toFixed(3)}`);

// Confusion matrix
const cm = confusionMatrix(yTest, yPred);
console.log('\nConfusion Matrix:');
console.log(cm);

// Detailed report
const report = classificationReport(yTest, yPred);
console.log('\n' + report);
```
Metric Selection Guide
Imbalanced Classification
For imbalanced datasets, use:
Balanced Accuracy: accounts for class imbalance
F1-Score: harmonic mean of precision and recall
ROC-AUC: threshold-independent ranking metric
Precision-Recall AUC: better for severe imbalance
Avoid accuracy alone, as it can be misleading when one class dominates.
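To see why accuracy misleads, consider a hypothetical dataset with 95% negatives. The sketch below uses plain TypeScript arrays (not deepbox tensors) to show a do-nothing classifier scoring 95% accuracy while balanced accuracy exposes it as chance-level:

```typescript
// 95 negatives, 5 positives; the "model" always predicts class 0.
const yTrue = [...Array(95).fill(0), ...Array(5).fill(1)];
const yPred = Array(100).fill(0);

const acc = yTrue.filter((y, i) => y === yPred[i]).length / yTrue.length;

// Balanced accuracy: the mean of per-class recalls.
const recallFor = (cls: number) => {
  const idx = yTrue.map((y, i) => (y === cls ? i : -1)).filter(i => i >= 0);
  return idx.filter(i => yPred[i] === cls).length / idx.length;
};
const balAcc = (recallFor(0) + recallFor(1)) / 2;

console.log(acc);    // 0.95 - looks impressive
console.log(balAcc); // 0.5 - reveals a chance-level model
```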
Multi-class Averaging
For multi-class classification:
Use average: 'macro' for equal class importance
Use average: 'weighted' to account for class imbalance
Use average: 'micro' for global performance across all samples
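The difference matters most when classes are imbalanced. This plain-TypeScript sketch (toy labels, not deepbox calls) computes per-class recall by hand and contrasts macro and weighted averages:

```typescript
// Toy imbalanced 3-class problem: class 0 dominates.
const yTrue = [0, 0, 0, 0, 1, 2];
const yPred = [0, 0, 0, 0, 2, 2];

const classes = [0, 1, 2];
const recalls = classes.map(c => {
  const idx = yTrue.map((y, i) => (y === c ? i : -1)).filter(i => i >= 0);
  return idx.filter(i => yPred[i] === c).length / idx.length;
});
// Per-class recalls: [1, 0, 1] - class 1 is missed entirely

const macro = recalls.reduce((a, b) => a + b, 0) / classes.length;
const weighted = classes.reduce(
  (s, c, k) => s + recalls[k] * yTrue.filter(y => y === c).length / yTrue.length,
  0
);

console.log(macro);    // ≈ 0.667 - each class counts equally, so missing class 1 hurts
console.log(weighted); // ≈ 0.833 - dominated by the large class 0
```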
Regression
For regression, choose metrics based on your needs:
MSE/RMSE: penalizes large errors heavily
MAE: gives equal weight to all errors
R²: proportion of variance explained
MAPE: percentage-based and scale-independent
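The MSE-vs-MAE trade-off is easy to demonstrate: squaring magnifies large residuals. This plain-TypeScript sketch (toy data, independent of deepbox) compares both metrics before and after introducing a single outlier prediction:

```typescript
// Same predictions except one large miss: MSE reacts quadratically, MAE linearly.
const yTrue = [3, -0.5, 2, 7];
const good = [2.5, 0.0, 2, 8];
const bad = [2.5, 0.0, 2, 17]; // one outlier prediction (error of 10)

const mse = (t: number[], p: number[]) =>
  t.reduce((s, y, i) => s + (y - p[i]) ** 2, 0) / t.length;
const mae = (t: number[], p: number[]) =>
  t.reduce((s, y, i) => s + Math.abs(y - p[i]), 0) / t.length;

console.log(mse(yTrue, good), mae(yTrue, good)); // 0.375 0.5
console.log(mse(yTrue, bad), mae(yTrue, bad));   // 25.125 2.75 - MSE explodes
```

If large errors are especially costly in your application, that sensitivity is a feature; if outliers are noise, prefer MAE or median absolute error.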
Best Practices
Use multiple metrics to get a complete picture of model performance. No single metric tells the whole story.
For imbalanced datasets, focus on precision, recall, and F1-score rather than accuracy alone.
Visualize confusion matrices and ROC curves to understand where your model struggles.
Always evaluate on a held-out test set. Never use training data for evaluation.
Machine Learning: Train models to evaluate
Preprocessing: Cross-validation and splitting
Plotting: Visualize evaluation results
Learn More
API Reference: Complete API documentation
Tutorial: Model evaluation guide