While TRIFID achieves high prediction accuracy, understanding why the model makes specific predictions is crucial for:
Biological insight: Identifying which molecular features drive isoform functionality
Model validation: Ensuring predictions align with biological expectations
Hypothesis generation: Discovering unexpected patterns in isoform regulation
Trust and adoption: Building confidence in ML predictions for experimental follow-up
TRIFID uses SHAP (SHapley Additive exPlanations) to provide transparent, interpretable explanations for every prediction.
SHAP is a game-theoretic approach to explain ML model outputs. It assigns each feature an importance value for a particular prediction, showing how much each feature contributed to pushing the prediction above or below the baseline.
TRIFID uses SHAP’s TreeExplainer, which is optimized for tree-based models:
# From trifid/models/interpret.py:191-206
@property
def shap(self):
    """Calculate SHAP values for feature importance"""
    explainer = shap.TreeExplainer(self.model)
    shap_values = explainer.shap_values(self.train_features)

    # Mean absolute SHAP value per feature
    vals = np.abs(shap_values).mean(0)
    std_vals = np.abs(shap_values).std(0)

    df = pd.DataFrame(
        list(zip(self.train_features.columns, vals, std_vals)),
        columns=['feature', 'values_mean', 'values_std']
    )
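The excerpt takes absolute values before averaging so that positive and negative contributions do not cancel each other out. The standard-library sketch below reproduces that aggregation on a small invented SHAP matrix. The feature names, the numbers, and the helper `summarize_shap` are hypothetical and are not part of TRIFID.

```python
from statistics import mean, pstdev

# Hypothetical SHAP matrix: one row per transcript, one column per feature.
# The numbers are invented for illustration and are not TRIFID output.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_matrix = [
    [0.10, -0.02, 0.00],
    [-0.06, 0.04, 0.01],
    [0.08, -0.03, -0.01],
]


def summarize_shap(matrix, names):
    """Return (feature, mean |SHAP|, std |SHAP|) tuples, largest mean first."""
    summary = []
    for col, name in enumerate(names):
        magnitudes = [abs(row[col]) for row in matrix]
        # pstdev is the population standard deviation, like NumPy's default std().
        summary.append((name, mean(magnitudes), pstdev(magnitudes)))
    return sorted(summary, key=lambda item: item[1], reverse=True)


ranking = summarize_shap(shap_matrix, feature_names)
for name, avg, spread in ranking:
    print(f"{name}: mean|SHAP|={avg:.3f} std|SHAP|={spread:.3f}")
```

A plain mean of the signed values for `feature_a` would be 0.04 rather than 0.08, which is why a global ranking is normally built on magnitudes.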
What are SHAP values?
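SHAP values decompose a prediction into contributions from each feature:
Base value + Feature 1 contribution + Feature 2 contribution + … = Final prediction
Properties:
Additive (local accuracy): The base value plus the sum of a sample's SHAP values equals the model output for that sample
Consistency: If a model changes so that a feature's marginal contribution increases or stays the same, that feature's SHAP value does not decrease
Local explanations: SHAP values are computed per prediction, so they explain individual isoforms as well as global patterns
Missingness: A feature that is absent from the explanation's simplified input receives a SHAP value of 0 (a missing value in the data is not automatically assigned 0)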
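The additive property can be checked by hand on a toy model. The sketch below computes exact Shapley values by enumerating every feature coalition, which is only feasible for a handful of features. TreeExplainer reaches the same kind of attribution with a much faster tree-specific algorithm and a background dataset rather than the single all-zero baseline assumed here. The model, instance, and baseline are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions of the other features."""
    phis = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Classic Shapley weight: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi += weight * (value_fn(coalition | {i}) - value_fn(coalition))
        phis.append(phi)
    return phis


# Hypothetical instance and baseline; features outside the coalition
# are replaced by their baseline value.
instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def toy_model(x):
    # One additive term plus one interaction term.
    return 0.5 * x[0] + 0.1 * x[1] * x[2]


def value_fn(coalition):
    mixed = [instance[j] if j in coalition else baseline[j]
             for j in range(len(instance))]
    return toy_model(mixed)


phis = shapley_values(value_fn, len(instance))
base_value = value_fn(frozenset())
prediction = toy_model(instance)

# Additivity: base value + sum of contributions reproduces the prediction.
assert abs(base_value + sum(phis) - prediction) < 1e-9
print(phis, base_value, prediction)
```

The interaction term is split equally between the two features that take part in it, while the purely additive feature keeps its own contribution.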
Example: For a transcript with trifid_score = 0.82:
Top 10 Most Important Features (typical TRIFID model):
norm_corsair (0.082) - Cross-species conservation
length_delta_score (0.071) - Length similarity to principal
norm_ScorePerCodon (0.065) - PhyloCSF coding signal
norm_spade (0.058) - Pfam domain integrity
CCDS (0.052) - Consensus coding sequence
tsl_1 (0.048) - Highest transcript support
norm_firestar (0.043) - Functional residues
pfam_score (0.039) - Domain coverage
norm_RNA2sj_cds (0.035) - Junction support (human)
perc_Lost_State (0.031) - Domain loss percentage
Evolutionary conservation features (CORSAIR, PhyloCSF) rank among the top predictors, consistent with the principle that functional isoforms are preferentially conserved across species.
Use local_explanation() to understand why TRIFID predicted a specific isoform as functional or non-functional. This is invaluable for generating testable hypotheses.
# From trifid/models/interpret.py:307-319
elif idx == 'gene_name':
    # Explain all isoforms of a gene
    explain_gene = {}
    for i in range(0, len(df_sample)):
        explain_gene[df_sample.index[i][1]] = np.abs(
            explainer.shap_values(df_sample.iloc[i])
        ).mean(0)

    df = pd.DataFrame(explain_gene).T
    df['sum'] = df.T.sum()  # Total SHAP magnitude per isoform
    df = df.sort_values(by='sum', ascending=False)
Use case: Compare SHAP patterns across all isoforms of a gene to identify which features differ between functional and non-functional variants.
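One way to read such a per-gene table is to rank isoforms by total SHAP magnitude and then look for the features whose magnitude varies most between isoforms. The sketch below does this with plain dictionaries. The isoform labels, feature names, values, and both helper functions are hypothetical. Because the excerpt above takes absolute values, a table like this shows how strongly a feature influenced each isoform, not in which direction.

```python
# Hypothetical mean |SHAP| per feature for two isoforms of one gene.
isoform_shap = {
    "isoform_1": {"feature_a": 0.09, "feature_b": 0.04, "feature_c": 0.01},
    "isoform_2": {"feature_a": 0.02, "feature_b": 0.05, "feature_c": 0.01},
}


def rank_isoforms(table):
    """Order isoforms by total SHAP magnitude, largest first."""
    totals = {iso: sum(values.values()) for iso, values in table.items()}
    return sorted(totals, key=totals.get, reverse=True)


def feature_spread(table):
    """Order features by (max - min) magnitude across isoforms, largest first."""
    features = list(next(iter(table.values())))
    spread = {}
    for feature in features:
        column = [values[feature] for values in table.values()]
        spread[feature] = max(column) - min(column)
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)


print(rank_isoforms(isoform_shap))
print(feature_spread(isoform_shap))
```

Features at the top of the spread list are the natural starting point when asking why two isoforms of the same gene score differently.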
|SHAP| > 0.10: Dominant feature, major contributor
|SHAP| 0.05-0.10: Important feature, moderate effect
|SHAP| 0.01-0.05: Minor feature, small effect
|SHAP| < 0.01: Negligible feature, minimal impact
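These bands can be applied programmatically when scanning many features. The helper below is a hypothetical sketch. The source does not say which band the exact boundary values belong to, so it assumes that 0.10 counts as "important", 0.05 as "important", and 0.01 as "minor".

```python
def shap_magnitude_label(shap_value):
    """Map a SHAP value to the magnitude bands listed above.

    Boundary handling is an assumption: only values strictly above 0.10
    are 'dominant', and each lower bound is inclusive.
    """
    magnitude = abs(shap_value)
    if magnitude > 0.10:
        return "dominant"
    if magnitude >= 0.05:
        return "important"
    if magnitude >= 0.01:
        return "minor"
    return "negligible"


for value in (0.12, -0.07, 0.02, 0.005):
    print(value, shap_magnitude_label(value))
```

The sign is discarded on purpose: a strongly negative contribution is just as influential as a strongly positive one.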
SHAP values are contributions, not feature values. A feature can have a high value but low SHAP (if it’s similar across isoforms) or low value but high SHAP (if it’s discriminative).
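The difference between a feature's value and its contribution is easiest to see in a linear model with independent features, where the SHAP value has the closed form weight × (value − background mean). TRIFID is tree-based, so this formula does not describe its explanations. The sketch below is only an illustration, and the weights, background means, and sample are invented.

```python
# Hypothetical linear model: score = sum(weight * value) + intercept.
weights = {"feature_a": 0.5, "feature_b": 0.5}
background_mean = {"feature_a": 0.90, "feature_b": 0.20}
sample = {"feature_a": 0.95, "feature_b": 0.60}


def linear_shap(weights, mean, x):
    """SHAP values for a linear model, assuming independent features."""
    return {name: weights[name] * (x[name] - mean[name]) for name in weights}


contributions = linear_shap(weights, background_mean, sample)
for name in weights:
    print(f"{name}: value={sample[name]:.2f} shap={contributions[name]:.3f}")
```

Here `feature_a` has the higher value but sits close to its background mean, so it contributes little, while the lower-valued `feature_b` is far from its mean and contributes much more.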
Finding: Evolutionary features (CORSAIR, PhyloCSF) have some of the highest SHAP values.
Interpretation: Functional isoforms are under purifying selection and conserved across species. This is consistent with the biological principle that function implies constraint.
Finding: length_delta_score ranks 2nd in importance.
Interpretation: Truncated isoforms lacking large portions of the principal isoform are likely non-functional. However, small length differences may be functionally neutral.
Finding: SPADE, pfam_score, and domain loss features are highly important.
Interpretation: Alternative splicing that damages or removes functional domains strongly indicates non-functionality.
Finding: CCDS, TSL, and basic tag contribute significantly.
Interpretation: Well-annotated, high-confidence transcripts are more likely functional, reflecting curation bias toward functionally important isoforms.
Finding: RNA2sj_cds has moderate importance for human.
Interpretation: Splice junctions with strong RNA-seq support are more likely real, but low support doesn’t necessarily mean the isoform is non-functional (it could be tissue-specific or rare).
# Explain all isoforms of BRCA1
gene_explanation = interpreter.local_explanation(
    df_features=full_database,
    sample='BRCA1'
)

# Shows which features differ between isoforms
print(gene_explanation.T.sort_values(by='sum', ascending=False))
Use case: Identify which alternative isoform is most likely functional and why it differs from others.