Tumor Mutational Burden (TMB) Analysis with PyMut-Bio¶
This notebook demonstrates how to use the calculate_tmb_analysis method from PyMut-Bio to calculate Tumor Mutational Burden (TMB) and generate the corresponding analysis files.
What is TMB?¶
Tumor Mutational Burden (TMB) is the number of somatic mutations detected in a tumor, normalized by the size of the interrogated genomic region and usually reported in mutations per megabase (mutations/Mb). It is an important biomarker in oncology.
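As a minimal sketch of the normalization described above (not PyMut-Bio's internal code), the per-sample value is the mutation count divided by the interrogated size in megabases. The function name `normalized_tmb` is hypothetical; the size of 60456963 bp is the WES value passed to the analysis later in this notebook.

```python
def normalized_tmb(mutation_count: int, genome_size_bp: int) -> float:
    """Return mutations per megabase for a single sample."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    # Convert base pairs to megabases before dividing
    return mutation_count / (genome_size_bp / 1_000_000)


# 15 mutations over the ~60.46 Mb region used in this notebook
print(f"{normalized_tmb(15, 60456963):.6f} mutations/Mb")  # prints 0.248110 mutations/Mb
```

The result agrees with the TMB_Total_Normalized value reported below for sample TCGA-AB-2988, which has 15 mutations.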
Initial Setup¶
import os
from pyMut.input import read_maf
print('✅ Modules imported successfully')
✅ Modules imported successfully
Load Example Data¶
For this example, you will need a MAF file. You can use your own data or download example data from TCGA.
maf_path = "../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz" # Replace with the path to your MAF file
# Check if the file exists
if not os.path.exists(maf_path):
    print(f"❌ File not found: {maf_path}")
    print("📝 Please specify the correct path to your MAF file in the 'maf_path' variable")
    print("💡 You can download example data from TCGA or use your own data")
else:
    print(f'📂 Loading file: {maf_path}')
    py_mutation = read_maf(maf_path, assembly="37")
    print("✅ Data loaded successfully")
    print(f"📊 Data shape: {py_mutation.data.shape}")
    print(f"👥 Number of samples: {len(py_mutation.samples)}")
    print(f"🧬 First 3 samples: {py_mutation.samples[:3]}")
2025-08-01 00:45:36,996 | INFO | pyMut.input | Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 00:45:36,998 | INFO | pyMut.input | Reading MAF with 'pyarrow' engine…
2025-08-01 00:45:37,007 | INFO | pyMut.input | Reading with 'pyarrow' completed.
2025-08-01 00:45:37,013 | INFO | pyMut.input | Detected 193 unique samples.
2025-08-01 00:45:37,098 | INFO | pyMut.input | Consolidating duplicate variants across samples...
2025-08-01 00:45:37,111 | INFO | pyMut.input | Consolidating variants using vectorized operations...
📂 Loading file: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 00:46:29,613 | INFO | pyMut.input | Variant consolidation completed in 52.51 seconds
2025-08-01 00:46:29,620 | INFO | pyMut.input | Consolidated 2207 rows into 2091 unique variants
2025-08-01 00:46:29,635 | INFO | pyMut.input | Saving to cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml.maf_8bfbda65c4b23428.parquet
2025-08-01 00:46:29,697 | INFO | pyMut.input | MAF processed successfully: 2091 rows, 216 columns in 52.70 seconds
✅ Data loaded successfully
📊 Data shape: (2091, 216)
👥 Number of samples: 193
🧬 First 3 samples: ['TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009']
Explore Variant Classification Columns¶
Before running the TMB analysis, let's see what variant classification columns are available:
# Search for variant classification columns
import re
pattern = re.compile(r'^(gencode_\d+_)?variant[_]?classification$', flags=re.IGNORECASE)
variant_cols = [col for col in py_mutation.data.columns if pattern.match(col)]
print("🔍 Variant classification columns found:")
if variant_cols:
    for i, col in enumerate(variant_cols, 1):
        print(f" {i}. {col}")
else:
    print(" ❌ No variant classification columns found")
# Show some columns that contain 'variant' in the name
variant_like_cols = [col for col in py_mutation.data.columns if 'variant' in col.lower()]
print(f"\n🔍 Columns containing 'variant' ({len(variant_like_cols)}):")
for col in variant_like_cols[:5]: # Show only the first 5
    print(f" • {col}")
🔍 Variant classification columns found:
 1. Variant_Classification

🔍 Columns containing 'variant' (2):
 • Variant_Classification
 • Variant_Type
Run TMB Analysis¶
Now we will run the mutational burden analysis. The method will generate two files:
- TMB_analysis.tsv: Per-sample analysis with mutation counts and normalized TMB
- TMB_statistics.tsv: Global statistics (mean, median, quartiles, etc.)
# Create directory for results
output_dir = "results_tmb"
os.makedirs(output_dir, exist_ok=True)
print(f"📁 Output directory: {output_dir}")
📁 Output directory: results_tmb
# Run TMB analysis
print("🧬 Running mutational burden analysis...")
print("⏳ This may take a few moments...")
try:
    # Run TMB analysis with standard configuration for WES
    results = py_mutation.calculate_tmb_analysis(
        genome_size_bp=60456963, # Standard size for WES
        output_dir=output_dir,
        save_files=True
    )
    print("✅ TMB analysis completed successfully!")
except Exception as e:
    print(f"❌ Error during TMB analysis: {e}")
    results = None
2025-08-01 00:46:29,809 | INFO | pyMut.analysis.mutation_burden | Auto-detected variant classification column: Variant_Classification
🧬 Running mutational burden analysis...
⏳ This may take a few moments...
2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | TMB analysis saved to: results_tmb/TMB_analysis.tsv
2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | TMB statistics saved to: results_tmb/TMB_statistics.tsv
2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | Analyzed 193 samples with 2091 total mutations
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | TMB ANALYSIS SUMMARY
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | • Total samples analyzed: 193
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | • Average total mutations per sample: 11.4
2025-08-01 00:46:45,056 | INFO | pyMut.analysis.mutation_burden | • Average non-synonymous mutations per sample: 9.0
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Average normalized TMB (total): 0.189147 mutations/Mb
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Average normalized TMB (non-synonymous): 0.148438 mutations/Mb
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Sample with highest TMB: TCGA-AB-3009
2025-08-01 00:46:45,058 | INFO | pyMut.analysis.mutation_burden | - TMB value: 0.694709 mutations/Mb
2025-08-01 00:46:45,058 | INFO | pyMut.analysis.mutation_burden | • Sample with lowest TMB: TCGA-AB-2903
2025-08-01 00:46:45,059 | INFO | pyMut.analysis.mutation_burden | - TMB value: 0.016541 mutations/Mb
✅ TMB analysis completed successfully!
Explore Results¶
if results:
    # Get the results DataFrames
    analysis_df = results['analysis']
    statistics_df = results['statistics']
    print("📊 TMB ANALYSIS RESULTS")
    print("=" * 50)
    print(f"👥 Samples analyzed: {len(analysis_df)}")
    print(f"📈 Metrics calculated: {len(statistics_df)}")
    # Show the first rows of the per-sample analysis
    print("\n🔍 First 5 samples from analysis:")
    print("-" * 40)
    display(analysis_df.head())
else:
    print("❌ Could not obtain analysis results")
📊 TMB ANALYSIS RESULTS
==================================================
👥 Samples analyzed: 193
📈 Metrics calculated: 4

🔍 First 5 samples from analysis:
----------------------------------------
| | Sample | Total_Mutations | Non_Synonymous_Mutations | TMB_Total_Normalized | TMB_Non_Synonymous_Normalized |
|---|---|---|---|---|---|
| 0 | TCGA-AB-2988 | 15 | 13 | 0.248110 | 0.215029 |
| 1 | TCGA-AB-2869 | 12 | 8 | 0.198488 | 0.132326 |
| 2 | TCGA-AB-3009 | 42 | 34 | 0.694709 | 0.562384 |
| 3 | TCGA-AB-2830 | 17 | 13 | 0.281192 | 0.215029 |
| 4 | TCGA-AB-2887 | 15 | 12 | 0.248110 | 0.198488 |
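The per-sample results can also be inspected without pandas. The sketch below parses tab-separated text using the column names shown above and picks the sample with the highest total normalized TMB. It assumes that TMB_analysis.tsv uses these same column headers, which this notebook does not confirm; the inline text holds only the five rows displayed above, and `highest_tmb_sample` is a hypothetical helper, not part of PyMut-Bio.

```python
import csv
import io

# Five rows copied from the table above. A real run would open
# results_tmb/TMB_analysis.tsv instead (assumed to share these headers).
TSV_TEXT = (
    "Sample\tTotal_Mutations\tNon_Synonymous_Mutations\t"
    "TMB_Total_Normalized\tTMB_Non_Synonymous_Normalized\n"
    "TCGA-AB-2988\t15\t13\t0.248110\t0.215029\n"
    "TCGA-AB-2869\t12\t8\t0.198488\t0.132326\n"
    "TCGA-AB-3009\t42\t34\t0.694709\t0.562384\n"
    "TCGA-AB-2830\t17\t13\t0.281192\t0.215029\n"
    "TCGA-AB-2887\t15\t12\t0.248110\t0.198488\n"
)


def highest_tmb_sample(handle, column="TMB_Total_Normalized"):
    """Return (sample, value) for the row with the largest value in `column`."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        raise ValueError("no rows found")
    best = max(rows, key=lambda row: float(row[column]))
    return best["Sample"], float(best[column])


sample, value = highest_tmb_sample(io.StringIO(TSV_TEXT))
print(sample, value)  # prints TCGA-AB-3009 0.694709
```

Passing an open file handle instead of the `io.StringIO` object would apply the same logic to a saved file.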
Global Statistics¶
if results:
    print("📈 TMB GLOBAL STATISTICS")
    print("=" * 40)
    display(statistics_df)
    # Show some key statistics
    print("\n🎯 KEY STATISTICS:")
    print("-" * 30)
    # Total normalized TMB
    tmb_total_stats = statistics_df[statistics_df['Metric'] == 'TMB_Total_Normalized'].iloc[0]
    print("🧬 Total Normalized TMB:")
    print(f" • Mean: {tmb_total_stats['Mean']:.4f} mutations/Mb")
    print(f" • Median: {tmb_total_stats['Median']:.4f} mutations/Mb")
    print(f" • Range: {tmb_total_stats['Min']:.4f} - {tmb_total_stats['Max']:.4f} mutations/Mb")
    # Non-synonymous normalized TMB
    tmb_nonsyn_stats = statistics_df[statistics_df['Metric'] == 'TMB_Non_Synonymous_Normalized'].iloc[0]
    print("\n🎯 Non-Synonymous Normalized TMB:")
    print(f" • Mean: {tmb_nonsyn_stats['Mean']:.4f} mutations/Mb")
    print(f" • Median: {tmb_nonsyn_stats['Median']:.4f} mutations/Mb")
    print(f" • Range: {tmb_nonsyn_stats['Min']:.4f} - {tmb_nonsyn_stats['Max']:.4f} mutations/Mb")
📈 TMB GLOBAL STATISTICS
========================================
| | Metric | Count | Mean | Median | Min | Max | Q1 | Q3 | Std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Total_Mutations | 193 | 11.435233 | 11.000000 | 1.000000 | 42.000000 | 6.000000 | 15.000000 | 6.752870 |
| 1 | Non_Synonymous_Mutations | 193 | 8.974093 | 9.000000 | 0.000000 | 34.000000 | 5.000000 | 12.000000 | 5.452862 |
| 2 | TMB_Total_Normalized | 193 | 0.189147 | 0.181948 | 0.016541 | 0.694709 | 0.099244 | 0.248110 | 0.111697 |
| 3 | TMB_Non_Synonymous_Normalized | 193 | 0.148438 | 0.148866 | 0.000000 | 0.562384 | 0.082703 | 0.198488 | 0.090194 |
🎯 KEY STATISTICS:
------------------------------
🧬 Total Normalized TMB:
 • Mean: 0.1891 mutations/Mb
 • Median: 0.1819 mutations/Mb
 • Range: 0.0165 - 0.6947 mutations/Mb

🎯 Non-Synonymous Normalized TMB:
 • Mean: 0.1484 mutations/Mb
 • Median: 0.1489 mutations/Mb
 • Range: 0.0000 - 0.5624 mutations/Mb
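For intuition about the columns of the statistics table, here is a sketch that computes the same kinds of summaries with Python's standard `statistics` module, using the five Total_Mutations values displayed in the per-sample table. It is not PyMut-Bio's implementation: the quartile method (`inclusive`) and the use of the sample standard deviation are assumptions, so figures on the full dataset could differ from the library's.

```python
import statistics


def summarize(values):
    """Summarize a list of numbers with the metrics used in the statistics table."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Count": len(values),
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Min": min(values),
        "Max": max(values),
        "Q1": q1,
        "Q3": q3,
        "Std": statistics.stdev(values),  # sample standard deviation (assumption)
    }


# Total_Mutations for the five samples displayed earlier
total_mutations = [15, 12, 42, 17, 15]
summary = summarize(total_mutations)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With only five samples these numbers are illustrative and are not expected to match the 193-sample table above.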
Important Notes¶
Genome size: The parameter genome_size_bp=60456963 corresponds to the standard size for Whole Exome Sequencing (WES). Adjust this value according to your sequencing type:
- WES: ~60 Mb
- WGS: ~3000 Mb
- Targeted panel: specific panel size
Output files: The TSV files are saved in the specified directory and contain:
- TMB_analysis.tsv: Detailed per-sample analysis
- TMB_statistics.tsv: Summary statistics of the dataset
Interpretation: TMB values are expressed in mutations per megabase (mutations/Mb) and can be used as a biomarker for predicting response to immunotherapy.
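The genome-size note above can be turned into a small lookup. The helper below is hypothetical (not part of PyMut-Bio) and uses the approximate sizes listed in these notes; a targeted panel has no default, so the caller must supply its size.

```python
# Approximate sizes taken from the notes above; adjust them to your assay.
DEFAULT_GENOME_SIZE_BP = {
    "WES": 60_456_963,     # value used in this notebook
    "WGS": 3_000_000_000,  # ~3000 Mb
}


def pick_genome_size_bp(assay, panel_size_bp=None):
    """Return the size in base pairs to use as the TMB denominator."""
    assay = assay.upper()
    if assay == "PANEL":
        if panel_size_bp is None or panel_size_bp <= 0:
            raise ValueError("a targeted panel needs an explicit positive panel_size_bp")
        return panel_size_bp
    if assay not in DEFAULT_GENOME_SIZE_BP:
        raise ValueError(f"unknown assay: {assay!r}")
    return DEFAULT_GENOME_SIZE_BP[assay]


print(pick_genome_size_bp("WES"))
print(pick_genome_size_bp("panel", panel_size_bp=1_200_000))  # hypothetical panel size
```

The returned value can then be passed as genome_size_bp to calculate_tmb_analysis.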