Tumor Mutational Burden (TMB) Analysis with PyMut-Bio

This notebook demonstrates how to use the calculate_tmb_analysis method from PyMut-Bio to calculate Tumor Mutational Burden (TMB) and generate the corresponding analysis files.

What is TMB?

Tumor Mutational Burden (TMB) is the number of mutations found in a tumor, normalized by the size of the interrogated genome and reported in mutations per megabase (mutations/Mb). In oncology it is used as a biomarker, for example when assessing likely response to immunotherapy.
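As a minimal sketch of that normalization, assuming TMB is simply the mutation count divided by the interrogated size in megabases, the snippet below reproduces by hand one of the per-sample values reported later in this notebook. The helper name normalized_tmb is hypothetical and is not part of PyMut-Bio.

```python
def normalized_tmb(mutation_count: int, genome_size_bp: int) -> float:
    """Return mutations per megabase for one sample (illustrative helper)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    # Convert base pairs to megabases, then divide the count by that size
    return mutation_count / (genome_size_bp / 1_000_000)


# 15 mutations over the 60,456,963 bp used in this notebook
tmb_value = normalized_tmb(15, 60_456_963)
print(f"{tmb_value:.6f} mutations/Mb")
```

With these inputs the result agrees with the TMB_Total_Normalized value shown for sample TCGA-AB-2988 in the results table (0.248110).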

Initial Setup

In [1]:

import os
from pyMut.input import read_maf

print('✅ Modules imported successfully')

✅ Modules imported successfully

Load Example Data

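For this example, you will need a MAF (Mutation Annotation Format) file. You can use your own data or download example data from TCGA. The cell below points at the tcga_laml.maf.gz example under src/pyMut/data/examples/MAF; change maf_path if your file lives elsewhere.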

In [2]:

maf_path = "../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz"  # Replace with the path to your MAF file

# Check if the file exists
if not os.path.exists(maf_path):
    print(f"❌ File not found: {maf_path}")
    print("📝 Please specify the correct path to your MAF file in the 'maf_path' variable")
    print("💡 You can download example data from TCGA or use your own data")
else:
    print(f'📂 Loading file: {maf_path}')
    py_mutation = read_maf(maf_path, assembly="37")
    
    print("✅ Data loaded successfully")
    print(f"📊 Data shape: {py_mutation.data.shape}")
    print(f"👥 Number of samples: {len(py_mutation.samples)}")
    print(f"🧬 First 3 samples: {py_mutation.samples[:3]}")

2025-08-01 00:45:36,996 | INFO | pyMut.input | Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 00:45:36,998 | INFO | pyMut.input | Reading MAF with 'pyarrow' engine…
2025-08-01 00:45:37,007 | INFO | pyMut.input | Reading with 'pyarrow' completed.
2025-08-01 00:45:37,013 | INFO | pyMut.input | Detected 193 unique samples.
2025-08-01 00:45:37,098 | INFO | pyMut.input | Consolidating duplicate variants across samples...
2025-08-01 00:45:37,111 | INFO | pyMut.input | Consolidating variants using vectorized operations...

📂 Loading file: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz

2025-08-01 00:46:29,613 | INFO | pyMut.input | Variant consolidation completed in 52.51 seconds
2025-08-01 00:46:29,620 | INFO | pyMut.input | Consolidated 2207 rows into 2091 unique variants
2025-08-01 00:46:29,635 | INFO | pyMut.input | Saving to cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml.maf_8bfbda65c4b23428.parquet
2025-08-01 00:46:29,697 | INFO | pyMut.input | MAF processed successfully: 2091 rows, 216 columns in 52.70 seconds

✅ Data loaded successfully
📊 Data shape: (2091, 216)
👥 Number of samples: 193
🧬 First 3 samples: ['TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009']

Explore Variant Classification Columns

Before running the TMB analysis, let's see what variant classification columns are available:

In [3]:

# Search for variant classification columns
import re

pattern = re.compile(r'^(gencode_\d+_)?variant[_]?classification$', flags=re.IGNORECASE)
variant_cols = [col for col in py_mutation.data.columns if pattern.match(col)]

print("🔍 Variant classification columns found:")
if variant_cols:
    for i, col in enumerate(variant_cols, 1):
        print(f"  {i}. {col}")
else:
    print("  ❌ No variant classification columns found")

# Show some columns that contain 'variant' in the name
variant_like_cols = [col for col in py_mutation.data.columns if 'variant' in col.lower()]
print(f"\n🔍 Columns containing 'variant' ({len(variant_like_cols)}):")
for col in variant_like_cols[:5]:  # Show only the first 5
    print(f"  • {col}")

🔍 Variant classification columns found:
  1. Variant_Classification

🔍 Columns containing 'variant' (2):
  • Variant_Classification
  • Variant_Type
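
To see why only Variant_Classification is picked up by the first search, the sketch below applies the same regular expression from the cell above to a handful of column names. The names in the list are hypothetical examples chosen for illustration; only Variant_Classification and Variant_Type are known to exist in this dataset.

```python
import re

# Same pattern as in the cell above: an optional "gencode_<number>_" prefix,
# then "variant", an optional underscore, and "classification", case-insensitive
pattern = re.compile(r'^(gencode_\d+_)?variant[_]?classification$', flags=re.IGNORECASE)

# Hypothetical column names, for illustration only
candidates = [
    "Variant_Classification",
    "variantclassification",
    "gencode_19_variantClassification",
    "Variant_Type",
    "Variant_Classification_extra",
]

matches = {name: bool(pattern.match(name)) for name in candidates}
for name, matched in matches.items():
    print(f"{name}: {matched}")
```

The first three names match; Variant_Type and the name with a trailing suffix do not, because the pattern is anchored at both ends.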

Run TMB Analysis

Now we will run the mutational burden analysis. The method will generate two files:

- TMB_analysis.tsv: Per-sample analysis with mutation counts and normalized TMB
- TMB_statistics.tsv: Global statistics (mean, median, quartiles, etc.)

In [4]:

# Create directory for results
output_dir = "results_tmb"
os.makedirs(output_dir, exist_ok=True)

print(f"📁 Output directory: {output_dir}")

📁 Output directory: results_tmb

In [5]:

# Run TMB analysis
print("🧬 Running mutational burden analysis...")
print("⏳ This may take a few moments...")

try:
    # Run TMB analysis with standard configuration for WES
    results = py_mutation.calculate_tmb_analysis(
        genome_size_bp=60456963,  # Standard size for WES
        output_dir=output_dir,
        save_files=True
    )
    
    print("✅ TMB analysis completed successfully!")
    
except Exception as e:
    print(f"❌ Error during TMB analysis: {e}")
    results = None

2025-08-01 00:46:29,809 | INFO | pyMut.analysis.mutation_burden | Auto-detected variant classification column: Variant_Classification

🧬 Running mutational burden analysis...
⏳ This may take a few moments...

2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | TMB analysis saved to: results_tmb/TMB_analysis.tsv
2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | TMB statistics saved to: results_tmb/TMB_statistics.tsv
2025-08-01 00:46:45,054 | INFO | pyMut.analysis.mutation_burden | Analyzed 193 samples with 2091 total mutations
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | TMB ANALYSIS SUMMARY
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | • Total samples analyzed: 193
2025-08-01 00:46:45,055 | INFO | pyMut.analysis.mutation_burden | • Average total mutations per sample: 11.4
2025-08-01 00:46:45,056 | INFO | pyMut.analysis.mutation_burden | • Average non-synonymous mutations per sample: 9.0
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Average normalized TMB (total): 0.189147 mutations/Mb
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Average normalized TMB (non-synonymous): 0.148438 mutations/Mb
2025-08-01 00:46:45,057 | INFO | pyMut.analysis.mutation_burden | • Sample with highest TMB: TCGA-AB-3009
2025-08-01 00:46:45,058 | INFO | pyMut.analysis.mutation_burden |   - TMB value: 0.694709 mutations/Mb
2025-08-01 00:46:45,058 | INFO | pyMut.analysis.mutation_burden | • Sample with lowest TMB: TCGA-AB-2903
2025-08-01 00:46:45,059 | INFO | pyMut.analysis.mutation_burden |   - TMB value: 0.016541 mutations/Mb

✅ TMB analysis completed successfully!
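
The log lines above report that two TSV files were written. As a rough sketch of how such a file could be read back using only the standard library, the example below parses an in-memory stand-in rather than the real results_tmb/TMB_analysis.tsv. It assumes the file is tab-separated with a header row; the exact on-disk layout is not shown in this notebook, so treat the column names here as an assumption. The helper read_tmb_tsv is hypothetical.

```python
import csv
import io


def read_tmb_tsv(handle):
    """Parse a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for the saved file (assumed layout, illustrative subset)
sample_tsv = io.StringIO(
    "Sample\tTotal_Mutations\tNon_Synonymous_Mutations\n"
    "TCGA-AB-2988\t15\t13\n"
    "TCGA-AB-2869\t12\t8\n"
    "TCGA-AB-3009\t42\t34\n"
)

rows = read_tmb_tsv(sample_tsv)

# csv yields strings, so convert before comparing numerically
top = max(rows, key=lambda row: int(row["Total_Mutations"]))
print(f"{len(rows)} rows; highest count: {top['Sample']}")
```

For a real file, the same function could be called on an open file handle, for example with open(path, newline="") as handle.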

Explore Results

In [6]:

if results:
    # Get the results DataFrames
    analysis_df = results['analysis']
    statistics_df = results['statistics']
    
    print("📊 TMB ANALYSIS RESULTS")
    print("=" * 50)
    print(f"👥 Samples analyzed: {len(analysis_df)}")
    print(f"📈 Metrics calculated: {len(statistics_df)}")
    
    # Show the first rows of the per-sample analysis
    print("\n🔍 First 5 samples from analysis:")
    print("-" * 40)
    display(analysis_df.head())
    
else:
    print("❌ Could not obtain analysis results")

📊 TMB ANALYSIS RESULTS
==================================================
👥 Samples analyzed: 193
📈 Metrics calculated: 4

🔍 First 5 samples from analysis:
----------------------------------------

	Sample	Total_Mutations	Non_Synonymous_Mutations	TMB_Total_Normalized	TMB_Non_Synonymous_Normalized
0	TCGA-AB-2988	15	13	0.248110	0.215029
1	TCGA-AB-2869	12	8	0.198488	0.132326
2	TCGA-AB-3009	42	34	0.694709	0.562384
3	TCGA-AB-2830	17	13	0.281192	0.215029
4	TCGA-AB-2887	15	12	0.248110	0.198488

Global Statistics

In [7]:

if results:
    print("📈 TMB GLOBAL STATISTICS")
    print("=" * 40)
    display(statistics_df)
    
    # Show some key statistics
    print("\n🎯 KEY STATISTICS:")
    print("-" * 30)
    
    # Total normalized TMB
    tmb_total_stats = statistics_df[statistics_df['Metric'] == 'TMB_Total_Normalized'].iloc[0]
    print("🧬 Total Normalized TMB:")
    print(f"   • Mean: {tmb_total_stats['Mean']:.4f} mutations/Mb")
    print(f"   • Median: {tmb_total_stats['Median']:.4f} mutations/Mb")
    print(f"   • Range: {tmb_total_stats['Min']:.4f} - {tmb_total_stats['Max']:.4f} mutations/Mb")
    
    # Non-synonymous normalized TMB
    tmb_nonsyn_stats = statistics_df[statistics_df['Metric'] == 'TMB_Non_Synonymous_Normalized'].iloc[0]
    print("\n🎯 Non-Synonymous Normalized TMB:")
    print(f"   • Mean: {tmb_nonsyn_stats['Mean']:.4f} mutations/Mb")
    print(f"   • Median: {tmb_nonsyn_stats['Median']:.4f} mutations/Mb")
    print(f"   • Range: {tmb_nonsyn_stats['Min']:.4f} - {tmb_nonsyn_stats['Max']:.4f} mutations/Mb")

📈 TMB GLOBAL STATISTICS
========================================

	Metric	Count	Mean	Median	Min	Max	Q1	Q3	Std
0	Total_Mutations	193	11.435233	11.000000	1.000000	42.000000	6.000000	15.000000	6.752870
1	Non_Synonymous_Mutations	193	8.974093	9.000000	0.000000	34.000000	5.000000	12.000000	5.452862
2	TMB_Total_Normalized	193	0.189147	0.181948	0.016541	0.694709	0.099244	0.248110	0.111697
3	TMB_Non_Synonymous_Normalized	193	0.148438	0.148866	0.000000	0.562384	0.082703	0.198488	0.090194

🎯 KEY STATISTICS:
------------------------------
🧬 Total Normalized TMB:
   • Mean: 0.1891 mutations/Mb
   • Median: 0.1819 mutations/Mb
   • Range: 0.0165 - 0.6947 mutations/Mb

🎯 Non-Synonymous Normalized TMB:
   • Mean: 0.1484 mutations/Mb
   • Median: 0.1489 mutations/Mb
   • Range: 0.0000 - 0.5624 mutations/Mb
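
To make the columns of the statistics table concrete, here is a small sketch that computes the same kinds of summaries with Python's standard statistics module. It uses only the five Total_Mutations values shown in the per-sample preview, so the numbers will not match the 193-sample table. How PyMut-Bio computes these values internally is not shown in this notebook; the quartile method and the use of a sample standard deviation below are assumptions.

```python
import statistics

# Total_Mutations for the five samples shown in the preview table
counts = [15, 12, 42, 17, 15]

# Quartiles with linear interpolation over the inclusive range (assumed method)
q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")

summary = {
    "Count": len(counts),
    "Mean": statistics.mean(counts),
    "Median": statistics.median(counts),
    "Min": min(counts),
    "Max": max(counts),
    "Q1": q1,
    "Q3": q3,
    "Std": statistics.stdev(counts),  # sample standard deviation (assumed)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Even in this small subset the outlier (42 mutations) pulls the mean well above the median, which is one reason the table reports both.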

Important Notes

Genome size: The parameter genome_size_bp=60456963 (about 60.46 Mb) is the size this notebook uses for Whole Exome Sequencing (WES). Adjust this value to match the region your assay actually covers:
- WES: ~60 Mb
- WGS: ~3000 Mb
- Targeted panel: the size of the specific panel
Output files: The TSV files are saved in the directory given by output_dir and contain:
- TMB_analysis.tsv: Detailed per-sample analysis
- TMB_statistics.tsv: Summary statistics of the dataset
Interpretation: TMB values are expressed in mutations per megabase (mutations/Mb) and can be used as a biomarker of response to immunotherapy.
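
To illustrate why genome_size_bp should reflect the assay, the sketch below normalizes the same mutation count against different interrogated sizes. The WES value is the one used in this notebook and the WGS value follows the ~3000 Mb figure in the list above; the 1.5 Mb panel size and the helper name tmb_per_mb are purely hypothetical.

```python
# Interrogated sizes in base pairs
ASSAY_SIZES_BP = {
    "WES": 60_456_963,                  # value used in this notebook
    "WGS": 3_000_000_000,               # ~3000 Mb, as listed above
    "panel (hypothetical)": 1_500_000,  # made-up size, for illustration only
}


def tmb_per_mb(mutation_count, genome_size_bp):
    """Mutations per megabase for a given interrogated size (illustrative helper)."""
    return mutation_count / (genome_size_bp / 1_000_000)


mutation_count = 15
tmb_by_assay = {
    assay: tmb_per_mb(mutation_count, size_bp)
    for assay, size_bp in ASSAY_SIZES_BP.items()
}

for assay, value in tmb_by_assay.items():
    print(f"{assay}: {value:.4f} mutations/Mb")
```

The same 15 mutations give very different TMB values depending on the denominator, so a value computed with the wrong size is not comparable across assays.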