PyMutation Filtering Methods Example
This notebook demonstrates the various filtering methods available in PyMutation:
- filter_by_chrom_sample: Filter by chromosome and/or sample
- region: Filter by genomic coordinates
- gen_region: Filter by gene name
- pass_filter: Check if specific records have FILTER == "PASS"
import os
from IPython.display import display
from pyMut.input import read_maf
Load TCGA LAML Dataset
# Load real TCGA LAML data
maf_path = os.path.join('..', '..', '..', 'src', 'pyMut', 'data', 'examples', 'MAF','tcga_laml.maf.gz')
# TCGA data is typically based on GRCh37 assembly
py_mut = read_maf(maf_path, assembly="37")
print(f"Loaded TCGA LAML data: {len(py_mut.data)} variants")
print(f"Unique genes: {py_mut.data['Hugo_Symbol'].nunique()}")
print(f"Unique samples: {py_mut.data['Tumor_Sample_Barcode'].nunique()}")
print(f"Chromosomes present: {sorted(py_mut.data['CHROM'].unique())}")
# Display first few rows
print("\nFirst 5 rows of the dataset:")
display(py_mut.data.head())
2025-08-01 02:02:53,078 | INFO | pyMut.input | Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 02:02:53,079 | INFO | pyMut.input | Loading from cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml.maf_8bfbda65c4b23428.parquet
2025-08-01 02:02:53,105 | INFO | pyMut.input | Cache loaded successfully in 0.03 seconds
Loaded TCGA LAML data: 2091 variants
Unique genes: 1611
Unique samples: 190
Chromosomes present: ['X', 'chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9']

First 5 rows of the dataset:
| | CHROM | POS | ID | REF | ALT | QUAL | FILTER | TCGA-AB-2988 | TCGA-AB-2869 | TCGA-AB-3009 | ... | Strand | Variant_Classification | Variant_Type | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Tumor_Sample_Barcode | Protein_Change | i_TumorVAF_WU | i_transcript_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr9 | 100077177 | . | T | C | . | . | T\|T | T\|T | T\|T | ... | + | SILENT | SNP | T | T | C | TCGA-AB-2886 | p.T431T | 9.76 | NM_020893.1 |
1 | chr9 | 100085148 | . | G | A | . | . | G\|G | G\|G | G\|G | ... | + | MISSENSE_MUTATION | SNP | G | G | A | TCGA-AB-2917 | p.R581H | 18.4 | NM_020893.1 |
2 | chr9 | 100971322 | . | A | C | . | . | A\|A | A\|A | A\|A | ... | + | MISSENSE_MUTATION | SNP | A | A | C | TCGA-AB-2841 | p.L593R | 45.83 | NM_018421.3 |
3 | chr9 | 104086335 | . | C | T | . | . | C\|C | C\|C | C\|C | ... | + | MISSENSE_MUTATION | SNP | C | C | T | TCGA-AB-2877 | p.T325I | 37.12 | NM_017753.2 |
4 | chr9 | 104124840 | . | G | A | . | . | G\|A | G\|G | G\|G | ... | + | MISSENSE_MUTATION | SNP | G | G | A | TCGA-AB-2988 | p.T376M | 48.35 | NM_001701.1 |
5 rows × 216 columns
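The chromosome list printed above mixes two naming styles: most labels carry a chr prefix, while X does not. A minimal sketch of how such a mix could be spotted before filtering, using only the standard library; the helper name unprefixed_labels is hypothetical and not part of pyMut:

```python
def unprefixed_labels(labels, prefix="chr"):
    """Return the labels that do not start with the given prefix, sorted."""
    return sorted(label for label in labels if not label.startswith(prefix))


# A few labels copied from the output above (abbreviated list).
chromosomes = ["X", "chr1", "chr10", "chr17", "chr9"]

print(unprefixed_labels(chromosomes))  # ['X']
```

With a real dataset, the input could be the unique values of the CHROM column.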
1. Chromosome and Sample Filtering (filter_by_chrom_sample)
This method allows filtering by chromosome and/or sample. It comes from chrom_sample_filter.py.
print("=== Chromosome and Sample Filtering Examples ===")
# Example 1: Filter by chromosome only
print("\n1. Filter by chromosome 17:")
filtered_chr17 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Original variants: {len(py_mut.data)}")
print(f"Chromosome 17 variants: {len(filtered_chr17.data)}")
# Example 2: Filter by multiple chromosomes
print("\n2. Filter by chromosomes 17 and X:")
filtered_multi_chr = py_mut.filter_by_chrom_sample(chrom=['17', 'X'])
print(f"Chromosomes 17 and X variants: {len(filtered_multi_chr.data)}")
# Example 3: Filter by sample (get first few samples)
sample_list = py_mut.data['Tumor_Sample_Barcode'].unique()[:3].tolist()
print(f"\n3. Filter by first 3 samples: {sample_list}")
filtered_samples = py_mut.filter_by_chrom_sample(sample=sample_list)
print(f"Filtered by samples variants: {len(filtered_samples.data)}")
print(f"Unique samples in filtered data: {filtered_samples.data['Tumor_Sample_Barcode'].nunique()}")
# Example 4: Combined filtering (chromosome + sample)
print("\n4. Combined filter - chromosome 17 + first sample:")
filtered_combined = py_mut.filter_by_chrom_sample(chrom='17', sample=sample_list[0])
print(f"Combined filter variants: {len(filtered_combined.data)}")
2025-08-01 02:02:53,152 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17'] 2025-08-01 02:02:53,156 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17 2025-08-01 02:02:53,156 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17 2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091 2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99 2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 1992 2025-08-01 02:02:53,157 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17 2025-08-01 02:02:53,160 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17', 'chrX'] 2025-08-01 02:02:53,165 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17,chrX 2025-08-01 02:02:53,166 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17,chrX 2025-08-01 02:02:53,166 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091 2025-08-01 02:02:53,167 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 205 2025-08-01 02:02:53,167 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 1886 2025-08-01 02:02:53,167 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17,chrX 2025-08-01 02:02:53,169 | INFO | pyMut.filters.chrom_sample_filter | Samples to filter: ['TCGA-AB-2886', 'TCGA-AB-2917', 'TCGA-AB-2841'] 2025-08-01 02:02:53,170 | INFO | pyMut.filters.chrom_sample_filter | Using MAF-style filtering with column 'Tumor_Sample_Barcode' 2025-08-01 02:02:53,173 | INFO | pyMut.filters.chrom_sample_filter | Sample columns kept: ['TCGA-AB-2886', 'TCGA-AB-2917', 'TCGA-AB-2841'] 2025-08-01 02:02:53,173 | INFO | pyMut.filters.chrom_sample_filter | Sample columns removed: ['TCGA-AB-2988', 'TCGA-AB-2869', 
'TCGA-AB-3009', 'TCGA-AB-2830', 'TCGA-AB-2887', 'TCGA-AB-2920', 'TCGA-AB-2934', 'TCGA-AB-2905', 'TCGA-AB-2999', 'TCGA-AB-2898', 'TCGA-AB-2950', 'TCGA-AB-2923', 'TCGA-AB-2847', 'TCGA-AB-2973', 'TCGA-AB-2931', 'TCGA-AB-2936', 'TCGA-AB-2854', 'TCGA-AB-2906', 'TCGA-AB-2819', 'TCGA-AB-2894', 'TCGA-AB-2945', 'TCGA-AB-2913', 'TCGA-AB-2996', 'TCGA-AB-2952', 'TCGA-AB-2805', 'TCGA-AB-2833', 'TCGA-AB-2862', 'TCGA-AB-2890', 'TCGA-AB-2832', 'TCGA-AB-2911', 'TCGA-AB-2912', 'TCGA-AB-3001', 'TCGA-AB-2910', 'TCGA-AB-2992', 'TCGA-AB-2901', 'TCGA-AB-2822', 'TCGA-AB-2964', 'TCGA-AB-2915', 'TCGA-AB-2807', 'TCGA-AB-2997', 'TCGA-AB-2926', 'TCGA-AB-2897', 'TCGA-AB-2927', 'TCGA-AB-2895', 'TCGA-AB-2929', 'TCGA-AB-2899', 'TCGA-AB-2882', 'TCGA-AB-2935', 'TCGA-AB-2907', 'TCGA-AB-2853', 'TCGA-AB-2889', 'TCGA-AB-2900', 'TCGA-AB-2976', 'TCGA-AB-2990', 'TCGA-AB-2984', 'TCGA-AB-2802', 'TCGA-AB-2864', 'TCGA-AB-2838', 'TCGA-AB-2966', 'TCGA-AB-2903', 'TCGA-AB-2817', 'TCGA-AB-2921', 'TCGA-AB-2959', 'TCGA-AB-2846', 'TCGA-AB-2930', 'TCGA-AB-2888', 'TCGA-AB-2994', 'TCGA-AB-2828', 'TCGA-AB-3002', 'TCGA-AB-2861', 'TCGA-AB-2908', 'TCGA-AB-2991', 'TCGA-AB-2922', 'TCGA-AB-2916', 'TCGA-AB-2874', 'TCGA-AB-2980', 'TCGA-AB-2972', 'TCGA-AB-2821', 'TCGA-AB-2885', 'TCGA-AB-2870', 'TCGA-AB-2814', 'TCGA-AB-2804', 'TCGA-AB-2865', 'TCGA-AB-2978', 'TCGA-AB-2943', 'TCGA-AB-3005', 'TCGA-AB-3006', 'TCGA-AB-2827', 'TCGA-AB-2956', 'TCGA-AB-2857', 'TCGA-AB-2813', 'TCGA-AB-2924', 'TCGA-AB-2968', 'TCGA-AB-2970', 'TCGA-AB-2963', 'TCGA-AB-2925', 'TCGA-AB-2971', 'TCGA-AB-2904', 'TCGA-AB-2875', 'TCGA-AB-2985', 'TCGA-AB-2876', 'TCGA-AB-2891', 'TCGA-AB-2851', 'TCGA-AB-2831', 'TCGA-AB-2939', 'TCGA-AB-2858', 'TCGA-AB-2977', 'TCGA-AB-2839', 'TCGA-AB-2868', 'TCGA-AB-2820', 'TCGA-AB-2859', 'TCGA-AB-2937', 'TCGA-AB-2949', 'TCGA-AB-2818', 'TCGA-AB-2881', 'TCGA-AB-2808', 'TCGA-AB-2938', 'TCGA-AB-2803', 'TCGA-AB-3008', 'TCGA-AB-2816', 'TCGA-AB-2806', 'TCGA-AB-2878', 'TCGA-AB-3012', 'TCGA-AB-2810', 'TCGA-AB-2872', 'TCGA-AB-2845', 'TCGA-AB-2849', 
'TCGA-AB-2914', 'TCGA-AB-2989', 'TCGA-AB-2928', 'TCGA-AB-2863', 'TCGA-AB-2955', 'TCGA-AB-2843', 'TCGA-AB-2993', 'TCGA-AB-2880', 'TCGA-AB-2940', 'TCGA-AB-2979', 'TCGA-AB-3000', 'TCGA-AB-3007', 'TCGA-AB-2867', 'TCGA-AB-2812', 'TCGA-AB-2983', 'TCGA-AB-2860', 'TCGA-AB-2829', 'TCGA-AB-2871', 'TCGA-AB-2986', 'TCGA-AB-2946', 'TCGA-AB-2995', 'TCGA-AB-2918', 'TCGA-AB-2809', 'TCGA-AB-2824', 'TCGA-AB-2825', 'TCGA-AB-2873', 'TCGA-AB-2884', 'TCGA-AB-2896', 'TCGA-AB-2919', 'TCGA-AB-2947', 'TCGA-AB-2965', 'TCGA-AB-2967', 'TCGA-AB-2974', 'TCGA-AB-2975', 'TCGA-AB-2981', 'TCGA-AB-2987', 'TCGA-AB-2932', 'TCGA-AB-2877', 'TCGA-AB-2892', 'TCGA-AB-2957', 'TCGA-AB-2998', 'TCGA-AB-2823', 'TCGA-AB-2844', 'TCGA-AB-2982', 'TCGA-AB-2834', 'TCGA-AB-2836', 'TCGA-AB-2840', 'TCGA-AB-2879', 'TCGA-AB-2909', 'TCGA-AB-2942', 'TCGA-AB-3011', 'TCGA-AB-2835', 'TCGA-AB-2826', 'TCGA-AB-2850', 'TCGA-AB-2866', 'TCGA-AB-2948', 'TCGA-AB-2842', 'TCGA-AB-2883', 'TCGA-AB-2941', 'TCGA-AB-2954', 'TCGA-AB-2848', 'TCGA-AB-2855', 'TCGA-AB-2933'] 2025-08-01 02:02:53,174 | INFO | pyMut.filters.chrom_sample_filter | Applied sample filter: TCGA-AB-2886,TCGA-AB-2917,TCGA-AB-2841 2025-08-01 02:02:53,174 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: sample:TCGA-AB-2886,TCGA-AB-2917,TCGA-AB-2841 2025-08-01 02:02:53,175 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091 2025-08-01 02:02:53,175 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 34 2025-08-01 02:02:53,175 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 2057 2025-08-01 02:02:53,176 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: sample:TCGA-AB-2886,TCGA-AB-2917,TCGA-AB-2841 2025-08-01 02:02:53,178 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17'] 2025-08-01 02:02:53,181 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17 2025-08-01 02:02:53,182 | INFO | pyMut.filters.chrom_sample_filter | Samples to 
filter: ['TCGA-AB-2886'] 2025-08-01 02:02:53,182 | INFO | pyMut.filters.chrom_sample_filter | Using MAF-style filtering with column 'Tumor_Sample_Barcode' 2025-08-01 02:02:53,185 | INFO | pyMut.filters.chrom_sample_filter | Sample columns kept: ['TCGA-AB-2886'] 2025-08-01 02:02:53,185 | INFO | pyMut.filters.chrom_sample_filter | Sample columns removed: ['TCGA-AB-2988', 'TCGA-AB-2869', 'TCGA-AB-3009', 'TCGA-AB-2830', 'TCGA-AB-2887', 'TCGA-AB-2920', 'TCGA-AB-2934', 'TCGA-AB-2905', 'TCGA-AB-2999', 'TCGA-AB-2898', 'TCGA-AB-2950', 'TCGA-AB-2923', 'TCGA-AB-2847', 'TCGA-AB-2973', 'TCGA-AB-2931', 'TCGA-AB-2936', 'TCGA-AB-2854', 'TCGA-AB-2906', 'TCGA-AB-2819', 'TCGA-AB-2894', 'TCGA-AB-2945', 'TCGA-AB-2913', 'TCGA-AB-2996', 'TCGA-AB-2952', 'TCGA-AB-2805', 'TCGA-AB-2833', 'TCGA-AB-2862', 'TCGA-AB-2890', 'TCGA-AB-2832', 'TCGA-AB-2911', 'TCGA-AB-2912', 'TCGA-AB-3001', 'TCGA-AB-2910', 'TCGA-AB-2992', 'TCGA-AB-2901', 'TCGA-AB-2822', 'TCGA-AB-2964', 'TCGA-AB-2915', 'TCGA-AB-2807', 'TCGA-AB-2997', 'TCGA-AB-2926', 'TCGA-AB-2897', 'TCGA-AB-2927', 'TCGA-AB-2895', 'TCGA-AB-2929', 'TCGA-AB-2899', 'TCGA-AB-2882', 'TCGA-AB-2935', 'TCGA-AB-2907', 'TCGA-AB-2853', 'TCGA-AB-2889', 'TCGA-AB-2900', 'TCGA-AB-2976', 'TCGA-AB-2990', 'TCGA-AB-2984', 'TCGA-AB-2802', 'TCGA-AB-2864', 'TCGA-AB-2838', 'TCGA-AB-2966', 'TCGA-AB-2903', 'TCGA-AB-2817', 'TCGA-AB-2921', 'TCGA-AB-2959', 'TCGA-AB-2846', 'TCGA-AB-2930', 'TCGA-AB-2888', 'TCGA-AB-2994', 'TCGA-AB-2828', 'TCGA-AB-3002', 'TCGA-AB-2861', 'TCGA-AB-2908', 'TCGA-AB-2991', 'TCGA-AB-2922', 'TCGA-AB-2916', 'TCGA-AB-2874', 'TCGA-AB-2980', 'TCGA-AB-2972', 'TCGA-AB-2821', 'TCGA-AB-2885', 'TCGA-AB-2870', 'TCGA-AB-2814', 'TCGA-AB-2804', 'TCGA-AB-2865', 'TCGA-AB-2978', 'TCGA-AB-2943', 'TCGA-AB-3005', 'TCGA-AB-3006', 'TCGA-AB-2827', 'TCGA-AB-2956', 'TCGA-AB-2857', 'TCGA-AB-2841', 'TCGA-AB-2917', 'TCGA-AB-2813', 'TCGA-AB-2924', 'TCGA-AB-2968', 'TCGA-AB-2970', 'TCGA-AB-2963', 'TCGA-AB-2925', 'TCGA-AB-2971', 'TCGA-AB-2904', 'TCGA-AB-2875', 'TCGA-AB-2985', 
'TCGA-AB-2876', 'TCGA-AB-2891', 'TCGA-AB-2851', 'TCGA-AB-2831', 'TCGA-AB-2939', 'TCGA-AB-2858', 'TCGA-AB-2977', 'TCGA-AB-2839', 'TCGA-AB-2868', 'TCGA-AB-2820', 'TCGA-AB-2859', 'TCGA-AB-2937', 'TCGA-AB-2949', 'TCGA-AB-2818', 'TCGA-AB-2881', 'TCGA-AB-2808', 'TCGA-AB-2938', 'TCGA-AB-2803', 'TCGA-AB-3008', 'TCGA-AB-2816', 'TCGA-AB-2806', 'TCGA-AB-2878', 'TCGA-AB-3012', 'TCGA-AB-2810', 'TCGA-AB-2872', 'TCGA-AB-2845', 'TCGA-AB-2849', 'TCGA-AB-2914', 'TCGA-AB-2989', 'TCGA-AB-2928', 'TCGA-AB-2863', 'TCGA-AB-2955', 'TCGA-AB-2843', 'TCGA-AB-2993', 'TCGA-AB-2880', 'TCGA-AB-2940', 'TCGA-AB-2979', 'TCGA-AB-3000', 'TCGA-AB-3007', 'TCGA-AB-2867', 'TCGA-AB-2812', 'TCGA-AB-2983', 'TCGA-AB-2860', 'TCGA-AB-2829', 'TCGA-AB-2871', 'TCGA-AB-2986', 'TCGA-AB-2946', 'TCGA-AB-2995', 'TCGA-AB-2918', 'TCGA-AB-2809', 'TCGA-AB-2824', 'TCGA-AB-2825', 'TCGA-AB-2873', 'TCGA-AB-2884', 'TCGA-AB-2896', 'TCGA-AB-2919', 'TCGA-AB-2947', 'TCGA-AB-2965', 'TCGA-AB-2967', 'TCGA-AB-2974', 'TCGA-AB-2975', 'TCGA-AB-2981', 'TCGA-AB-2987', 'TCGA-AB-2932', 'TCGA-AB-2877', 'TCGA-AB-2892', 'TCGA-AB-2957', 'TCGA-AB-2998', 'TCGA-AB-2823', 'TCGA-AB-2844', 'TCGA-AB-2982', 'TCGA-AB-2834', 'TCGA-AB-2836', 'TCGA-AB-2840', 'TCGA-AB-2879', 'TCGA-AB-2909', 'TCGA-AB-2942', 'TCGA-AB-3011', 'TCGA-AB-2835', 'TCGA-AB-2826', 'TCGA-AB-2850', 'TCGA-AB-2866', 'TCGA-AB-2948', 'TCGA-AB-2842', 'TCGA-AB-2883', 'TCGA-AB-2941', 'TCGA-AB-2954', 'TCGA-AB-2848', 'TCGA-AB-2855', 'TCGA-AB-2933'] 2025-08-01 02:02:53,185 | INFO | pyMut.filters.chrom_sample_filter | Applied sample filter: TCGA-AB-2886 2025-08-01 02:02:53,186 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17|sample:TCGA-AB-2886 2025-08-01 02:02:53,186 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091 2025-08-01 02:02:53,186 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 0 2025-08-01 02:02:53,187 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 2091 2025-08-01 02:02:53,188 | WARNING 
| pyMut.filters.chrom_sample_filter | No variants found matching the filter criteria: chromosome:chr17|sample:TCGA-AB-2886
=== Chromosome and Sample Filtering Examples ===

1. Filter by chromosome 17:
Original variants: 2091
Chromosome 17 variants: 99

2. Filter by chromosomes 17 and X:
Chromosomes 17 and X variants: 205

3. Filter by first 3 samples: ['TCGA-AB-2886', 'TCGA-AB-2917', 'TCGA-AB-2841']
Filtered by samples variants: 34
Unique samples in filtered data: Tumor_Sample_Barcode    3
Tumor_Sample_Barcode    3
dtype: int64

4. Combined filter - chromosome 17 + first sample:
Combined filter variants: 0
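Example 4 returns zero variants, and its log line reads chromosome:chr17|sample:TCGA-AB-2886, which is consistent with the chromosome and sample criteria being combined with a logical AND. A minimal sketch of that behavior on plain dictionaries, assuming AND semantics and the '17' -> 'chr17' normalization seen in the logs; normalize_chrom, as_list, and filter_rows are hypothetical helpers, not pyMut functions:

```python
def normalize_chrom(chrom):
    """Add a 'chr' prefix when it is missing, e.g. '17' -> 'chr17'."""
    chrom = str(chrom)
    return chrom if chrom.startswith("chr") else f"chr{chrom}"


def as_list(value):
    """Wrap a single value in a list; copy lists and tuples into a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def filter_rows(rows, chrom=None, sample=None):
    """Keep rows that match every criterion that was supplied (logical AND)."""
    chroms = None if chrom is None else {normalize_chrom(c) for c in as_list(chrom)}
    samples = None if sample is None else set(as_list(sample))
    return [
        row for row in rows
        if (chroms is None or normalize_chrom(row["CHROM"]) in chroms)
        and (samples is None or row["Tumor_Sample_Barcode"] in samples)
    ]


# Three rows taken from the tables in this notebook.
rows = [
    {"CHROM": "chr9", "POS": 100077177, "Tumor_Sample_Barcode": "TCGA-AB-2886"},
    {"CHROM": "chr17", "POS": 7574003, "Tumor_Sample_Barcode": "TCGA-AB-2938"},
    {"CHROM": "chr17", "POS": 7574018, "Tumor_Sample_Barcode": "TCGA-AB-2904"},
]

by_chrom = filter_rows(rows, chrom="17")
combined = filter_rows(rows, chrom="17", sample="TCGA-AB-2886")
print(len(by_chrom), len(combined))  # 2 0
```

An empty result from a combined filter therefore does not by itself indicate an error: it can simply mean that the sample has no variants on the requested chromosome.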
2. Genomic Range Filtering (region)
This method filters by genomic coordinates using chromosome, start, and end positions. It comes from genomic_range.py.
print("=== Genomic Range Filtering Examples ===")
# Example 1: Filter a specific region on chromosome 17
print("\n1. Filter chromosome 17, positions 7,500,000 to 8,000,000:")
filtered_region = py_mut.region(chrom='17', start=7500000, end=8000000)
print(f"Original variants: {len(py_mut.data)}")
print(f"Region variants: {len(filtered_region.data)}")
if len(filtered_region.data) > 0:
    print("Genes in this region:")
    genes_in_region = filtered_region.data['Hugo_Symbol'].value_counts().head(10)
    display(genes_in_region)
# Example 2: Filter a smaller region
print("\n2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):")
filtered_tp53_region = py_mut.region(chrom='17', start=7570000, end=7590000)
print(f"TP53 region variants: {len(filtered_tp53_region.data)}")
if len(filtered_tp53_region.data) > 0:
    print("Variants in TP53 region:")
    display(filtered_tp53_region.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification']].head())
2025-08-01 02:02:53,222 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-08-01 02:02:53,222 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-08-01 02:02:53,228 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-08-01 02:02:53,228 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7500000-8000000
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants after filter: 20
2025-08-01 02:02:53,229 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2071
2025-08-01 02:02:53,230 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7500000-8000000
=== Genomic Range Filtering Examples ===

1. Filter chromosome 17, positions 7,500,000 to 8,000,000:
Original variants: 2091
Region variants: 20
Genes in this region:
Hugo_Symbol
TP53      19
GUCY2D     1
Name: count, dtype: int64[pyarrow]
2025-08-01 02:02:53,234 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-08-01 02:02:53,235 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-08-01 02:02:53,240 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-08-01 02:02:53,241 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7570000-7590000
2025-08-01 02:02:53,241 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2072
2025-08-01 02:02:53,242 | INFO | pyMut.filters.genomic_range | Successfully filtered genomic region: chr17:7570000-7590000

2. Filter chromosome 17, positions 7,570,000 to 7,590,000 (TP53 region):
TP53 region variants: 19
Variants in TP53 region:
| | Hugo_Symbol | CHROM | POS | REF | ALT | Variant_Classification |
---|---|---|---|---|---|---|
1932 | TP53 | chr17 | 7574003 | G | - | FRAME_SHIFT_DEL |
1933 | TP53 | chr17 | 7574018 | G | A | MISSENSE_MUTATION |
1934 | TP53 | chr17 | 7576897 | G | A | NONSENSE_MUTATION |
1935 | TP53 | chr17 | 7577081 | T | C | MISSENSE_MUTATION |
1936 | TP53 | chr17 | 7577100 | T | C | MISSENSE_MUTATION |
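The region filter above amounts to an interval test on CHROM and POS. The output does not show whether pyMut treats start and end as inclusive, so the sketch below simply assumes inclusive bounds; in_region is a hypothetical helper, not a pyMut function:

```python
def in_region(row, chrom, start, end):
    """True when the row lies on `chrom` and start <= POS <= end (inclusive, by assumption)."""
    return row["CHROM"] == chrom and start <= row["POS"] <= end


# Positions taken from the tables in this notebook.
variants = [
    {"CHROM": "chr9", "POS": 100077177},
    {"CHROM": "chr17", "POS": 7574003},
    {"CHROM": "chr17", "POS": 7577100},
]

hits = [v for v in variants if in_region(v, "chr17", 7_570_000, 7_590_000)]
print(len(hits))  # 2
```

Under this assumption, a variant located exactly at position 7,590,000 would be kept; an exclusive end bound would drop it.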
3. Gene-based Filtering (gen_region)
This method filters by gene name using the Hugo_Symbol column. It comes from genomic_range.py.
print("=== Gene-based Filtering Examples ===")
# Get the most common genes in the dataset
common_genes = py_mut.data['Hugo_Symbol'].value_counts().head(5)
print("Most common genes in the dataset:")
display(common_genes)
# Example 1: Filter by TP53 gene
print("\n1. Filter by TP53 gene:")
filtered_tp53 = py_mut.gen_region('TP53')
print(f"TP53 variants: {len(filtered_tp53.data)}")
if len(filtered_tp53.data) > 0:
    print("TP53 variant types:")
    tp53_variants = filtered_tp53.data['Variant_Classification'].value_counts()
    display(tp53_variants)
# Example 2: Filter by the most common gene
most_common_gene = common_genes.index[0]
print(f"\n2. Filter by most common gene ({most_common_gene}):")
filtered_common = py_mut.gen_region(most_common_gene)
print(f"{most_common_gene} variants: {len(filtered_common.data)}")
# Example 3: Filter by multiple genes (using multiple calls)
print("\n3. Filter by multiple genes (FLT3, NPM1, DNMT3A):")
genes_of_interest = ['FLT3', 'NPM1', 'DNMT3A']
for gene in genes_of_interest:
    filtered_gene = py_mut.gen_region(gene)
    print(f"{gene}: {len(filtered_gene.data)} variants")
=== Gene-based Filtering Examples ===
Most common genes in the dataset:
Hugo_Symbol
FLT3      38
DNMT3A    29
TET2      26
CEBPA     19
TP53      19
Name: count, dtype: int64[pyarrow]
2025-08-01 02:02:53,307 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-08-01 02:02:53,307 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-08-01 02:02:53,308 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-08-01 02:02:53,308 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Gene filter applied: TP53
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091
2025-08-01 02:02:53,312 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,313 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2072
2025-08-01 02:02:53,313 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: TP53

1. Filter by TP53 gene:
TP53 variants: 19
TP53 variant types:
Variant_Classification
MISSENSE_MUTATION    11
SPLICE_SITE           3
FRAME_SHIFT_DEL       2
FRAME_SHIFT_INS       2
NONSENSE_MUTATION     1
Name: count, dtype: int64[pyarrow]
2025-08-01 02:02:53,316 | INFO | pyMut.filters.genomic_range | Applying gene filter for: FLT3 2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Source format detected: MAF 2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column 2025-08-01 02:02:53,317 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol 2025-08-01 02:02:53,319 | INFO | pyMut.filters.genomic_range | Gene filter applied: FLT3 2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091 2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants after filter: 38 2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2053 2025-08-01 02:02:53,320 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: FLT3 2025-08-01 02:02:53,321 | INFO | pyMut.filters.genomic_range | Applying gene filter for: FLT3 2025-08-01 02:02:53,322 | INFO | pyMut.filters.genomic_range | Source format detected: MAF 2025-08-01 02:02:53,322 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column 2025-08-01 02:02:53,323 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol 2025-08-01 02:02:53,325 | INFO | pyMut.filters.genomic_range | Gene filter applied: FLT3 2025-08-01 02:02:53,325 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091 2025-08-01 02:02:53,326 | INFO | pyMut.filters.genomic_range | Variants after filter: 38 2025-08-01 02:02:53,326 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2053 2025-08-01 02:02:53,326 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: FLT3 2025-08-01 02:02:53,327 | INFO | pyMut.filters.genomic_range | Applying gene filter for: NPM1 2025-08-01 02:02:53,327 | INFO | pyMut.filters.genomic_range | Source format detected: MAF 2025-08-01 02:02:53,327 | INFO | pyMut.filters.genomic_range | Processing 
MAF format - looking for Hugo_Symbol column 2025-08-01 02:02:53,328 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol 2025-08-01 02:02:53,329 | INFO | pyMut.filters.genomic_range | Gene filter applied: NPM1 2025-08-01 02:02:53,330 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091 2025-08-01 02:02:53,330 | INFO | pyMut.filters.genomic_range | Variants after filter: 14 2025-08-01 02:02:53,330 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2077 2025-08-01 02:02:53,330 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: NPM1 2025-08-01 02:02:53,331 | INFO | pyMut.filters.genomic_range | Applying gene filter for: DNMT3A 2025-08-01 02:02:53,332 | INFO | pyMut.filters.genomic_range | Source format detected: MAF 2025-08-01 02:02:53,332 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column 2025-08-01 02:02:53,333 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol 2025-08-01 02:02:53,335 | INFO | pyMut.filters.genomic_range | Gene filter applied: DNMT3A 2025-08-01 02:02:53,335 | INFO | pyMut.filters.genomic_range | Variants before filter: 2091 2025-08-01 02:02:53,336 | INFO | pyMut.filters.genomic_range | Variants after filter: 29 2025-08-01 02:02:53,336 | INFO | pyMut.filters.genomic_range | Variants filtered out: 2062 2025-08-01 02:02:53,336 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: DNMT3A

2. Filter by most common gene (FLT3):
FLT3 variants: 38

3. Filter by multiple genes (FLT3, NPM1, DNMT3A):
FLT3: 38 variants
NPM1: 14 variants
DNMT3A: 29 variants
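Example 3 calls gen_region once per gene. When only the per-gene counts are needed, a single pass over the gene symbols can produce the same kind of tally. A minimal sketch with the standard library, assuming exact matching on the symbol (which may differ from how gen_region matches names); the symbol list here is made up for illustration and is not the TCGA data:

```python
from collections import Counter

# Hypothetical Hugo_Symbol values for six rows (not the real dataset).
hugo_symbols = ["FLT3", "NPM1", "FLT3", "DNMT3A", "TP53", "FLT3"]
genes_of_interest = ["FLT3", "NPM1", "DNMT3A"]

# Count every symbol once, then look up only the genes of interest.
counts = Counter(hugo_symbols)
per_gene = {gene: counts[gene] for gene in genes_of_interest}

print(per_gene)  # {'FLT3': 3, 'NPM1': 1, 'DNMT3A': 1}
```

A Counter returns 0 for symbols that never occur, so genes absent from the data need no special handling.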
4. PASS Filter Check (pass_filter)
This method checks if specific records have FILTER == "PASS". It comes from pass_filter.py.
Note: This method returns a boolean value, not a filtered dataset.
print("=== PASS Filter Check Examples ===")
# First, let's see what FILTER values are present in our data
if 'FILTER' in py_mut.data.columns:
    print("FILTER column values:")
    filter_values = py_mut.data['FILTER'].value_counts()
    display(filter_values)
    # Example 1: Check specific records for PASS filter
    print("\n1. Checking specific records for PASS filter:")
    # Get a few sample records
    sample_records = py_mut.data.head(3)
    for idx, row in sample_records.iterrows():
        chrom = row['CHROM']
        pos = row['POS']
        ref = row['REF']
        alt = row['ALT']
        is_pass = py_mut.pass_filter(chrom=chrom, pos=pos, ref=ref, alt=alt)
        print(f"Record {chrom}:{pos} {ref}>{alt} - PASS: {is_pass}")
    # Example 2: Check a non-existent record
    print("\n2. Checking a non-existent record:")
    is_pass_fake = py_mut.pass_filter(chrom='1', pos=999999999, ref='A', alt='T')
    print(f"Non-existent record - PASS: {is_pass_fake}")
else:
    print("FILTER column not found in the dataset")
    print("Available columns:", list(py_mut.data.columns))
=== PASS Filter Check Examples ===
FILTER column values:
FILTER
.    2091
Name: count, dtype: int64
2025-08-01 02:02:53,391 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100077177 T>C
2025-08-01 02:02:53,391 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,398 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,399 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100085148 G>A
2025-08-01 02:02:53,399 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,405 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,406 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr9:100971322 A>C
2025-08-01 02:02:53,406 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,411 | INFO | pyMut.filters.pass_filter | PASS filter result: False
2025-08-01 02:02:53,412 | INFO | pyMut.filters.pass_filter | Checking PASS filter for: chr1:999999999 A>T
2025-08-01 02:02:53,412 | INFO | pyMut.filters.pass_filter | Attempting to use PyArrow optimization
2025-08-01 02:02:53,417 | INFO | pyMut.filters.pass_filter | Record not found: chr1:999999999 A>T

1. Checking specific records for PASS filter:
Record chr9:100077177 T>C - PASS: False
Record chr9:100085148 G>A - PASS: False
Record chr9:100971322 A>C - PASS: False

2. Checking a non-existent record:
Non-existent record - PASS: False
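In the output above, False is returned both for records whose FILTER value is "." and for the record that does not exist; only the log distinguishes the two cases. If a caller needs to tell them apart, a three-valued check is one option. A minimal sketch on a plain dictionary; pass_status is a hypothetical helper and does not reflect how pyMut implements pass_filter:

```python
def pass_status(records, chrom, pos, ref, alt):
    """Return True/False for a known record's PASS status, or None if the record is absent."""
    key = (chrom, pos, ref, alt)
    if key not in records:
        return None
    return records[key] == "PASS"


records = {
    ("chr9", 100077177, "T", "C"): ".",   # FILTER value as shown in the output above
    ("chr1", 12345, "A", "G"): "PASS",    # made-up record for illustration
}

print(pass_status(records, "chr9", 100077177, "T", "C"))  # False
print(pass_status(records, "chr1", 12345, "A", "G"))      # True
print(pass_status(records, "chr1", 999999999, "A", "T"))  # None
```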
5. Combining Multiple Filters
You can chain multiple filtering operations to create complex filters.
print("=== Combining Multiple Filters ===")
# Example: Filter by chromosome 17, then by TP53 gene, then by genomic region
print("1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region")
# Step 1: Filter by chromosome 17
step1 = py_mut.filter_by_chrom_sample(chrom='17')
print(f"Step 1 - Chromosome 17: {len(step1.data)} variants")
# Step 2: Filter by TP53 gene
step2 = step1.gen_region('TP53')
print(f"Step 2 - TP53 gene: {len(step2.data)} variants")
# Step 3: Filter by specific region (TP53 locus)
step3 = step2.region(chrom='17', start=7570000, end=7590000)
print(f"Step 3 - TP53 region: {len(step3.data)} variants")
if len(step3.data) > 0:
    print("\nFinal filtered results:")
    display(step3.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification', 'Tumor_Sample_Barcode']])
# Show the filter history
print(f"\nFilter history: {step3.metadata.filters}")
2025-08-01 02:02:53,451 | INFO | pyMut.filters.chrom_sample_filter | Chromosomes to filter: ['chr17']
2025-08-01 02:02:53,454 | INFO | pyMut.filters.chrom_sample_filter | Applied chromosome filter: chr17
2025-08-01 02:02:53,455 | INFO | pyMut.filters.chrom_sample_filter | Combined filter applied: chromosome:chr17
2025-08-01 02:02:53,455 | INFO | pyMut.filters.chrom_sample_filter | Variants before filter: 2091
2025-08-01 02:02:53,456 | INFO | pyMut.filters.chrom_sample_filter | Variants after filter: 99
2025-08-01 02:02:53,456 | INFO | pyMut.filters.chrom_sample_filter | Variants filtered out: 1992
2025-08-01 02:02:53,457 | INFO | pyMut.filters.chrom_sample_filter | Successfully applied filter: chromosome:chr17
2025-08-01 02:02:53,457 | INFO | pyMut.filters.genomic_range | Applying gene filter for: TP53
2025-08-01 02:02:53,457 | INFO | pyMut.filters.genomic_range | Source format detected: MAF
2025-08-01 02:02:53,458 | INFO | pyMut.filters.genomic_range | Processing MAF format - looking for Hugo_Symbol column
2025-08-01 02:02:53,458 | INFO | pyMut.filters.genomic_range | Found Hugo_Symbol column: Hugo_Symbol
2025-08-01 02:02:53,462 | INFO | pyMut.filters.genomic_range | Gene filter applied: TP53
2025-08-01 02:02:53,463 | INFO | pyMut.filters.genomic_range | Variants before filter: 99
2025-08-01 02:02:53,463 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,464 | INFO | pyMut.filters.genomic_range | Variants filtered out: 80
2025-08-01 02:02:53,464 | INFO | pyMut.filters.genomic_range | Successfully filtered data for gene: TP53
2025-08-01 02:02:53,465 | INFO | pyMut.filters.genomic_range | Chromosome formatted: '17' -> 'chr17'
2025-08-01 02:02:53,465 | INFO | pyMut.filters.genomic_range | Attempting to use PyArrow optimization
2025-08-01 02:02:53,467 | INFO | pyMut.filters.genomic_range | PyArrow optimization successful
2025-08-01 02:02:53,468 | INFO | pyMut.filters.genomic_range | Genomic filter applied: chr17:7570000-7590000
2025-08-01 02:02:53,468 | INFO | pyMut.filters.genomic_range | Variants before filter: 19
2025-08-01 02:02:53,468 | INFO | pyMut.filters.genomic_range | Variants after filter: 19
2025-08-01 02:02:53,469 | INFO | pyMut.filters.genomic_range | Variants filtered out: 0
2025-08-01 02:02:53,469 | WARNING | pyMut.filters.genomic_range | Filter did not remove any variants - check region coordinates
=== Combining Multiple Filters ===
1. Multi-step filtering: Chromosome 17 → TP53 gene → specific region
Step 1 - Chromosome 17: 99 variants
Step 2 - TP53 gene: 19 variants
Step 3 - TP53 region: 19 variants

Final filtered results:
| | Hugo_Symbol | CHROM | POS | REF | ALT | Variant_Classification | Tumor_Sample_Barcode |
---|---|---|---|---|---|---|---|
1932 | TP53 | chr17 | 7574003 | G | - | FRAME_SHIFT_DEL | TCGA-AB-2938 |
1933 | TP53 | chr17 | 7574018 | G | A | MISSENSE_MUTATION | TCGA-AB-2904 |
1934 | TP53 | chr17 | 7576897 | G | A | NONSENSE_MUTATION | TCGA-AB-2908 |
1935 | TP53 | chr17 | 7577081 | T | C | MISSENSE_MUTATION | TCGA-AB-2952 |
1936 | TP53 | chr17 | 7577100 | T | C | MISSENSE_MUTATION | TCGA-AB-2829 |
1937 | TP53 | chr17 | 7577121 | G | A | MISSENSE_MUTATION | TCGA-AB-2943 |
1938 | TP53 | chr17 | 7577538 | C | T | MISSENSE_MUTATION | TCGA-AB-2935 |
1939 | TP53 | chr17 | 7577609 | C | T | SPLICE_SITE | TCGA-AB-2829 |
1940 | TP53 | chr17 | 7578181 | - | GCGGCTC | FRAME_SHIFT_INS | TCGA-AB-2820 |
1941 | TP53 | chr17 | 7578206 | T | C | MISSENSE_MUTATION | TCGA-AB-2878 |
1942 | TP53 | chr17 | 7578265 | A | C | MISSENSE_MUTATION | TCGA-AB-2941 |
1943 | TP53 | chr17 | 7578272 | G | A | MISSENSE_MUTATION | TCGA-AB-2885 |
1944 | TP53 | chr17 | 7578394 | T | C | MISSENSE_MUTATION | TCGA-AB-2938 |
1945 | TP53 | chr17 | 7578403 | C | T | MISSENSE_MUTATION | TCGA-AB-2813 |
1946 | TP53 | chr17 | 7578414 | A | - | FRAME_SHIFT_DEL | TCGA-AB-2878 |
1947 | TP53 | chr17 | 7578507 | G | C | MISSENSE_MUTATION | TCGA-AB-2908 |
1948 | TP53 | chr17 | 7578555 | C | T | SPLICE_SITE | TCGA-AB-2868 |
1949 | TP53 | chr17 | 7579312 | C | T | SPLICE_SITE | TCGA-AB-2838 |
1950 | TP53 | chr17 | 7579569 | - | CCATCCAG | FRAME_SHIFT_INS | TCGA-AB-2860 |
Filter history: ['.', 'chromosome:chr17', 'gene_filter:Hugo_Symbol:TP53', 'genomic_region:chr17:7570000-7590000']
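The chained calls above each return a new object, and the filter history grows by one entry per step. The warning in Step 3 is expected here: all 19 TP53 variants already fall inside the requested region, so nothing is removed. The pattern of returning a new object and appending to a history can be sketched as follows; FilteredSet and where are hypothetical names chosen for illustration, not pyMut's classes or methods, and the rows marked as illustrative use made-up positions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilteredSet:
    """Hypothetical stand-in for a filterable container; not pyMut's class."""
    rows: tuple
    history: tuple = ()

    def where(self, label, predicate):
        """Return a new FilteredSet; the original object is left untouched."""
        kept = tuple(row for row in self.rows if predicate(row))
        return FilteredSet(kept, self.history + (label,))


base = FilteredSet(rows=(
    {"Hugo_Symbol": "TP53", "CHROM": "chr17", "POS": 7574003},
    {"Hugo_Symbol": "TP53", "CHROM": "chr17", "POS": 7577100},
    {"Hugo_Symbol": "GUCY2D", "CHROM": "chr17", "POS": 7900000},  # illustrative position
    {"Hugo_Symbol": "FLT3", "CHROM": "chr13", "POS": 28600000},   # illustrative position
))

step1 = base.where("chromosome:chr17", lambda r: r["CHROM"] == "chr17")
step2 = step1.where("gene:TP53", lambda r: r["Hugo_Symbol"] == "TP53")

print(len(base.rows), len(step1.rows), len(step2.rows))  # 4 3 2
print(step2.history)  # ('chromosome:chr17', 'gene:TP53')
```

Because each step returns a new object, intermediate results such as step1 remain available for other analyses.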
Summary
This notebook demonstrated the four main filtering methods available in PyMutation:
- filter_by_chrom_sample: Filters by chromosome and/or sample
  - Parameters: chrom (str or list), sample (str or list), sample_column (str)
  - Returns: New PyMutation object with filtered data
- region: Filters by genomic coordinates
  - Parameters: chrom (str), start (int), end (int)
  - Returns: New PyMutation object with filtered data
- gen_region: Filters by gene name
  - Parameters: gen_name (str)
  - Returns: New PyMutation object with filtered data
- pass_filter: Checks if specific records have FILTER == "PASS"
  - Parameters: chrom (str), pos (int), ref (str), alt (str)
  - Returns: Boolean value
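The three methods that return a PyMutation object (filter_by_chrom_sample, region, gen_region) leave the original object unchanged and record each applied filter in the result's metadata; pass_filter only returns a boolean.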