filter_by_chrom_sample - Chromosome and Sample Filter¶

The filter_by_chrom_sample method allows filtering PyMutation data by chromosome and/or sample, providing granular control over which data to include in the analysis.

What is filter_by_chrom_sample?¶

It is a versatile method that allows filtering data by chromosome, sample, or both criteria simultaneously. In addition to filtering rows, it also handles column filtering when samples are specified, maintaining data format integrity.

Main Features¶

Dual filtering: By chromosome and/or sample in a single operation
Row and column filtering: Removes both irrelevant rows and columns
Multiple value support: Accepts lists of chromosomes and samples
MAF/VCF compatibility: Handles both formats automatically
Metadata preservation: Records all applied filters
Automatic validation: Verifies the existence of chromosomes and samples
Detailed logging: Provides information about the filtering process

Basic Usage¶

from pyMut.input import read_maf

# Load data
py_mut = read_maf("mutations.maf")

# Filter by chromosome only
chr17_data = py_mut.filter_by_chrom_sample(chrom="chr17")

# Filter by sample only
sample_data = py_mut.filter_by_chrom_sample(sample="TCGA-AB-2802")

# Filter by both criteria
chr17_sample = py_mut.filter_by_chrom_sample(
    chrom="chr17", 
    sample="TCGA-AB-2802"
)

print(f"Mutations in chr17: {len(chr17_data.data)}")
print(f"Mutations in sample: {len(sample_data.data)}")
print(f"Mutations chr17 + sample: {len(chr17_sample.data)}")

Parameters¶

chrom (str, list, optional)¶

Description: Chromosome(s) to filter
Accepted formats: "chr17", "17", ["chr1", "chr17"], ["X", "Y"]
Normalization: Automatically normalized to standard format
Example: "chr17" or ["chr1", "chr2", "chrX"]

sample (str, list, optional)¶

Description: Sample(s) to filter
Format: Sample identifiers as they appear in the data
Example: "TCGA-AB-2802" or ["TCGA-AB-2802", "TCGA-AB-2803"]

sample_column (str, optional)¶

Description: Name of the column containing sample information
Default: "Tumor_Sample_Barcode" (MAF standard)
Usage: For data with non-standard column names

Filtering Behavior¶

Chromosome-Only Filtering¶

# Keeps all columns, filters only rows
chr_filtered = py_mut.filter_by_chrom_sample(chrom="chr17")

# Result: Only mutations in chr17, all samples preserved
print(f"Unique chromosomes: {chr_filtered.data['CHROM'].unique()}")
print(f"Preserved samples: {len(chr_filtered.samples)}")

Sample-Only Filtering¶

# Filters both rows and columns
sample_filtered = py_mut.filter_by_chrom_sample(sample="TCGA-AB-2802")

# Result: Only mutations from the sample, only relevant columns
print(f"Samples in data: {sample_filtered.samples}")
print(f"Sample columns: {[col for col in sample_filtered.data.columns if 'TCGA' in col]}")

Combined Filtering¶

# Applies both filters simultaneously
combined = py_mut.filter_by_chrom_sample(
    chrom=["chr17", "chrX"], 
    sample=["TCGA-AB-2802", "TCGA-AB-2803"]
)

print(f"Chromosomes: {combined.data['CHROM'].unique()}")
print(f"Samples: {combined.samples}")

Detailed Examples¶

Specific Chromosome Analysis¶

from pyMut.input import read_maf
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)

# Load TCGA data
py_mut = read_maf("src/pyMut/data/examples/tcga_laml.maf.gz")
print(f"Original data: {len(py_mut.data)} mutations, {len(py_mut.samples)} samples")

# Analysis of oncologically interesting chromosomes
oncology_chromosomes = ["chr17", "chr13", "chr3", "chr7", "chr12"]

print(f"\n=== Chromosome Analysis ===")
for chrom in oncology_chromosomes:
    try:
        # Filter by chromosome
        chrom_data = py_mut.filter_by_chrom_sample(chrom=chrom)

        print(f"\n{chrom}:")
        print(f"  • Mutations: {len(chrom_data.data)}")

        if len(chrom_data.data) > 0:
            # Most mutated genes in this chromosome
            top_genes = chrom_data.data['Hugo_Symbol'].value_counts().head(3)
            print(f"  • Top genes:")
            for gene, count in top_genes.items():
                print(f"    - {gene}: {count} mutations")

            # Affected samples
            affected_samples = chrom_data.data['Tumor_Sample_Barcode'].nunique()
            print(f"  • Affected samples: {affected_samples}")

    except Exception as e:
        print(f"❌ Error processing {chrom}: {e}")

Sample-Specific Analysis¶

# Select samples of interest
samples_of_interest = py_mut.samples[:5]  # First 5 samples

print(f"\n=== Sample Analysis ===")
for sample in samples_of_interest:
    try:
        # Filter by sample
        sample_data = py_mut.filter_by_chrom_sample(sample=sample)

        print(f"\n{sample}:")
        print(f"  • Total mutations: {len(sample_data.data)}")

        if len(sample_data.data) > 0:
            # Mutation types
            mutation_types = sample_data.data['Variant_Classification'].value_counts()
            print(f"  • Main mutation types:")
            for mut_type, count in mutation_types.head(3).items():
                print(f"    - {mut_type}: {count}")

            # Chromosomal distribution
            chrom_dist = sample_data.data['CHROM'].value_counts()
            print(f"  • Most affected chromosomes:")
            for chrom, count in chrom_dist.head(3).items():
                print(f"    - {chrom}: {count} mutations")

    except Exception as e:
        print(f"❌ Error processing {sample}: {e}")

Multi-Sample Comparative Analysis¶

# Compare multiple samples
sample_groups = {
    "Group_A": ["TCGA-AB-2802", "TCGA-AB-2803", "TCGA-AB-2804"],
    "Group_B": ["TCGA-AB-2805", "TCGA-AB-2806", "TCGA-AB-2807"]
}

print(f"\n=== Multi-Sample Comparative Analysis ===")
for group_name, samples in sample_groups.items():
    try:
        # Filter by sample group
        group_data = py_mut.filter_by_chrom_sample(sample=samples)

        print(f"\n{group_name} ({len(samples)} samples):")
        print(f"  • Total mutations: {len(group_data.data)}")
        print(f"  • Average mutations per sample: {len(group_data.data)/len(samples):.1f}")

        if len(group_data.data) > 0:
            # Most mutated genes in the group
            top_genes = group_data.data['Hugo_Symbol'].value_counts().head(5)
            print(f"  • Most mutated genes:")
            for gene, count in top_genes.items():
                print(f"    - {gene}: {count} mutations")

    except Exception as e:
        print(f"❌ Error processing {group_name}: {e}")

Combined Filtering Analysis¶

# Analysis combining chromosome and sample filters
print(f"\n=== Combined Analysis: chr17 + Specific Samples ===")

# Select samples with high mutation load
high_mutation_samples = []
for sample in py_mut.samples[:10]:  # Check first 10 samples
    sample_data = py_mut.filter_by_chrom_sample(sample=sample)
    if len(sample_data.data) > 50:  # Samples with >50 mutations
        high_mutation_samples.append(sample)

print(f"Samples with high mutation load: {len(high_mutation_samples)}")

if high_mutation_samples:
    # Filter chr17 in high-mutation samples
    chr17_high_mut = py_mut.filter_by_chrom_sample(
        chrom="chr17",
        sample=high_mutation_samples
    )

    print(f"chr17 mutations in high-mutation samples: {len(chr17_high_mut.data)}")

    if len(chr17_high_mut.data) > 0:
        # Genes most affected in chr17
        chr17_genes = chr17_high_mut.data['Hugo_Symbol'].value_counts()
        print(f"Most mutated genes in chr17:")
        for gene, count in chr17_genes.head(5).items():
            print(f"  - {gene}: {count} mutations")

Advanced Usage¶

Filtering by Chromosome Lists¶

# Filter multiple chromosomes simultaneously
sex_chromosomes = py_mut.filter_by_chrom_sample(chrom=["chrX", "chrY"])
autosomes_1_5 = py_mut.filter_by_chrom_sample(chrom=["chr1", "chr2", "chr3", "chr4", "chr5"])

print(f"Mutations in sex chromosomes: {len(sex_chromosomes.data)}")
print(f"Mutations in chromosomes 1-5: {len(autosomes_1_5.data)}")

Filtering by Sample List¶

# Define a list of samples of interest
samples_of_interest = [
    "TCGA-AB-2802",
    "TCGA-AB-2803", 
    "TCGA-AB-2804",
    "TCGA-AB-2805",
    "TCGA-AB-2806"
]

# Filter data to include only specified samples
filtered_data = py_mut.filter_by_chrom_sample(sample=samples_of_interest)
print(f"Mutations in selected samples: {len(filtered_data.data)}")
print(f"Number of samples in filtered data: {len(filtered_data.samples)}")

# You can also combine with chromosome filtering
chr17_selected_samples = py_mut.filter_by_chrom_sample(
    chrom="chr17",
    sample=samples_of_interest
)
print(f"chr17 mutations in selected samples: {len(chr17_selected_samples.data)}")

# Filter by sample subsets for comparative analysis
group_a = ["TCGA-AB-2802", "TCGA-AB-2803"]
group_b = ["TCGA-AB-2804", "TCGA-AB-2805"]

group_a_data = py_mut.filter_by_chrom_sample(sample=group_a)
group_b_data = py_mut.filter_by_chrom_sample(sample=group_b)

print(f"Group A mutations: {len(group_a_data.data)}")
print(f"Group B mutations: {len(group_b_data.data)}")

Error Handling and Validation¶

# The method includes robust validation
try:
    # Valid filtering
    valid_filter = py_mut.filter_by_chrom_sample(chrom="chr17", sample="TCGA-AB-2802")
    print("✅ Valid filtering successful")

except ValueError as e:
    print(f"❌ Validation error: {e}")

# Handle non-existent chromosomes
try:
    invalid_chrom = py_mut.filter_by_chrom_sample(chrom="chr99")
except KeyError as e:
    print(f"❌ Chromosome not found: {e}")

# Handle non-existent samples
try:
    invalid_sample = py_mut.filter_by_chrom_sample(sample="NON_EXISTENT_SAMPLE")
except KeyError as e:
    print(f"❌ Sample not found: {e}")

Integration with Other Methods¶

Chaining with Other Filters¶

# Chain multiple filters
filtered_data = (py_mut
    .filter_by_chrom_sample(chrom="chr17")
    .filter_by_pass()  # If available
    .filter_by_tissue_expression([('BRCA', 5)]))  # If available

print(f"Data after chained filters: {len(filtered_data.data)}")

Combination with Analysis Methods¶

# Filter and then analyze
chr17_data = py_mut.filter_by_chrom_sample(chrom="chr17")

# Perform TMB analysis on filtered data
if hasattr(chr17_data, 'calculate_tmb_analysis'):
    tmb_results = chr17_data.calculate_tmb_analysis()
    print(f"TMB analysis on chr17: {len(tmb_results['analysis'])} samples")

# Generate visualizations
if hasattr(chr17_data, 'summary_plot'):
    fig = chr17_data.summary_plot(title="chr17 Mutation Analysis")

Metadata and Tracking¶

# Filters are automatically recorded in metadata
original = py_mut
filtered = py_mut.filter_by_chrom_sample(chrom="chr17", sample="TCGA-AB-2802")

print("Applied filters:")
for filter_info in filtered.metadata.filters:
    print(f"  - {filter_info}")

# Example output:
# - filter_by_chrom_sample:chrom=chr17,sample=TCGA-AB-2802

Common Use Cases¶

Chromosome-specific analysis: Focus on specific chromosomes of oncological interest
Sample subset analysis: Analyze specific patient cohorts
Quality control: Remove problematic samples or chromosomes
Comparative studies: Compare mutation patterns between sample groups
Performance optimization: Reduce dataset size for faster processing

Technical Notes¶

The method preserves all original metadata and sample information
Chromosome normalization handles formats with and without "chr" prefix
Sample filtering also removes corresponding columns from the data
Compatible with both MAF and VCF-derived data formats
Thread-safe and suitable for parallel processing
Maintains data integrity and format consistency