VEP Annotation Demo
This notebook demonstrates how to use VEP (Variant Effect Predictor) to annotate genomic variants from both VCF and MAF files. We'll use the pyMut library's VEP annotation functions to:
- Annotate a VCF file with GRCh38 assembly: subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf
- Annotate a MAF file with GRCh37 assembly: tcga_laml.maf.gz
The notebook shows how to locate these files and the VEP resources they need, run VEP annotation, and display the results, with VEP statistics disabled to keep the output short.
In [1]:
from pathlib import Path
import os
# Import VEP annotation functions
from pyMut.annotate.vep_annotate import (
    wrap_vcf_vep_annotate_unified,
    wrap_maf_vep_annotate_protein,
)
# Set the locale environment variables for VEP, which runs on Perl
os.environ['LC_ALL'] = 'C'
os.environ['LANG'] = 'C'
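Since the annotation functions invoke the `vep` command line tool, it can help to confirm that the executable is reachable before going further. The following is a minimal sketch using only the standard library; the helper name `find_executable` is hypothetical and not part of pyMut.

```python
import shutil
from typing import Optional


def find_executable(name: str = "vep") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


# Hypothetical pre-flight check: report whether the VEP executable is visible.
vep_path = find_executable("vep")
print(f"vep executable: {vep_path if vep_path else 'not found on PATH'}")
```

If the executable is not found, activating the environment where VEP is installed before starting the notebook is usually the first thing to check.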
Define File Paths
We'll define the paths to the input files and VEP resources needed for annotation.
In [2]:
# Input files
VCF_FILE = "../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf"
MAF_FILE = "../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz"
# VEP cache directories and FASTA files
VCF_CACHE_DIR = "../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh38"
MAF_CACHE_DIR = "../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh37"
VCF_FASTA = "../../../src/pyMut/data/resources/genome/GRCh38/GRCh38.p14.genome.fa"
MAF_FASTA = "../../../src/pyMut/data/resources/genome/GRCh37/GRCh37.p13.genome.fa"
# Check if files exist
vcf_exists = Path(VCF_FILE).exists()
maf_exists = Path(MAF_FILE).exists()
vcf_cache_exists = Path(VCF_CACHE_DIR).exists()
maf_cache_exists = Path(MAF_CACHE_DIR).exists()
vcf_fasta_exists = Path(VCF_FASTA).exists()
maf_fasta_exists = Path(MAF_FASTA).exists()
print("File availability check:")
print(f"VCF file: {'✓' if vcf_exists else '✗'}")
print(f"MAF file: {'✓' if maf_exists else '✗'}")
print(f"VCF cache: {'✓' if vcf_cache_exists else '✗'}")
print(f"MAF cache: {'✓' if maf_cache_exists else '✗'}")
print(f"VCF FASTA: {'✓' if vcf_fasta_exists else '✗'}")
print(f"MAF FASTA: {'✓' if maf_fasta_exists else '✗'}")
File availability check:
VCF file: ✓
MAF file: ✓
VCF cache: ✓
MAF cache: ✓
VCF FASTA: ✓
MAF FASTA: ✓
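The same existence checks can be expressed more compactly by keeping the labels and paths in one mapping. This is only a sketch of an alternative layout: the helper `missing_resources` and the example paths are hypothetical and not part of pyMut.

```python
from pathlib import Path


def missing_resources(resources: dict[str, str]) -> list[str]:
    """Return the labels whose paths do not exist on disk."""
    return [label for label, path in resources.items() if not Path(path).exists()]


# Illustrative mapping: the first path is made up, the second is the current directory.
example = {
    "VCF file": "/nonexistent/example.vcf",
    "Working directory": ".",
}
missing = missing_resources(example)
print(f"Missing: {', '.join(missing) if missing else 'nothing'}")
```

With the notebook's own constants, the mapping would hold `VCF_FILE`, `VCF_CACHE_DIR`, `VCF_FASTA`, and their MAF counterparts, and the returned list could feed the "Missing: ..." messages printed in the annotation cells.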
Part 1: VCF File VEP Annotation
We'll annotate the VCF file using VEP with protein, gene, and variant class annotations.
In [3]:
if vcf_exists and vcf_cache_exists and vcf_fasta_exists:
    try:
        # Perform VEP annotation on VCF file
        # We'll annotate with protein, gene, and variant class information
        success, result = wrap_vcf_vep_annotate_unified(
            VCF_FILE,
            VCF_CACHE_DIR,
            VCF_FASTA,
            annotate_protein=True,
            annotate_gene=True,
            annotate_variant_class=True,
            no_stats=True  # Minimize output noise
        )
        if success:
            print("✓ VCF annotation completed successfully")
            print(f"Output: {result}")
        else:
            print("✗ VCF annotation failed")
            print(f"Error: {result}")
    except Exception as e:
        print(f"✗ Error during VCF annotation: {e}")
else:
    missing_files = []
    if not vcf_exists:
        missing_files.append("VCF file")
    if not vcf_cache_exists:
        missing_files.append("VCF cache directory")
    if not vcf_fasta_exists:
        missing_files.append("VCF FASTA file")
    print(f"✗ Cannot perform VCF annotation. Missing: {', '.join(missing_files)}")
2025-08-01 01:22:55,133 | INFO | pyMut.annotate.vep_annotate | Starting unified VEP annotation for VCF file: ../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf
2025-08-01 01:22:55,134 | INFO | pyMut.annotate.vep_annotate | Extracted from cache: assembly=GRCh38, version=114
2025-08-01 01:22:55,136 | INFO | pyMut.annotate.vep_annotate | Auto-constructed chr synonyms path: ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh38/homo_sapiens/114_GRCh38/chr_synonyms.txt
2025-08-01 01:22:55,137 | INFO | pyMut.annotate.vep_annotate | Running unified VEP annotation: vep --input_file ../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf --vcf --offline --cache --cache_version 114 --dir_cache ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh38 --assembly GRCh38 --synonyms ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh38/homo_sapiens/114_GRCh38/chr_synonyms.txt --fasta ../../../src/pyMut/data/resources/genome/GRCh38/GRCh38.p14.genome.fa --pick --force_overwrite --output_file ../../../src/pyMut/data/examples/VCF/vep_annotation_01220108/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf --protein --uniprot --domains --symbol --variant_class --no_stats
2025-08-01 01:22:56,797 | INFO | pyMut.annotate.vep_annotate | Unified VEP annotation completed successfully
✓ VCF annotation completed successfully
Output: VEP output file: ../../../src/pyMut/data/examples/VCF/vep_annotation_01220108/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf
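When VEP writes VCF output (the `--vcf` flag in the command above), the annotations are stored in the `CSQ` key of the INFO column as pipe-separated values, and the order of those values is declared in the `##INFO=<ID=CSQ,...>` header line. The sketch below shows one way to decode that field with the standard library. The helper names are hypothetical, and the header and record are made-up examples; the real field list depends on the flags VEP was run with.

```python
import re


def csq_fields(header_lines):
    """Extract the CSQ subfield names from VCF header lines."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).split("|")
    return []


def parse_csq(info, fields):
    """Return one dict per CSQ entry found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            return [dict(zip(fields, entry.split("|"))) for entry in entries]
    return []


# Made-up header and INFO value, for illustration only.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|VARIANT_CLASS">',
]
fields = csq_fields(header)
records = parse_csq("AC=1;CSQ=T|missense_variant|GENE1|SNV", fields)
print(records[0]["SYMBOL"], records[0]["VARIANT_CLASS"])
```

Because the command uses `--pick`, each variant is expected to carry a single CSQ entry, but the parser still splits on commas so it also handles files produced without that flag.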
Part 2: MAF File VEP Annotation
We'll annotate the MAF file using VEP with protein-level annotations.
In [4]:
if maf_exists and maf_cache_exists and maf_fasta_exists:
    try:
        success, result = wrap_maf_vep_annotate_protein(
            MAF_FILE,
            MAF_CACHE_DIR,
            MAF_FASTA
        )
        if success:
            print("✓ MAF annotation completed successfully")
            print(f"Output: {result}")
        else:
            print("✗ MAF annotation failed")
            print(f"Error: {result}")
    except Exception as e:
        print(f"✗ Error during MAF annotation: {e}")
else:
    missing_files = []
    if not maf_exists:
        missing_files.append("MAF file")
    if not maf_cache_exists:
        missing_files.append("MAF cache directory")
    if not maf_fasta_exists:
        missing_files.append("MAF FASTA file")
    print(f"✗ Cannot perform MAF annotation. Missing: {', '.join(missing_files)}")
2025-08-01 01:22:56,807 | INFO | pyMut.annotate.vep_annotate | Converting MAF file to region format: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 01:22:56,808 | INFO | pyMut.annotate.vep_annotate | Converting MAF to region format: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz -> ../../../src/pyMut/data/examples/MAF/tcga_laml.region
2025-08-01 01:22:56,815 | INFO | pyMut.annotate.vep_annotate | Successfully converted MAF to region format: ../../../src/pyMut/data/examples/MAF/tcga_laml.region
2025-08-01 01:22:56,815 | INFO | pyMut.annotate.vep_annotate | Successfully converted MAF to region format: ../../../src/pyMut/data/examples/MAF/tcga_laml.region
2025-08-01 01:22:56,815 | INFO | pyMut.annotate.vep_annotate | Extracted from cache: assembly=GRCh37, version=114
2025-08-01 01:22:56,816 | INFO | pyMut.annotate.vep_annotate | Auto-constructed chr synonyms path: ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh37/homo_sapiens/114_GRCh37/chr_synonyms.txt
2025-08-01 01:22:56,816 | INFO | pyMut.annotate.vep_annotate | Running VEP annotation: vep --input_file ../../../src/pyMut/data/examples/MAF/tcga_laml.region --format region --offline --cache --cache_version 114 --dir_cache ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh37 --assembly GRCh37 --synonyms ../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh37/homo_sapiens/114_GRCh37/chr_synonyms.txt --fasta ../../../src/pyMut/data/resources/genome/GRCh37/GRCh37.p13.genome.fa --protein --uniprot --domains --symbol --pick --keep_csq --force_overwrite --no_stats --output_file ../../../src/pyMut/data/examples/MAF/vep_annotation_01220108/tcga_laml.maf_vep_protein.txt
2025-08-01 01:23:29,464 | INFO | pyMut.annotate.vep_annotate | VEP annotation completed successfully
2025-08-01 01:23:29,464 | INFO | pyMut.annotate.vep_annotate | Merging VEP annotations with original MAF file...
2025-08-01 01:23:29,464 | INFO | pyMut.utils.merge_vep_annotation | Reading MAF file: ../../../src/pyMut/data/examples/MAF/tcga_laml.maf.gz
2025-08-01 01:23:29,471 | INFO | pyMut.utils.merge_vep_annotation | MAF file loaded: 2207 rows, 17 columns
2025-08-01 01:23:29,471 | INFO | pyMut.utils.merge_vep_annotation | Reading VEP file: ../../../src/pyMut/data/examples/MAF/vep_annotation_01220108/tcga_laml.maf_vep_protein.txt
2025-08-01 01:23:29,481 | INFO | pyMut.utils.merge_vep_annotation | VEP file loaded: 2206 rows, 14 columns
2025-08-01 01:23:29,481 | INFO | pyMut.utils.merge_vep_annotation | Creating region keys for MAF data...
2025-08-01 01:23:29,501 | INFO | pyMut.utils.merge_vep_annotation | Parsing VEP Extra column...
2025-08-01 01:23:29,517 | INFO | pyMut.utils.merge_vep_annotation | Filtered to 2206 meaningful annotations
2025-08-01 01:23:29,518 | INFO | pyMut.utils.merge_vep_annotation | Removing VEP duplicates...
2025-08-01 01:23:29,520 | INFO | pyMut.utils.merge_vep_annotation | Removed 116 duplicate VEP entries
2025-08-01 01:23:29,520 | INFO | pyMut.utils.merge_vep_annotation | Performing optimized merge with DuckDB...
2025-08-01 01:23:29,586 | INFO | pyMut.utils.merge_vep_annotation | Merge completed: 2207 rows, 38 columns
2025-08-01 01:23:29,586 | INFO | pyMut.utils.merge_vep_annotation | Saving annotated file to: ../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz
2025-08-01 01:23:29,629 | INFO | pyMut.annotate.vep_annotate | Successfully merged VEP annotations. Merged file: ../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz
2025-08-01 01:23:29,630 | INFO | pyMut.annotate.vep_annotate | Cleaned up temporary region file: ../../../src/pyMut/data/examples/MAF/tcga_laml.region
✓ MAF annotation completed successfully
Output: VEP folder: ../../../src/pyMut/data/examples/MAF/vep_annotation_01220108/tcga_laml.maf_vep_protein.txt, Merged file: ../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz
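The log above mentions a "Parsing VEP Extra column" step. In VEP's default tab-delimited output, the Extra column packs additional annotations as semicolon-separated `KEY=value` pairs, with `-` standing in for an empty value. The sketch below shows one plausible way to unpack such a string; it is not pyMut's implementation, and the helper name and sample value are hypothetical.

```python
def parse_extra(extra: str) -> dict[str, str]:
    """Split a VEP Extra string such as 'A=1;B=2' into a dictionary."""
    result = {}
    for pair in extra.split(";"):
        # Skip empty placeholders such as "-" that carry no key=value pair.
        if "=" in pair:
            key, value = pair.split("=", 1)
            result[key] = value
    return result


# Made-up Extra value, for illustration only.
sample = "IMPACT=MODERATE;SYMBOL=GENE1;DOMAINS=Pfam:PF00001"
annotations = parse_extra(sample)
print(annotations["SYMBOL"], annotations["IMPACT"])
```

Splitting on the first `=` only keeps values that themselves contain `=` intact, which is a cautious choice for free-form annotation values.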
Summary
This notebook demonstrated VEP annotation for both VCF and MAF files:
VCF Annotation
- Input: VCF file with GRCh38 coordinates
- Annotations: Protein effects, gene information, and variant classifications
- Function: wrap_vcf_vep_annotate_unified()
MAF Annotation
- Input: MAF file with GRCh37 coordinates
- Annotations: Protein-level effects
- Function: wrap_maf_vep_annotate_protein()
Key Points
- VEP requires appropriate cache directories and FASTA reference files
- Different genome assemblies (GRCh37/GRCh38) require corresponding resources
- The annotation functions handle file format conversion and VEP execution automatically
- Output files contain the original data plus VEP annotation columns
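Because a mismatch between the cache assembly and the FASTA assembly is easy to introduce when paths are edited by hand, a small consistency check can be useful. The sketch below assumes the cache directory follows the `homo_sapiens_vep_<version>_<assembly>` naming used in this notebook and that the FASTA file name contains the assembly name. Both helpers are hypothetical and are not how pyMut extracts this information.

```python
import re
from pathlib import Path

# Assumed naming pattern, e.g. "homo_sapiens_vep_114_GRCh38".
CACHE_NAME = re.compile(r"_vep_(?P<version>\d+)_(?P<assembly>GRCh\d+)$")


def cache_info(cache_dir):
    """Return (assembly, version) parsed from a cache directory name, or None."""
    match = CACHE_NAME.search(Path(cache_dir).name)
    if match is None:
        return None
    return match.group("assembly"), int(match.group("version"))


def assemblies_match(cache_dir, fasta_path):
    """True if the assembly in the cache name also appears in the FASTA file name."""
    info = cache_info(cache_dir)
    return info is not None and info[0] in Path(fasta_path).name


cache = "../../../src/pyMut/data/resources/vep/homo_sapiens_vep_114_GRCh38"
fasta = "../../../src/pyMut/data/resources/genome/GRCh38/GRCh38.p14.genome.fa"
print(cache_info(cache), assemblies_match(cache, fasta))
```

The check works on names only and never opens the files, so it catches path mix-ups but not a mislabeled FASTA.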
For more detailed VEP configuration options, refer to the function documentation and VEP official documentation.