Read VCF Files
read_vcf¶
Short description¶
High-performance reader that converts a VCF (or VCF-GZ) file into a PyMutation
object, with PyArrow acceleration, automatic caching and optional Tabix indexing.
Signature¶
def read_vcf(
    path: str | pathlib.Path,
    assembly: str,
    create_index: bool = False,
    cache_dir: str | pathlib.Path | None = None
) -> PyMutation:
Parameters¶
Parameter | Type | Required | Description |
---|---|---|---|
path | str \| Path | Yes | Path to the VCF (plain or *.vcf.gz). |
assembly | str | Yes | Genome build identifier; must be "37" or "38". |
create_index | bool | No | If True, create a Tabix (.tbi) index if missing (requires tabix in PATH). Default False. |
cache_dir | str \| Path \| None | No | Directory where parsed Parquet caches are stored. None (default) writes next to the VCF in .pymut_cache/. |
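The constraints in the table can be checked before calling the reader. The following is a minimal sketch, not part of pyMut: `check_assembly` and `resolve_cache_dir` are hypothetical helper names, and the default cache location is assumed to be exactly `.pymut_cache/` in the VCF's parent directory, as the table describes.

```python
from pathlib import Path

# Values the table lists as valid for `assembly`.
VALID_ASSEMBLIES = {"37", "38"}


def check_assembly(assembly):
    """Hypothetical pre-check: reject anything other than "37" or "38"."""
    if assembly not in VALID_ASSEMBLIES:
        raise ValueError(f'assembly must be "37" or "38", got {assembly!r}')
    return assembly


def resolve_cache_dir(vcf_path, cache_dir=None):
    """Hypothetical helper: where the Parquet cache is expected to live.

    Assumes the documented default, a .pymut_cache/ folder next to the VCF.
    """
    if cache_dir is not None:
        return Path(cache_dir)
    return Path(vcf_path).parent / ".pymut_cache"


print(resolve_cache_dir("data/tumour.vcf.gz"))
print(resolve_cache_dir("data/tumour.vcf.gz", cache_dir="cache/"))
```

Running a check like this first gives an early, readable error for a mistyped build such as "hg38" instead of a failure partway through parsing.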
Return value¶
PyMutation: a wide-format table of variants plus metadata and sample columns, ready for downstream analysis.
Exceptions¶
FileNotFoundError – VCF file does not exist.
ValueError – invalid assembly, missing required VCF columns, or header problems.
Exception – any other I/O or parsing error (e.g. broken compression, PyArrow failure).
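One way to handle the first two exception types is a thin wrapper around the reader. This is a sketch under assumptions: `load_vcf_safely` is a hypothetical name, and the reader is passed in as an argument so the pattern can be tried without pyMut installed. In practice `reader` would be `read_vcf`.

```python
import logging

logger = logging.getLogger(__name__)


def load_vcf_safely(reader, path, assembly, **kwargs):
    """Call reader(path, assembly=..., **kwargs) and return its result.

    Returns None for the two documented, recoverable failures
    (missing file, invalid input). Any other exception propagates.
    """
    try:
        return reader(path, assembly=assembly, **kwargs)
    except FileNotFoundError:
        logger.error("VCF file not found: %s", path)
    except ValueError as exc:
        logger.error("Invalid assembly or malformed VCF %s: %s", path, exc)
    return None
```

Letting the generic `Exception` case propagate is a deliberate choice here: broken compression or a PyArrow failure usually needs investigation rather than a silent `None`.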
Minimal usage example¶
from pyMut.input import read_vcf

pymut = read_vcf(
    "tumour.vcf.gz",
    assembly="38",
    create_index=True  # build the Tabix index if it is missing
)
print(pymut.data.shape)
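Because `create_index=True` requires `tabix` in `PATH`, it can help to check for the executable first. The sketch below uses only the standard library; `tabix_available` is a hypothetical helper, and the `read_vcf` call is left commented out so the snippet runs without pyMut or an input file.

```python
import shutil


def tabix_available():
    """Return True if a `tabix` executable can be found on PATH."""
    return shutil.which("tabix") is not None


# Only request index creation when the tool is actually installed.
create_index = tabix_available()
print(f"tabix found: {create_index}")

# pymut = read_vcf("tumour.vcf.gz", assembly="38", create_index=create_index)
```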
Standard VCF Columns¶
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SAMPLE_001 | SAMPLE_002
chr1 | 100 | . | A | G | 60 | PASS | ... | GT:DP | 0/1:30 | 1/1:25
Conversion to pyMut Format¶
CHROM | POS | ID | REF | ALT | QUAL | FILTER | SAMPLE_001 | SAMPLE_002 | INFO_parsed
chr1 | 100 | . | A | G | 60 | PASS | A|G | G|G | {...}
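The `INFO_parsed` column holds the INFO field as a mapping. How pyMut builds it is not shown here, so the following is only an illustrative sketch of parsing a standard VCF INFO string (semicolon-separated `key=value` pairs, with bare keys acting as flags). `parse_info` is a hypothetical name, and values are kept as strings rather than typed.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict (illustrative, not pyMut's code).

    "DP=30;AF=0.5;DB" becomes {"DP": "30", "AF": "0.5", "DB": True}.
    A missing INFO field (".") yields an empty dict.
    """
    if info in ("", "."):
        return {}
    parsed = {}
    for field in info.split(";"):
        if not field:
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = field.partition("=")
        # Keys without "=" are flags; record them as True.
        parsed[key] = value if sep else True
    return parsed


print(parse_info("DP=30;AF=0.5;DB"))
```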
Complete Example¶
from pyMut.input import read_vcf
import logging
# Enable logging to monitor progress
logging.basicConfig(level=logging.INFO)
# Load a VCF file with all options enabled
py_mut = read_vcf(
    path="src/pyMut/data/examples/ALL.chr10.vcf.gz",
    assembly="38",
    create_index=True,
    cache_dir="cache/"
)
# Verify that the file was loaded successfully
print(f"Loaded samples: {len(py_mut.samples)}")
print(f"Total variants: {len(py_mut.data)}")
print(f"Unique chromosomes: {py_mut.data['CHROM'].unique()}")
# Genotype information
print(f"Sample columns: {py_mut.samples[:5]}...") # First 5 samples
# Check metadata
print(f"Source format: {py_mut.metadata.source_format}")
print(f"FASTA file: {py_mut.metadata.fasta}")
Genotype Handling¶
The function automatically converts VCF genotypes into pyMut’s allelic format:
VCF Genotypes (input)¶
FORMAT: GT:DP:GQ
SAMPLE_001: 0/1:30:99 # Heterozygous
SAMPLE_002: 1/1:25:99 # Homozygous alternate
SAMPLE_003: 0/0:35:99 # Homozygous reference
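The mapping from GT indices to alleles can be sketched as follows. This is an illustration, not pyMut's implementation: `gt_to_alleles` and `sample_to_alleles` are hypothetical names, the output separator is assumed to be `|` (as in the `A|G` example in the conversion table), and keeping `.` for missing calls is an assumption.

```python
import re


def gt_to_alleles(gt, ref, alt):
    """Translate a GT string such as "0/1" into bases such as "A|G".

    Index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    """
    alleles = [ref] + alt.split(",")
    calls = re.split(r"[/|]", gt)  # accept unphased (/) and phased (|) input
    bases = [c if c == "." else alleles[int(c)] for c in calls]
    return "|".join(bases)


def sample_to_alleles(fmt, sample, ref, alt):
    """Pick GT out of a sample field using the FORMAT keys, then translate it."""
    values = dict(zip(fmt.split(":"), sample.split(":")))
    return gt_to_alleles(values["GT"], ref, alt)


for sample in ("0/1:30:99", "1/1:25:99", "0/0:35:99"):
    print(sample, "->", sample_to_alleles("GT:DP:GQ", sample, ref="A", alt="G"))
```

With REF `A` and ALT `G`, the three samples above come out as `A|G`, `G|G` and `A|A` respectively.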