Read VCF Files

read_vcf¶

Short description¶

High-performance reader that converts a VCF (or VCF-GZ) file into a PyMutation object, with PyArrow acceleration, automatic caching and optional Tabix indexing.

Signature¶

def read_vcf(
    path: str | pathlib.Path,
    assembly: str,
    create_index: bool = False,
    cache_dir: str | pathlib.Path | None = None
) -> PyMutation:

Parameters¶

Parameter	Type	Required	Description
`path`	`str \\| Path`	Yes	Path to the VCF (plain or `*.vcf.gz`).
`assembly`	`str`	Yes	Genome build identifier, must be `"37"` or `"38"`.
`create_index`	`bool`	No	If `True`, create a Tabix (`.tbi`) index if missing (requires `tabix` in `PATH`). Default `False`.
`cache_dir`	`str \\| Path \\| None`	No	Directory where parsed Parquet caches are stored. `None` (default) writes next to the VCF in `.pymut_cache/`.

Return value¶

PyMutation — a wide-format table of variants plus metadata and sample columns, ready for downstream analysis.

Exceptions¶

FileNotFoundError – VCF file does not exist.
ValueError – invalid assembly, missing required VCF columns, or header problems.
Exception – any other I/O or parsing error (e.g. broken compression, PyArrow failure).

Minimal usage example¶

from pymutation.io import read_vcf

pymut = read_vcf(
    "tumour.vcf.gz",
    assembly="38",
    create_index=True          # build Tabix if needed
)

print(pymut.data.shape)

Standard VCF Columns¶

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SAMPLE_001 | SAMPLE_002
chr1  | 100 | .  | A   | G   | 60   | PASS   | ...  | GT:DP  | 0/1:30    | 1/1:25

Conversion to pyMut Format¶

CHROM | POS | ID | REF | ALT | QUAL | FILTER | SAMPLE_001 | SAMPLE_002 | INFO_parsed
chr1  | 100 | .  | A   | G   | 60   | PASS   | A|G        | G|G        | {...}

Complete Example¶

from pyMut.input import read_vcf
import logging

# Enable logging to monitor progress
logging.basicConfig(level=logging.INFO)

# Load a VCF file with all options enabled
py_mut = read_vcf(
    path="src/pyMut/data/examples/ALL.chr10.vcf.gz",
    fasta="reference/hg38.fasta",
    create_index=True,
    cache_dir="cache/"
)

# Verify that the file was loaded successfully
print(f"Loaded samples: {len(py_mut.samples)}")
print(f"Total variants: {len(py_mut.data)}")
print(f"Unique chromosomes: {py_mut.data['CHROM'].unique()}")

# Genotype information
print(f"Sample columns: {py_mut.samples[:5]}...")  # First 5 samples

# Check metadata
print(f"Source format: {py_mut.metadata.source_format}")
print(f"FASTA file: {py_mut.metadata.fasta}")

Genotype Handling¶

The function automatically converts VCF genotypes into pyMut’s allelic format:

VCF Genotypes (input)¶

FORMAT: GT:DP:GQ
SAMPLE_001: 0/1:30:99    # Heterozygous
SAMPLE_002: 1/1:25:99    # Homozygous alternate
SAMPLE_003: 0/0:35:99    # Homozygous reference

pyMut Format (output)¶

SAMPLE_001: A|G    # REF|ALT
SAMPLE_002: G|G    # ALT|ALT  
SAMPLE_003: A|A    # REF|REF

Cache System¶

# First load — cache is created
py_mut1 = read_vcf("large_file.vcf.gz", cache_dir="cache/")

# Second load — cache is used (much faster)
py_mut2 = read_vcf("large_file.vcf.gz", cache_dir="cache/")