Skip to main content

RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.

Project description

RFMix-reader

RFMix-reader is a Python package for efficiently reading and processing output files generated by RFMix, a widely used tool for estimating local ancestry in admixed populations.
It employs a lazy loading approach to minimize memory usage, and leverages GPU acceleration for major speedups when available.


Installation

rfmix-reader requires Python 3.11+. Install from PyPI:

pip install rfmix-reader

Installation Options

  • Basic install (CPU only):

    pip install rfmix-reader
    
  • With GPU acceleration (cupy, cudf, dask-cudf, torch):

    pip install rfmix-reader[gpu]
    

    The default GPU extra targets CUDA 12 wheels (cupy-cuda12x, cudf-cu12, dask-cudf-cu12). Use the appropriate CUDA build strings if your environment requires a different version.

  • With documentation tools (sphinx, sphinx-rtd-theme):

    pip install rfmix-reader[docs]
    
  • With testing tools (pytest):

    pip install rfmix-reader[tests]
    

GPU Notes

  • PyTorch is not installed in the base package. It is pulled in only when you install the [gpu] extra (pip install rfmix-reader[gpu]) or add it yourself.
  • Install a PyTorch wheel that matches your platform and CUDA version using the official selector. For example:
    • CUDA 12 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu121
    • CUDA 11 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu118
    • CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
  • RAPIDS (cudf, cupy) wheels are version- and CUDA-specific. See the RAPIDS install guide.
  • CPU-only installations will still run efficiently, just without GPU acceleration.

Quickstart

from rfmix_reader import read_rfmix

# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)

print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
"""See the phasing section below for how to phase per-chromosome outputs and
write them to Zarr with `phase_rfmix_chromosome_to_zarr`."""

Key Features

  • Lazy Loading: Reads data on-the-fly, reducing memory footprint.
  • Efficient Access: Query specific loci or regions of interest.
  • Seamless Integration: Works smoothly with pandas, dask, and other analysis tools.
  • Loci Imputation: Impute local ancestry loci to dense genotype variant sites.
  • GPU Acceleration: Automatic CUDA acceleration via PyTorch/CuPy when available.

Simulation Data

Test datasets for two- and three-population admixture are available on Synapse: Synapse Project syn61691659.


Usage

Binary Conversion

RFMix does not generate binary files directly. Use create_binaries to generate them (also available as a CLI):

create-binaries two_pops/out/
from rfmix_reader import create_binaries

create_binaries("two_pops/out/", binary_dir="./binary_files")

Preparing Reference Data for Phasing

Use prepare-reference to convert bgzipped, indexed reference VCF/BCF files into per-chromosome VCF-Zarr stores that the phasing pipeline consumes. The command writes one <chrom>.zarr directory per input file.

prepare-reference -h
usage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]
                         [--samples-chunk-size SAMPLES_CHUNK_SIZE]
                         [--worker-processes WORKER_PROCESSES]
                         [--verbose | --no-verbose] [--version]
                         output_dir vcf_paths [vcf_paths ...]

Convert one or more bgzipped reference VCF/BCF files into Zarr stores.

positional arguments:
  output_dir            Directory where the Zarr outputs will be written.
  vcf_paths             Paths to reference VCF/BCF files (bgzipped and
                        indexed).

options:
  -h, --help            show this help message and exit
  --chunk-length CHUNK_LENGTH
                        Genomic chunk size for the output Zarr stores
                        (default: 100000).
  --samples-chunk-size SAMPLES_CHUNK_SIZE
                        Chunk size for samples in the output Zarr stores
                        (default: library chosen).
  --worker-processes WORKER_PROCESSES
                        Number of worker processes to use for conversion
                        (default: 0, use library default).
  --verbose, --no-verbose
                        Print progress messages (default: enabled).
  --version             Show the version of the program and exit.

Example data preparation:

# Sample annotations: two columns (no header): sample_id<TAB>group
cat > sample_annotations.tsv <<'EOF'
NA19700	AFR
NA19701	AFR
NA20847	EUR
EOF

# Convert chromosome VCFs into a reference store directory
prepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \
  --chunk-length 50000 --samples-chunk-size 512

Main Function

Once binaries are available, process RFMix results:

from rfmix_reader import read_rfmix

loci, g_anc, admix = read_rfmix("two_pops/out/")

Three Population Example

Binaries can also be generated on-the-fly within read_rfmix with generate_binary set to True.

loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
                                binary_dir="./binary_files",
                                generate_binary=True)

Phasing data

For optimal memory and computational speed, phasing is done per chromosome.

from rfmix_reader import phase_rfmix_chromosome_to_zarr

# Use the reference store + annotations during phasing
admix = phase_rfmix_chromosome_to_zarr(
    file_prefix="two_pops/out/",
    ref_zarr_root="refs",
    sample_annot_path="sample_annotations.tsv",
    output_path="./phased_chr21.zarr",
    chrom="21",
)

The chunking is suboptimal for phasing, so remember to rechunk before using for optimal processing. This is only needed when loading individual chromosomes. The merged data is already optimized.

local_array = admix["local_ancestry"].chunk({"variant": 20000, "sample": 100})
# Compute into memory, if needed
local_array = local_array.compute()

This also saves the data to Zarr for later merging or data processing.

merge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarr

Loci Imputation

The imputation workflow now lives in rfmix_reader.processing.imputation and is exported as interpolate_array. It interpolates the local ancestry matrix onto a denser variant grid and writes the result to <zarr_outdir>/local-ancestry.zarr as a Zarr array shaped (variants, samples, ancestries).

Inputs

  • variant_loci_df: a pandas DataFrame defining the variant grid. Provide at least chrom/pos and an i column that points to the source RFMix row index; rows with i set to NaN are treated as missing loci to interpolate. Sort the frame by genomic coordinate, and include pos if you plan to interpolate in base-pair space.
  • admix: the local ancestry Dask array returned by read_rfmix (shape (loci, samples, ancestries)).
  • zarr_outdir: an output directory where the new local-ancestry.zarr store will be created.

Key options

  • interpolation: "linear" (default), "nearest", or "stepwise".
  • use_bp_positions: set to True to interpolate along variant_loci_df['pos'] rather than treating loci as equally spaced indices.
  • chunk_size/batch_size: tune how many rows are materialized at a time when filling and interpolating the Zarr array.

Workflow example

import pandas as pd
from pathlib import Path
from rfmix_reader import interpolate_array, read_rfmix

# Load RFMix loci and local ancestry
loci_df, _, admix = read_rfmix("two_pops/out/", binary_dir="./binary_files")

# Build the variant grid by merging genotype sites with the RFMix loci index
variants = pd.read_parquet("genotypes/variants.parquet")  # must include chrom/pos
variants = variants.drop_duplicates(subset=["chrom", "pos"]).sort_values("pos")
variant_loci_df = (
    variants.merge(loci_df.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)
            .loc[:, ["chrom", "pos", "i", "_merge"]]
)

z = interpolate_array(
    variant_loci_df,
    admix,
    zarr_outdir=Path("./imputed_local_ancestry"),
    interpolation="linear",
    use_bp_positions=True,
    chunk_size=50_000,
)
print(z)

The interpolator uses GPU acceleration transparently when cupy and a CUDA PyTorch build are available; otherwise it falls back to NumPy. All methods other than "hmm" operate on diploid-summed trajectories and preserve the original ancestry dimension.

Phasing workflow (rfmix_reader.processing.phase)

rfmix_reader.processing.phase implements gnomix-style tail-flip corrections for local ancestry haplotypes. Phasing now lives outside read_rfmix so you can process each chromosome independently and write outputs directly to Zarr.

Required reference inputs

  • VCF-Zarr reference (ref_zarr_root): either a single *.zarr store or a directory containing per-chromosome stores (e.g., 1.zarr, chr1.zarr).
  • Sample annotations (sample_annot_path): two-column file mapping sample_id to group (ancestry label). One representative sample per group is pulled to build reference haplotypes.

Per-chromosome pipeline

  1. (Optional) generate RFMix binary caches with create_binaries.
  2. Call phase_rfmix_chromosome_to_zarr for each chromosome you want to process.
  3. Optionally concatenate those per-chromosome Zarr stores with merge_phased_zarrs.
from rfmix_reader.processing.phase import (
    PhasingConfig,
    merge_phased_zarrs,
    phase_rfmix_chromosome_to_zarr,
)

# Phase chromosome 21 and write to Zarr
dataset = phase_rfmix_chromosome_to_zarr(
    file_prefix="examples/two_populations/out/",
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    output_path="/tmp/phased_chr21.zarr",
    chrom="21",
)

# Merge multiple per-chromosome Zarr stores
merged = merge_phased_zarrs(
    ["/tmp/phased_chr21.zarr", "/tmp/phased_chr22.zarr"],
    output_path="/tmp/phased_all.zarr",
)

If you need fine-grained control, you can still start from unphased outputs and call phase_admix_dask_with_index directly:

from rfmix_reader import read_rfmix
from rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index

loci_df, g_anc, admix, X_raw = read_rfmix(
    "examples/two_populations/out/",
    return_original=True,
    chrom="21",
)

config = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)
phased = phase_admix_dask_with_index(
    admix=admix,
    X_raw=X_raw,
    positions=loci_df.physical_position.to_numpy(),
    chrom=str(loci_df.chromosome.iloc[0]),
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    config=config,
)

Reading Haptools simulations

Use read_simu to load BGZF-compressed VCF files created by haptools simgenotype --pop_field:

from rfmix_reader import read_simu

loci_df, g_anc, admix = read_simu("/path/to/simulations/")

Haptools does not include the chromosome length in the ##contig header lines, but read_simu requires that metadata to index each VCF. Copy the contigs.txt file Haptools generates from the FASTA you used for simulation and reheader every file with the appropriate contig entry before calling read_simu. The following snippet shows one approach using bcftools and tabix:

CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"

CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
    echo "ERROR: No contig line found for ${CHR} in $CONTIGS"
    exit 1
fi

bcftools view -h "$IN" \
    | sed "s/^##contig=<ID=${CHR}>.*/${CONTIG_LINE}/" > header.${CHR}.tmp
bcftools reheader -h header.${CHR}.tmp -o "$OUT" "$IN"
tabix -p vcf "$OUT"

Visualization

read_rfmix, read_flare, and read_simu all return the same (loci_df, g_anc, admix) tuple, so the plotting utilities in rfmix_reader._visualization work identically for RFMix, FLARE, and Haptools-simulated inputs. The snippet below shows the typical workflow for each reader:

from rfmix_reader import (
    plot_ancestry_by_chromosome,
    plot_global_ancestry,
    read_flare,
    read_rfmix,
    read_simu,
)

# RFMix run directory
loci_df, g_anc, admix = read_rfmix("two_pops/out/")
plot_global_ancestry(g_anc, save_path="rfmix_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="rfmix_local.png")

# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)
loci_df, g_anc, admix = read_flare("flare_runs/chr1/")
plot_global_ancestry(g_anc, save_path="flare_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="flare_local.png")

# Haptools simulations (after reheadering contigs)
loci_df, g_anc, admix = read_simu("/path/to/simulations/")
plot_global_ancestry(g_anc, save_path="simu_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="simu_local.png")

plot_global_ancestry builds per-individual stacked bars of global ancestry while plot_ancestry_by_chromosome summarizes local ancestry along each chromosome, giving you quick visual QC for every supported input format.


Development Install

For contributors:

git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"

Citation

If you use this software, please cite:

DOI

Benjamin, K. J. M. (2024). RFMix-reader (Version 0.2.0) [Computer software]. https://github.com/heart-gen/rfmix_reader

Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv (2024). DOI: 10.1101/2024.07.13.603370.


Funding

This work was supported by the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rfmix_reader-0.3.1.tar.gz (83.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rfmix_reader-0.3.1-py3-none-any.whl (92.3 kB view details)

Uploaded Python 3

File details

Details for the file rfmix_reader-0.3.1.tar.gz.

File metadata

  • Download URL: rfmix_reader-0.3.1.tar.gz
  • Upload date:
  • Size: 83.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.0 CPython/3.10.9 Linux/4.18.0-553.22.1.el8_10.x86_64

File hashes

Hashes for rfmix_reader-0.3.1.tar.gz
Algorithm Hash digest
SHA256 f5d5cff220d92b551d64e402c59723397b105cfa7d7e87a0fe53ecdcd8b35d52
MD5 06683b4528fd1a805eded22de2846907
BLAKE2b-256 86aa021c8ee7cf0378c6e3e06a7e4b730e152383cce5c5d6fd5f95345475cc96

See more details on using hashes here.

File details

Details for the file rfmix_reader-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: rfmix_reader-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 92.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.0 CPython/3.10.9 Linux/4.18.0-553.22.1.el8_10.x86_64

File hashes

Hashes for rfmix_reader-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2841077f353a1c007890f006ab90704772df7b6383b9c09e123e3dcdc114f6cf
MD5 bc0a64da634e00d98f89ccfd227162ae
BLAKE2b-256 2a69933a178cabe1b3e1002d9a7630e24224e6e1cb38f03ca6e214018fe7f67b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page