RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.

These details have not been verified by PyPI

Project links

Project description

RFMix-reader

RFMix-reader is a Python package for efficiently reading and processing output files generated by RFMix, a widely used tool for estimating local ancestry in admixed populations.
It employs a lazy loading approach to minimize memory usage, and leverages GPU acceleration for major speedups when available.

Installation

rfmix-reader requires Python 3.11+. Install from PyPI:

pip install rfmix-reader

Installation Options

Basic install (CPU only):
```
pip install rfmix-reader
```
With GPU acceleration (cupy, cudf, dask-cudf, torch):
```
pip install rfmix-reader[gpu]
```
The default GPU extra targets CUDA 12 wheels (cupy-cuda12x, cudf-cu12, dask-cudf-cu12). Use the appropriate CUDA build strings if your environment requires a different version.
With documentation tools (sphinx, sphinx-rtd-theme):
```
pip install rfmix-reader[docs]
```
With testing tools (pytest):
```
pip install rfmix-reader[tests]
```

GPU Notes

PyTorch is not installed in the base package. It is pulled in only when you install the [gpu] extra (pip install rfmix-reader[gpu]) or add it yourself.
Install a PyTorch wheel that matches your platform and CUDA version using the official selector. For example:
- CUDA 12 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu121
- CUDA 11 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu118
- CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
RAPIDS (cudf, cupy) wheels are version- and CUDA-specific. See the RAPIDS install guide.
CPU-only installations will still run efficiently, just without GPU acceleration.

Quickstart

from rfmix_reader import read_rfmix

# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)

print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
"""See the phasing section below for how to phase per-chromosome outputs and
write them to Zarr with `phase_rfmix_chromosome_to_zarr`."""

Key Features

Lazy Loading: Reads data on-the-fly, reducing memory footprint.
Efficient Access: Query specific loci or regions of interest.
Seamless Integration: Works smoothly with pandas, dask, and other analysis tools.
Loci Imputation: Impute local ancestry loci to dense genotype variant sites.
GPU Acceleration: Automatic CUDA acceleration via PyTorch/CuPy when available.

Simulation Data

Test datasets for two- and three-population admixture are available on Synapse: Synapse Project syn61691659.

Usage

Binary Conversion

RFMix does not generate binary files directly. Use create_binaries to generate them (also available as a CLI):

create-binaries two_pops/out/

from rfmix_reader import create_binaries

create_binaries("two_pops/out/", binary_dir="./binary_files")

Preparing Reference Data for Phasing

Use prepare-reference to convert bgzipped, indexed reference VCF/BCF files into per-chromosome VCF-Zarr stores that the phasing pipeline consumes. The command writes one <chrom>.zarr directory per input file.

prepare-reference -h

usage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]
                         [--samples-chunk-size SAMPLES_CHUNK_SIZE]
                         [--worker-processes WORKER_PROCESSES]
                         [--verbose | --no-verbose] [--version]
                         output_dir vcf_paths [vcf_paths ...]

Convert one or more bgzipped reference VCF/BCF files into Zarr stores.

positional arguments:
  output_dir            Directory where the Zarr outputs will be written.
  vcf_paths             Paths to reference VCF/BCF files (bgzipped and
                        indexed).

options:
  -h, --help            show this help message and exit
  --chunk-length CHUNK_LENGTH
                        Genomic chunk size for the output Zarr stores
                        (default: 100000).
  --samples-chunk-size SAMPLES_CHUNK_SIZE
                        Chunk size for samples in the output Zarr stores
                        (default: library chosen).
  --worker-processes WORKER_PROCESSES
                        Number of worker processes to use for conversion
                        (default: 0, use library default).
  --verbose, --no-verbose
                        Print progress messages (default: enabled).
  --version             Show the version of the program and exit.

Example data preparation:

# Sample annotations: two columns (no header): sample_id<TAB>group
cat > sample_annotations.tsv <<'EOF'
NA19700	AFR
NA19701	AFR
NA20847	EUR
EOF

# Convert chromosome VCFs into a reference store directory
prepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \
  --chunk-length 50000 --samples-chunk-size 512

Main Function

Once binaries are available, process RFMix results:

from rfmix_reader import read_rfmix

loci, g_anc, admix = read_rfmix("two_pops/out/")

Three Population Example

Binaries can also be generated on-the-fly within read_rfmix with generate_binary set to True.

loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
                                binary_dir="./binary_files",
                                generate_binary=True)

Phasing data

For optimal memory and computational speed, phasing is done per chromosome.

from rfmix_reader import phase_rfmix_chromosome_to_zarr

# Use the reference store + annotations during phasing
admix = phase_rfmix_chromosome_to_zarr(
    file_prefix="two_pops/out/",
    ref_zarr_root="refs",
    sample_annot_path="sample_annotations.tsv",
    output_path="./phased_chr21.zarr",
    chrom="21",
)

The chunking is suboptimal for phasing, so remember to rechunk before using for optimal processing. This is only needed when loading individual chromosomes. The merged data is already optimized.

local_array = admix["local_ancestry"].chunk({"variant": 20000, "sample": 100})
# Compute into memory, if needed
local_array = local_array.compute()

This also saves the data to Zarr for later merging or data processing.

merge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarr

Loci Imputation

The imputation workflow now lives in rfmix_reader.processing.imputation and is exported as interpolate_array. It interpolates the local ancestry matrix onto a denser variant grid and writes the result to <zarr_outdir>/local-ancestry.zarr as a Zarr array shaped (variants, samples, ancestries).

Inputs

variant_loci_df: a pandas DataFrame defining the variant grid. Provide at least chrom/pos and an i column that points to the source RFMix row index; rows with i set to NaN are treated as missing loci to interpolate. Sort the frame by genomic coordinate, and include pos if you plan to interpolate in base-pair space.
admix: the local ancestry Dask array returned by read_rfmix (shape (loci, samples, ancestries)).
zarr_outdir: an output directory where the new local-ancestry.zarr store will be created.

Key options

interpolation: "linear" (default), "nearest", or "stepwise".
use_bp_positions: set to True to interpolate along variant_loci_df['pos'] rather than treating loci as equally spaced indices.
chunk_size/batch_size: tune how many rows are materialized at a time when filling and interpolating the Zarr array.

Workflow example

import pandas as pd
from pathlib import Path
from rfmix_reader import interpolate_array, read_rfmix

# Load RFMix loci and local ancestry
loci_df, _, admix = read_rfmix("two_pops/out/", binary_dir="./binary_files")

# Build the variant grid by merging genotype sites with the RFMix loci index
variants = pd.read_parquet("genotypes/variants.parquet")  # must include chrom/pos
variants = variants.drop_duplicates(subset=["chrom", "pos"]).sort_values("pos")
variant_loci_df = (
    variants.merge(loci_df.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)
            .loc[:, ["chrom", "pos", "i", "_merge"]]
)

z = interpolate_array(
    variant_loci_df,
    admix,
    zarr_outdir=Path("./imputed_local_ancestry"),
    interpolation="linear",
    use_bp_positions=True,
    chunk_size=50_000,
)
print(z)

The interpolator uses GPU acceleration transparently when cupy and a CUDA PyTorch build are available; otherwise it falls back to NumPy. All methods other than "hmm" operate on diploid-summed trajectories and preserve the original ancestry dimension.

Phasing workflow (`rfmix_reader.processing.phase`)

rfmix_reader.processing.phase implements gnomix-style tail-flip corrections for local ancestry haplotypes. Phasing now lives outside read_rfmix so you can process each chromosome independently and write outputs directly to Zarr.

Required reference inputs

VCF-Zarr reference (ref_zarr_root): either a single *.zarr store or a directory containing per-chromosome stores (e.g., 1.zarr, chr1.zarr).
Sample annotations (sample_annot_path): two-column file mapping sample_id to group (ancestry label). One representative sample per group is pulled to build reference haplotypes.

Per-chromosome pipeline

(Optional) generate RFMix binary caches with create_binaries.
Call phase_rfmix_chromosome_to_zarr for each chromosome you want to process.
Optionally concatenate those per-chromosome Zarr stores with merge_phased_zarrs.

from rfmix_reader.processing.phase import (
    PhasingConfig,
    merge_phased_zarrs,
    phase_rfmix_chromosome_to_zarr,
)

# Phase chromosome 21 and write to Zarr
dataset = phase_rfmix_chromosome_to_zarr(
    file_prefix="examples/two_populations/out/",
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    output_path="/tmp/phased_chr21.zarr",
    chrom="21",
)

# Merge multiple per-chromosome Zarr stores
merged = merge_phased_zarrs(
    ["/tmp/phased_chr21.zarr", "/tmp/phased_chr22.zarr"],
    output_path="/tmp/phased_all.zarr",
)

If you need fine-grained control, you can still start from unphased outputs and call phase_admix_dask_with_index directly:

from rfmix_reader import read_rfmix
from rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index

loci_df, g_anc, admix, X_raw = read_rfmix(
    "examples/two_populations/out/",
    return_original=True,
    chrom="21",
)

config = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)
phased = phase_admix_dask_with_index(
    admix=admix,
    X_raw=X_raw,
    positions=loci_df.physical_position.to_numpy(),
    chrom=str(loci_df.chromosome.iloc[0]),
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    config=config,
)

Reading Haptools simulations

Use read_simu to load BGZF-compressed VCF files created by haptools simgenotype --pop_field:

from rfmix_reader import read_simu

loci_df, g_anc, admix = read_simu("/path/to/simulations/")

Haptools does not include the chromosome length in the ##contig header lines, but read_simu requires that metadata to index each VCF. Copy the contigs.txt file Haptools generates from the FASTA you used for simulation and reheader every file with the appropriate contig entry before calling read_simu. The following snippet shows one approach using bcftools and tabix:

CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"

CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
    echo "ERROR: No contig line found for ${CHR} in $CONTIGS"
    exit 1
fi

bcftools view -h "$IN" \
    | sed "s/^##contig=<ID=${CHR}>.*/${CONTIG_LINE}/" > header.${CHR}.tmp
bcftools reheader -h header.${CHR}.tmp -o "$OUT" "$IN"
tabix -p vcf "$OUT"

Visualization

read_rfmix, read_flare, and read_simu all return the same (loci_df, g_anc, admix) tuple, so the plotting utilities in rfmix_reader._visualization work identically for RFMix, FLARE, and Haptools-simulated inputs. The snippet below shows the typical workflow for each reader:

from rfmix_reader import (
    plot_ancestry_by_chromosome,
    plot_global_ancestry,
    read_flare,
    read_rfmix,
    read_simu,
)

# RFMix run directory
loci_df, g_anc, admix = read_rfmix("two_pops/out/")
plot_global_ancestry(g_anc, save_path="rfmix_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="rfmix_local.png")

# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)
loci_df, g_anc, admix = read_flare("flare_runs/chr1/")
plot_global_ancestry(g_anc, save_path="flare_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="flare_local.png")

# Haptools simulations (after reheadering contigs)
loci_df, g_anc, admix = read_simu("/path/to/simulations/")
plot_global_ancestry(g_anc, save_path="simu_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="simu_local.png")

plot_global_ancestry builds per-individual stacked bars of global ancestry while plot_ancestry_by_chromosome summarizes local ancestry along each chromosome, giving you quick visual QC for every supported input format.

Development Install

For contributors:

git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"

Citation

If you use this software, please cite:

Benjamin, K. J. M. (2024). RFMix-reader (Version 0.2.0) [Computer software]. https://github.com/heart-gen/rfmix_reader

Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv (2024). DOI: 10.1101/2024.07.13.603370.

Funding

This work was supported by the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.1

Dec 12, 2025

0.3.0

Dec 5, 2025

0.2.1

Nov 14, 2025

0.2.0

May 2, 2025

0.1.23rc0 pre-release

Mar 29, 2025

0.1.23b0 pre-release

Mar 28, 2025

0.1.23a0 pre-release

Mar 27, 2025

0.1.22rc0 pre-release

Mar 7, 2025

0.1.22b0 pre-release

Mar 6, 2025

0.1.22a0 pre-release

Feb 22, 2025

0.1.21

Jan 7, 2025

0.1.21a1 pre-release

Jan 3, 2025

0.1.21a0 pre-release

Jan 3, 2025

0.1.20

Dec 25, 2024

0.1.20b1 pre-release

Dec 25, 2024

0.1.20b0 pre-release

Dec 25, 2024

0.1.20a1 pre-release

Dec 25, 2024

0.1.20a0 pre-release

Dec 24, 2024

0.1.19

Dec 24, 2024

0.1.19a0 pre-release

Dec 24, 2024

0.1.18

Dec 24, 2024

0.1.17

Jul 23, 2024

0.1.16

Jul 8, 2024

0.1.15

Jul 2, 2024

0.1.14

Jun 28, 2024

0.1.13

Jun 27, 2024

0.1.12

Jun 22, 2024

0.1.12a0 pre-release

Jun 20, 2024

0.1.11

Jun 20, 2024

0.1.11a0 pre-release

Jun 19, 2024

0.1.10

Jun 19, 2024

0.1.10a0 pre-release

Jun 19, 2024

0.1.9

Jun 18, 2024

0.1.9a0 pre-release

Jun 18, 2024

0.1.8

Jun 18, 2024

0.1.7

Jun 12, 2024

0.1.6

Jun 11, 2024

0.1.5

Jun 11, 2024

0.1.4

Jun 3, 2024

0.1.3

Jun 2, 2024

0.1.2

Jun 2, 2024

0.1.0

May 31, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rfmix_reader-0.3.1.tar.gz (83.3 kB view details)

Uploaded Dec 12, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rfmix_reader-0.3.1-py3-none-any.whl (92.3 kB view details)

Uploaded Dec 12, 2025 Python 3

File details

Details for the file rfmix_reader-0.3.1.tar.gz.

File metadata

Download URL: rfmix_reader-0.3.1.tar.gz
Upload date: Dec 12, 2025
Size: 83.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.0 CPython/3.10.9 Linux/4.18.0-553.22.1.el8_10.x86_64

File hashes

Hashes for rfmix_reader-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`f5d5cff220d92b551d64e402c59723397b105cfa7d7e87a0fe53ecdcd8b35d52`
MD5	`06683b4528fd1a805eded22de2846907`
BLAKE2b-256	`86aa021c8ee7cf0378c6e3e06a7e4b730e152383cce5c5d6fd5f95345475cc96`

See more details on using hashes here.

File details

Details for the file rfmix_reader-0.3.1-py3-none-any.whl.

File metadata

Download URL: rfmix_reader-0.3.1-py3-none-any.whl
Upload date: Dec 12, 2025
Size: 92.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.0 CPython/3.10.9 Linux/4.18.0-553.22.1.el8_10.x86_64

File hashes

Hashes for rfmix_reader-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2841077f353a1c007890f006ab90704772df7b6383b9c09e123e3dcdc114f6cf`
MD5	`bc0a64da634e00d98f89ccfd227162ae`
BLAKE2b-256	`2a69933a178cabe1b3e1002d9a7630e24224e6e1cb38f03ca6e214018fe7f67b`

See more details on using hashes here.

rfmix-reader 0.3.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RFMix-reader

Installation

Installation Options

GPU Notes

Quickstart

Key Features

Simulation Data

Usage

Binary Conversion

Preparing Reference Data for Phasing

Main Function

Three Population Example

Phasing data

Loci Imputation

Phasing workflow (rfmix_reader.processing.phase)

Required reference inputs

Per-chromosome pipeline

Reading Haptools simulations

Visualization

Development Install

Citation

Funding

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Phasing workflow (`rfmix_reader.processing.phase`)