A Python package for efficient linkage disequilibrium calculation with covariate adjustment

ldcov

A Python package for efficient linkage disequilibrium (LD) calculation with covariate adjustment for BGEN format genetic data.

Key Features

BGEN format support: Efficient reading of BGEN v1.1/v1.2 files with mandatory BGI index, including streaming directly from Google Cloud Storage (gs://) without downloading
Covariate adjustment: Remove confounding effects via Frisch-Waugh-Lovell (FWL) projection, with optional pre-computed projection matrices (compute the QR decomposition once, reuse across analyses)
Flexible LD computation: With or without covariate adjustment, optionally filtered and ordered by a z-file or restricted to a genomic region
Multiple output formats: tab-delimited matrix, gzipped long format, or binary .bcor
BCOR index (.bcor.idx): Auto-emitted alongside .bcor outputs for O(1) rsid lookups and partial reads (including over GCS) without scanning metadata
Hail BlockMatrix LD extraction (--ld-bm): Read a partial submatrix of a Hail BlockMatrix LD store (e.g. gnomAD on GCS, Pan-UKB on AWS S3) in pure Python (no Hail/Spark) and export .bcor / .npz, selected by region, z-file, or index range (see below)

Installation

Requirements

Python ≥ 3.9

lazybgen, the BGEN reader dependency, installs automatically from PyPI as a prebuilt binary wheel (Linux, macOS arm64), so no compiler is required.

Standard Installation

# Install from PyPI
pip install ldcov

# Read BlockMatrix LD from AWS S3 (e.g. Pan-UKB) as well
pip install "ldcov[s3]"

# Latest development version from GitHub
pip install git+https://github.com/mkanai/ldcov

# For development
git clone https://github.com/mkanai/ldcov
cd ldcov
pip install -e ".[dev]"

BGEN reading via lazybgen

ldcov reads BGEN files through lazybgen, a standalone high-performance reader (formerly vendored inside ldcov). It statically links zlib-ng (an optimized zlib replacement) and zstd for speed, and supports partial reads directly from local files and cloud object stores (GCS built-in; S3 via the s3 extra).

lazybgen is installed automatically as a dependency. All BGEN files must have accompanying BGI index files (create with bgenix -g file.bgen).

Usage

Cloud Storage (GCS) Support

ldcov can read BGEN files directly from Google Cloud Storage without downloading:

# Read BGEN from GCS
ldcov --bgen gs://bucket/data.bgen --compute-ld --out results

# With covariate adjustment (covariates can also be on GCS)
ldcov --bgen gs://bucket/data.bgen -c gs://bucket/covariates.txt --compute-ld --out results

# BGI index files are automatically downloaded to the current directory

Requirements:

gcsfs (installed as a dependency)
BGI index files (.bgen.bgi) must exist alongside BGEN files on GCS
Appropriate GCS credentials configured (via gcloud, service account, etc.)

How it works:

BGEN files are streamed from GCS using efficient range requests
BGI index files are downloaded to the current directory (as bcftools does for remote indexes)
Smart buffering minimizes API calls and latency
Compatible with all existing ldcov features

Command-Line Interface

Flags select which operations the CLI performs:

# Compute LD only (no covariate adjustment)
ldcov --bgen input.bgen --out output --compute-ld

# Compute LD with covariate adjustment
ldcov --bgen input.bgen --out output --compute-ld -c covariates.txt

# Use specific columns as covariates
ldcov --bgen input.bgen --out output --compute-ld -c covariates.txt --covariate-cols PC1 PC2 PC3

# With region filtering
ldcov --bgen input.bgen --out output --compute-ld --region 1:1000000-2000000

# With Z-file for variant filtering and ordering
ldcov --bgen input.bgen --out output --compute-ld --z variants.z

# Specify custom BGEN index file
ldcov --bgen input.bgen --bgi input.bgen.bgi --out output --compute-ld

# Use covariate file from Google Cloud Storage
ldcov --bgen input.bgen --out output --compute-ld -c gs://bucket/covariates.txt

Pre-computed Projection Matrices

For large-scale analyses, you can pre-compute the covariate projection matrix once and reuse it:

# Step 1: Pre-compute projection matrix from covariates
ldcov --precompute-projection -c covariates.txt --sample data.sample --out myproject

# Step 2: Use pre-computed projection for LD computation
ldcov --bgen chr1.bgen --projection-matrix myproject.proj.npz --compute-ld --out chr1_results
ldcov --bgen chr2.bgen --projection-matrix myproject.proj.npz --compute-ld --out chr2_results
# ... process all chromosomes with the same projection matrix

# Alternative: Compute LD and save projection matrix for future use
ldcov --bgen input.bgen -c covariates.txt --compute-ld --save-projection --out results

This is particularly useful for:

Processing multiple genomic regions with the same covariates
Distributed computing across a cluster
Iterative analyses with different variant filters
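
As a rough sketch of the multi-chromosome case, the two-step workflow above can be turned into a list of commands, one per chromosome, that all share a single projection file. The helper name `projection_commands` is hypothetical, and the `chr{N}.bgen` / `chr{N}_results` naming simply follows the example above; only flags already shown in this README are used.

```python
import shlex


def projection_commands(covariates, sample_file, prefix, chromosomes):
    """Build the precompute command plus one LD command per chromosome."""
    # --precompute-projection writes {out}.proj.npz (see Output Files).
    projection = f"{prefix}.proj.npz"
    commands = [[
        "ldcov", "--precompute-projection",
        "-c", covariates, "--sample", sample_file, "--out", prefix,
    ]]
    for chrom in chromosomes:
        commands.append([
            "ldcov", "--bgen", f"chr{chrom}.bgen",
            "--projection-matrix", projection,
            "--compute-ld", "--out", f"chr{chrom}_results",
        ])
    return commands


commands = projection_commands("covariates.txt", "data.sample", "myproject", range(1, 23))
for cmd in commands[:3]:
    print(shlex.join(cmd))  # shell-quoted form, ready to paste or submit to a scheduler
```

Each entry can be passed to `subprocess.run(cmd, check=True)` or written to a job script; the per-chromosome commands are independent of each other once the projection file exists.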

Output Files

Based on the flags used, ldcov will create:

--compute-ld --output-format matrix: {out}.ld (default; tab-delimited matrix)
--compute-ld --output-format long: {out}.ld.gz (gzipped long format)
--compute-ld --output-format bcor: {out}.bcor (binary correlation format) and, by default, a {out}.bcor.idx index file (pass --no-bcor-idx to skip)
--precompute-projection or --save-projection: {out}.proj.npz

BCOR Index

When --output-format bcor is selected, ldcov also writes a small .bcor.idx index file that maps rsid to row and records per-variant byte offsets. This lets BcorReader resolve rsid-based queries without scanning the variable-length metadata block, which matters most when reading remote files:

from ldcov.io import BcorReader

# Local or gs://, same API. The .bcor.idx auto-loads if present alongside the .bcor.
reader = BcorReader("gs://bucket/study.bcor")

# Partial read by rsid, fetching only the bytes needed (range-merged, parallelized for GCS).
subset, meta = reader.read_corr_by_rsid(["rs1234", "rs5678", "rs9012"])

# Two-list (asymmetric) query; rsids_a and rsids_b are lists of rsids defined elsewhere.
subset, meta = reader.read_corr_by_rsid(rsids_a, rsids2=rsids_b)

To generate an index for an existing .bcor file (e.g., LDstore output), run the helper script from a clone of the ldcov repository (it is not installed with the package):

python scripts/make_bcor_idx.py path/to/file.bcor

The index binds to its parent .bcor via a header-level fingerprint, so stale or truncated pairs are detected at load time and the reader falls back gracefully.
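
To illustrate why per-variant byte offsets help with remote reads, here is a minimal sketch of range merging: adjacent or nearby byte ranges are coalesced so that fewer requests are issued. This is not ldcov's implementation; the `merge_ranges` helper, the `max_gap` parameter, and the offsets below are all made up for illustration.

```python
def merge_ranges(ranges, max_gap=0):
    """Coalesce half-open (start, end) byte ranges whose gap is <= max_gap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical per-variant byte offsets, as an index file might record them.
offsets = {"rs1234": (0, 400), "rs5678": (400, 800), "rs9012": (5000, 5400)}
wanted = [offsets[rsid] for rsid in ["rs9012", "rs1234", "rs5678"]]
requests = merge_ranges(wanted)
print(requests)  # [(0, 800), (5000, 5400)]
```

Three variants collapse into two range requests here. A larger `max_gap` trades a few wasted bytes for fewer round trips, which usually pays off on object stores where latency dominates.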

Python API

The package exposes high-level helpers as well as lower-level functions that can be combined in custom workflows:

import ldcov

# Load and adjust genotypes
standardized_genotypes, variant_info, sample_ids, means, norms = ldcov.load_and_adjust_genotypes(
    genotype_file="data.bgen",
    covariate_file="covariates.txt",  # Optional
    region="1:1000000-2000000",        # Optional
    z_file="variants.z"                # Optional
)

# Compute LD from standardized genotypes
ldcov.compute_ld_from_standardized(
    standardized_genotypes=standardized_genotypes,
    variant_info=variant_info,
    output_file="output.ld"
)

# Lower-level functions for custom workflows
genotypes, variant_info, sample_ids = ldcov.load_bgen("data.bgen")
standardized, means, norms = ldcov.standardize_genotypes(genotypes)
# `covariates` is assumed to be a covariate matrix already loaded and aligned to sample_ids
adjusted = ldcov.regress_out_covariates(standardized, covariates)

# Pre-computed projection matrix workflow
from ldcov.compute.projection import compute_projection_matrix, save_projection_matrix, load_projection_matrix

# Pre-compute projection
projection_data = compute_projection_matrix(
    covariate_file="covariates.txt",
    sample_ids=sample_ids
)
save_projection_matrix(projection_data, "myproject.proj.npz")

# Later: Load and use projection
projection_data = load_projection_matrix("myproject.proj.npz")
adjusted = ldcov.regress_out_covariates(
    standardized_genotypes,
    projection_matrix_Q=projection_data.Q
)

Extracting LD from a Hail BlockMatrix (`--ld-bm`)

Read a submatrix of a Hail BlockMatrix LD store (e.g. gnomAD) directly from cloud storage (no Hail/Spark) and export it as .bcor (plus a .variants.tsv and optional .npz).

One-time: build the variant index

The variant-to-matrix-index mapping lives in the matrix's companion variant_indices.ht. Convert it once to a Parquet variant index on a machine with Hail installed, using the helper script from a clone of the ldcov repository (it is not installed with the package):

python scripts/make_bm_variant_index.py \
    --ht gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.variant_indices.ht \
    --out gnomad_v2.nfe.b37.variant_index.parquet

The builder fails loudly on any multiallelic/monomorphic variant (LD matrices require split variants).

Pre-computed variant indexes (gnomAD and Pan-UKB)

To skip the one-time Hail step, pre-computed variant indexes for the gnomAD and Pan-UKB LD matrices are hosted at gs://ldcov-requester-pays/. The bucket is requester-pays, so reads are billed to your own project (pass --billing-project YOUR_PROJECT to gcloud storage). Both individual Parquet files and per-dataset .tar.gz bundles are available:

# List what's available
gcloud storage ls -r gs://ldcov-requester-pays/ --billing-project YOUR_PROJECT

# Download a single variant index (named <dataset>.<pop>.<build>.variant_index.parquet)
gcloud storage cp gs://ldcov-requester-pays/gnomad_v2.nfe.b37.variant_index.parquet . \
    --billing-project YOUR_PROJECT

# Or grab a whole dataset as a tar.gz bundle
gcloud storage cp gs://ldcov-requester-pays/bundles/gnomad_v2.b37.variant_index.tar.gz . \
    --billing-project YOUR_PROJECT
tar -xzf gnomad_v2.b37.variant_index.tar.gz

Per-dataset bundles:

gnomAD GRCh37: gs://ldcov-requester-pays/bundles/gnomad_v2.b37.variant_index.tar.gz
gnomAD GRCh38: gs://ldcov-requester-pays/bundles/gnomad_v2.b38.variant_index.tar.gz
Pan-UKB GRCh37: gs://ldcov-requester-pays/bundles/panukb.b37.variant_index.tar.gz
Pan-UKB GRCh38: gs://ldcov-requester-pays/bundles/panukb.b38.variant_index.tar.gz

Indexes come in b37 (GRCh37) and b38 (GRCh38) builds; pick the one matching your z-file / region coordinates. gnomAD populations: {afr, amr, asj, eas, est, fin, nfe, nwe, seu}. Pan-UKB populations: {AFR, AMR, CSA, EAS, EUR, MID}. Point --variant-index at the downloaded Parquet; no Hail install required.

Extract by region, z-file, or idx-range

# Genomic region
ldcov --ld-bm \
    --bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
    --variant-index gnomad_v2.nfe.b37.variant_index.parquet \
    --region 1:55000000-55100000 \
    --out region

# FINEMAP/SuSiE z-file (variants matched by locus+alleles; swapped alleles are sign-flipped).
ldcov --ld-bm \
    --bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
    --variant-index gnomad_v2.nfe.b37.variant_index.parquet \
    --z mystudy.z --out study --output-format both

# Explicit BlockMatrix index range
ldcov --ld-bm \
    --bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
    --variant-index gnomad_v2.nfe.b37.variant_index.parquet \
    --idx-range 5000:5500 --out slice

Outputs: PREFIX.bcor (+ .bcor.idx), PREFIX.variants.tsv (matrix-row order, with flipped / matched columns), and PREFIX.npz when --output-format npz|both. Pairs outside the matrix's stored band are filled with NaN (or 0 with --fill zero) and reported.

The needed blocks are fetched concurrently from cloud storage; tune with --fetch-workers N (default 4) and --block-cache N (decoded-block LRU, default 4). For unmatched z-file variants, --on-missing {warn,error,drop} controls the behavior. A ~10K-variant (3 Mb) region exports to .bcor in a few seconds.
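
The sketch below shows, under stated assumptions, how an index range maps to the blocks that must be fetched, and how a small LRU cache and a thread pool fit around that. The block size of 4096 is an assumption (the actual value is recorded in the matrix's own metadata), `fetch_block` is a placeholder that returns a label instead of reading remote data, and none of the names are ldcov's API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def blocks_for_range(start, stop, block_size):
    """Return (block_row, block_col) pairs covering rows and columns [start, stop)."""
    first, last = start // block_size, (stop - 1) // block_size
    return [(i, j) for i in range(first, last + 1) for j in range(first, last + 1)]


@lru_cache(maxsize=4)  # same idea as a decoded-block LRU (--block-cache 4)
def fetch_block(block_row, block_col):
    # Placeholder for a remote read plus decode.
    return f"block-{block_row}-{block_col}"


# A range that straddles a block boundary needs four blocks ...
needed = blocks_for_range(4000, 4500, block_size=4096)
# ... while one that sits inside a single block needs one.
single = blocks_for_range(5000, 5500, block_size=4096)

with ThreadPoolExecutor(max_workers=4) as pool:  # same idea as --fetch-workers 4
    blocks = list(pool.map(lambda rc: fetch_block(*rc), needed))

print(needed)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(single)  # [(1, 1)]
```

A real store may hold only part of the matrix (the stored band mentioned above), so some of these blocks may be absent or partially filled.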

Pan-UKB LD on AWS S3

The Pan-UKB LD matrices are public Hail BlockMatrices on S3 in the same format. Reading s3:// requires the S3 extra; pair it with a pre-computed Pan-UKB variant index (above) and extract as usual:

pip install "ldcov[s3]"

# Extract (s3:// is read anonymously by default)
ldcov --ld-bm \
    --bm s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.bm \
    --variant-index panukb.EUR.b37.variant_index.parquet \
    --region 1:55000000-55100000 \
    --out panukb_eur

Public-bucket reads are anonymous by default. To use credentials, a custom endpoint, or requester-pays, pass --storage-options as a JSON dict, e.g. --storage-options '{}' to force the normal AWS credential chain, or --storage-options '{"key": "AKIA...", "secret": "..."}'.

Python API (BlockMatrix)

from ldcov.ld_bm import extract_ld

matrix, variants = extract_ld(
    bm_path="gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm",
    variant_index_path="gnomad_v2.nfe.b37.variant_index.parquet",
    region="1:55000000-55100000",
    out="region",
)

Covariate File Format

Covariates should be provided as a text file with:

A header row with column names
A sample-ID column, named IID by default, whose values must match the BGEN sample IDs (use --covariate-id-col to name a different column)
One or more additional columns holding covariate values (numeric or categorical)

Example:

IID     PC1     PC2     batch
SAMPLE1 0.032   -0.011  A
SAMPLE2 -0.021  0.043   B

Supported formats: CSV, TSV, or whitespace-delimited text files. Files can be loaded from the local filesystem or from Google Cloud Storage (gs://).
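
As an illustration of what such a file turns into, the sketch below parses the whitespace-delimited example above and aligns it to a given sample order. How ldcov itself encodes categorical columns is not described here; dummy coding with the first level dropped is an assumption of this sketch, and both function names are hypothetical.

```python
import io


def read_covariates(handle, id_col="IID"):
    """Parse whitespace-delimited text into {sample_id: {column: value}}."""
    header = handle.readline().split()
    rows = {}
    for line in handle:
        if not line.strip():
            continue
        record = dict(zip(header, line.split()))
        rows[record.pop(id_col)] = record
    return header, rows


def design_matrix(rows, sample_ids, numeric_cols, categorical_cols):
    """Align rows to sample order; dummy-code categoricals, dropping the first level."""
    levels = {c: sorted({rows[s][c] for s in sample_ids})[1:] for c in categorical_cols}
    matrix = []
    for s in sample_ids:
        row = [float(rows[s][c]) for c in numeric_cols]
        for c in categorical_cols:
            row += [1.0 if rows[s][c] == level else 0.0 for level in levels[c]]
        matrix.append(row)
    return matrix


text = (
    "IID     PC1     PC2     batch\n"
    "SAMPLE1 0.032   -0.011  A\n"
    "SAMPLE2 -0.021  0.043   B\n"
)
header, rows = read_covariates(io.StringIO(text))
# Sample order comes from the genotype file, not from the covariate file.
X = design_matrix(rows, ["SAMPLE2", "SAMPLE1"], ["PC1", "PC2"], ["batch"])
print(X)  # [[-0.021, 0.043, 1.0], [0.032, -0.011, 0.0]]
```

Dropping one level per categorical column keeps the design matrix full rank once an intercept is added.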

Z-file Format

Z-files specify variants to include and their order:

rsid        chromosome  position  allele1  allele2
rs123456    1          1000000   A        G
rs789012    1          1000100   C        T

Allele convention: allele1 = ref, allele2 = alt. This applies throughout ldcov, including --ld-bm extraction, where z-file variants are matched to the LD matrix by locus and alleles and swapped alleles are sign-flipped.
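
The matching rule can be sketched as follows. A variant is looked up by locus and alleles, first as given and then with the alleles swapped; a swapped match carries a sign of -1, and each correlation is multiplied by the signs of both variants. The function names and the toy index are hypothetical and only illustrate the idea.

```python
def match_variant(z_variant, index):
    """Look up (chrom, pos, allele1, allele2) in an index keyed the same way.

    Returns (row, sign): +1 for an exact match, -1 when the alleles are
    swapped, and (None, 0) when the variant is absent.
    """
    chrom, pos, a1, a2 = z_variant
    if (chrom, pos, a1, a2) in index:
        return index[(chrom, pos, a1, a2)], 1
    if (chrom, pos, a2, a1) in index:
        return index[(chrom, pos, a2, a1)], -1
    return None, 0


def flip_signs(ld, signs):
    """Apply r_ij -> s_i * s_j * r_ij so flipped variants change sign consistently."""
    n = len(signs)
    return [[signs[i] * signs[j] * ld[i][j] for j in range(n)] for i in range(n)]


# Toy LD-matrix index: the second variant is stored with its alleles the other way round.
index = {("1", 1000000, "A", "G"): 0, ("1", 1000100, "T", "C"): 1}
z = [("1", 1000000, "A", "G"), ("1", 1000100, "C", "T")]

matches = [match_variant(v, index) for v in z]
signs = [sign for _, sign in matches]
ld = [[1.0, 0.3], [0.3, 1.0]]
print(flip_signs(ld, signs))  # [[1.0, -0.3], [-0.3, 1.0]]
```

The diagonal is unchanged because each variant's sign is squared there.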

Technical Details

Genotype Standardization

Genotypes are standardized using L2 normalization:

Center by subtracting the mean
Scale by dividing by the L2 norm

This ensures that the dot product of two standardized genotype vectors equals their Pearson correlation coefficient.
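
A small pure-Python check of this identity, using made-up dosage vectors: each vector is centered and divided by its L2 norm, and the dot product is compared with the textbook Pearson formula. The function names are illustrative and not part of ldcov.

```python
import math


def standardize(values):
    """Center a vector and divide by its L2 norm; returns (z, mean, norm)."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered], mean, norm


def pearson(x, y):
    """Textbook Pearson correlation, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


g1 = [0, 1, 2, 1, 0, 2]  # made-up genotype dosages for six samples
g2 = [0, 1, 1, 2, 0, 2]
z1, _, _ = standardize(g1)
z2, _, _ = standardize(g2)

r = sum(a * b for a, b in zip(z1, z2))
print(abs(r - pearson(g1, g2)) < 1e-12)
```

A monomorphic variant has a norm of zero after centering, so this sketch would divide by zero; such variants need to be filtered or handled separately.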

Covariate Adjustment

The package uses Frisch-Waugh-Lovell (FWL) projection to remove covariate effects:

Standardize genotypes
Compute QR decomposition of the covariate matrix (with intercept)
Project out covariates using Q, the orthonormal factor of that decomposition (the adjusted genotypes are G - Q Q^T G)
The residuals represent genotypes adjusted for covariate effects

For efficiency, the QR decomposition can be pre-computed once and reused across multiple analyses, as the projection matrix Q depends only on the covariates, not the genotypes.
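
The steps above can be sketched in pure Python as follows. Modified Gram-Schmidt stands in for a library QR decomposition, and rescaling the residuals to unit L2 norm (so that their dot product is again a correlation) is an assumption of this sketch rather than a documented detail of ldcov. The data are made up and the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(columns):
    """Modified Gram-Schmidt: the Q factor of a thin QR of the given columns."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            proj = dot(q, v)
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip linearly dependent columns
            basis.append([a / norm for a in v])
    return basis


def residualize(g, basis):
    """Remove the covariate span from g (g - Q Q^T g), then rescale to unit L2 norm."""
    r = list(g)
    for q in basis:
        proj = dot(q, r)
        r = [a - proj * b for a, b in zip(r, q)]
    norm = math.sqrt(dot(r, r))
    return [a / norm for a in r]


n = 6
intercept = [1.0] * n
pc1 = [0.5, -0.2, 0.1, 0.9, -0.7, -0.6]      # made-up covariate
Q = orthonormal_basis([intercept, pc1])      # computed once, reused for every variant

g1 = [0, 1, 2, 1, 0, 2]
g2 = [0, 1, 1, 2, 0, 2]
a1, a2 = residualize(g1, Q), residualize(g2, Q)
adjusted_r = dot(a1, a2)
print(round(adjusted_r, 6))
```

Because the intercept is part of the basis, centering happens as part of the projection. With a single covariate the result coincides with the classical partial correlation of the two variants given that covariate.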

Dependencies

Installed automatically with the package:

lazybgen >= 0.1 (BGEN reader; prebuilt binary wheel)
numpy >= 1.19.0
pandas >= 1.0.0
gcsfs >= 0.7.0
fsspec >= 2021.0.0
lz4 >= 3.1.0
pyarrow >= 6.0.0

Optional extra ldcov[s3] adds s3fs for reading BlockMatrix LD from AWS S3 (e.g. Pan-UKB).

License

MIT License

Citation

Kanai, M. et al. Population-scale multiome immune cell atlas reveals complex disease drivers. medRxiv (2025)

Contact

Masahiro Kanai (mkanai@broadinstitute.org)

Project details

Maintainers

mkanai

Release history

This version

0.4.0

Jun 24, 2026

Download files

Source Distribution

ldcov-0.4.0.tar.gz (276.5 kB)

Uploaded Jun 24, 2026 Source

Built Distribution

ldcov-0.4.0-py3-none-any.whl (65.2 kB)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file ldcov-0.4.0.tar.gz.

File metadata

Download URL: ldcov-0.4.0.tar.gz
Upload date: Jun 24, 2026
Size: 276.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ldcov-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`f09ba2167c396ce8e3d9d451dbcac0cd37228dd40f2fe59b988ff577f404f912`
MD5	`100ac82b9762e43327bcf62be531aca6`
BLAKE2b-256	`9c13a07dbb29ecfb0dca81111843038694385654ab52860ddb16307166a9bd2c`

Provenance

The following attestation bundles were made for ldcov-0.4.0.tar.gz:

Publisher: publish.yml on mkanai/ldcov

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ldcov-0.4.0.tar.gz
- Subject digest: f09ba2167c396ce8e3d9d451dbcac0cd37228dd40f2fe59b988ff577f404f912
- Sigstore transparency entry: 1939816784
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: mkanai/ldcov@ea8fcc98de2b04a8a94b1fb924439be59de21b3e
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/mkanai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ea8fcc98de2b04a8a94b1fb924439be59de21b3e
- Trigger Event: release

File details

Details for the file ldcov-0.4.0-py3-none-any.whl.

File metadata

Download URL: ldcov-0.4.0-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 65.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ldcov-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a73b2e0ced00f299b3bb489ed37bf0f0240fedf0ebbe42c5a13c044dc68cb416`
MD5	`d62f2c29cf9d6f61cfeb7169da8693a6`
BLAKE2b-256	`abece703918eb2af14ac421b494f82e0d0fec423017baa7d34d9263446635a82`

Provenance

The following attestation bundles were made for ldcov-0.4.0-py3-none-any.whl:

Publisher: publish.yml on mkanai/ldcov

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ldcov-0.4.0-py3-none-any.whl
- Subject digest: a73b2e0ced00f299b3bb489ed37bf0f0240fedf0ebbe42c5a13c044dc68cb416
- Sigstore transparency entry: 1939816894
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: mkanai/ldcov@ea8fcc98de2b04a8a94b1fb924439be59de21b3e
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/mkanai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ea8fcc98de2b04a8a94b1fb924439be59de21b3e
- Trigger Event: release
