ldcov
A Python package for efficient linkage disequilibrium (LD) calculation with covariate adjustment for BGEN format genetic data.
Key Features
- BGEN format support: Efficient reading of BGEN v1.1/v1.2 files with mandatory BGI index, including streaming directly from Google Cloud Storage (gs://) without downloading
- Covariate adjustment: Remove confounding effects via Frisch-Waugh-Lovell (FWL) projection, with optional pre-computed projection matrices (compute the QR decomposition once, reuse across analyses)
- Flexible LD computation: With or without covariate adjustment, optionally filtered and ordered by a z-file or restricted to a genomic region
- Multiple output formats: tab-delimited matrix, gzipped long format, or binary .bcor
- BCOR index (.bcor.idx): Auto-emitted alongside .bcor outputs for O(1) rsid lookups and partial reads (including over GCS) without scanning metadata
- Hail BlockMatrix LD extraction (--ld-bm): Read a partial submatrix of a Hail BlockMatrix LD store (e.g. gnomAD on GCS, Pan-UKB on AWS S3) in pure Python (no Hail/Spark) and export .bcor/.npz, selected by region, z-file, or index range (see below)
Installation
Requirements
- Python ≥ 3.9
lazybgen, the BGEN reader dependency, installs automatically from PyPI as a prebuilt binary wheel (Linux, macOS arm64), so no compiler is required.
Standard Installation
# Install from PyPI
pip install ldcov
# Read BlockMatrix LD from AWS S3 (e.g. Pan-UKB) as well
pip install "ldcov[s3]"
# Latest development version from GitHub
pip install git+https://github.com/mkanai/ldcov
# For development
git clone https://github.com/mkanai/ldcov
cd ldcov
pip install -e ".[dev]"
BGEN reading via lazybgen
ldcov reads BGEN files through lazybgen, a
standalone high-performance reader (formerly vendored inside ldcov). It statically
links zlib-ng (an optimized zlib replacement) and zstd for speed, and
supports partial reads directly from local files and cloud object stores (GCS
built-in; S3 via the s3 extra).
lazybgen is installed automatically as a dependency. All BGEN files must have
accompanying BGI index files (create with bgenix -g file.bgen).
Usage
Cloud Storage (GCS) Support
ldcov can read BGEN files directly from Google Cloud Storage without downloading:
# Read BGEN from GCS
ldcov --bgen gs://bucket/data.bgen --compute-ld --out results
# With covariate adjustment (covariates can also be on GCS)
ldcov --bgen gs://bucket/data.bgen -c gs://bucket/covariates.txt --compute-ld --out results
# BGI index files are automatically downloaded to current directory
Requirements:
- gcsfs (installed as a dependency)
- BGI index files (.bgen.bgi) must exist alongside BGEN files on GCS
- Appropriate GCS credentials configured (via gcloud, service account, etc.)
How it works:
- BGEN files are streamed from GCS using efficient range requests
- BGI index files are downloaded to current directory (like bcftools)
- Smart buffering minimizes API calls and latency
- Compatible with all existing ldcov features
Command-Line Interface
The CLI uses flexible flags to control what operations to perform:
# Compute LD only (no covariate adjustment)
ldcov --bgen input.bgen --out output --compute-ld
# Compute LD with covariate adjustment
ldcov --bgen input.bgen --out output --compute-ld -c covariates.txt
# Use specific columns as covariates
ldcov --bgen input.bgen --out output --compute-ld -c covariates.txt --covariate-cols PC1 PC2 PC3
# With region filtering
ldcov --bgen input.bgen --out output --compute-ld --region 1:1000000-2000000
# With Z-file for variant filtering and ordering
ldcov --bgen input.bgen --out output --compute-ld --z variants.z
# Specify custom BGEN index file
ldcov --bgen input.bgen --bgi input.bgen.bgi --out output --compute-ld
# Use covariate file from Google Cloud Storage
ldcov --bgen input.bgen --out output --compute-ld -c gs://bucket/covariates.txt
Pre-computed Projection Matrices
For large-scale analyses, you can pre-compute the covariate projection matrix once and reuse it:
# Step 1: Pre-compute projection matrix from covariates
ldcov --precompute-projection -c covariates.txt --sample data.sample --out myproject
# Step 2: Use pre-computed projection for LD computation
ldcov --bgen chr1.bgen --projection-matrix myproject.proj.npz --compute-ld --out chr1_results
ldcov --bgen chr2.bgen --projection-matrix myproject.proj.npz --compute-ld --out chr2_results
# ... process all chromosomes with the same projection matrix
# Alternative: Compute LD and save projection matrix for future use
ldcov --bgen input.bgen -c covariates.txt --compute-ld --save-projection --out results
This is particularly useful for:
- Processing multiple genomic regions with the same covariates
- Distributed computing across a cluster
- Iterative analyses with different variant filters
Output Files
Based on the flags used, ldcov will create:
- --compute-ld --output-format matrix: {out}.ld (default; tab-delimited matrix)
- --compute-ld --output-format long: {out}.ld.gz (gzipped long format)
- --compute-ld --output-format bcor: {out}.bcor (binary correlation format) and, by default, a {out}.bcor.idx index file (pass --no-bcor-idx to skip)
- --precompute-projection or --save-projection: {out}.proj.npz
BCOR Index
When --output-format bcor is selected, ldcov also writes a small .bcor.idx index file that maps rsid to row and records per-variant byte offsets. This lets BcorReader resolve rsid-based queries without scanning the variable-length metadata block, which matters most when reading remote files:
from ldcov.io import BcorReader
# Local or gs://, same API. The .bcor.idx auto-loads if present alongside the .bcor.
reader = BcorReader("gs://bucket/study.bcor")
# Partial read by rsid, fetching only the bytes needed (range-merged, parallelized for GCS).
subset, meta = reader.read_corr_by_rsid(["rs1234", "rs5678", "rs9012"])
# Two-list (asymmetric) query.
subset, meta = reader.read_corr_by_rsid(rsids_a, rsids2=rsids_b)
To generate an index for an existing .bcor file (e.g., LDstore output), run the
helper script from a clone of the ldcov repository (it is not installed with the package):
python scripts/make_bcor_idx.py path/to/file.bcor
The index binds to its parent .bcor via a header-level fingerprint, so stale or truncated pairs are detected at load time and the reader falls back gracefully.
Python API
The package provides modular functions for flexibility:
import ldcov
# Load and adjust genotypes
standardized_genotypes, variant_info, sample_ids, means, norms = ldcov.load_and_adjust_genotypes(
    genotype_file="data.bgen",
    covariate_file="covariates.txt",  # Optional
    region="1:1000000-2000000",  # Optional
    z_file="variants.z",  # Optional
)
# Compute LD from standardized genotypes
ldcov.compute_ld_from_standardized(
    standardized_genotypes=standardized_genotypes,
    variant_info=variant_info,
    output_file="output.ld",
)
# Lower-level functions for custom workflows
genotypes, variant_info, sample_ids = ldcov.load_bgen("data.bgen")
standardized, means, norms = ldcov.standardize_genotypes(genotypes)
adjusted = ldcov.regress_out_covariates(standardized, covariates)
# Pre-computed projection matrix workflow
from ldcov.compute.projection import compute_projection_matrix, save_projection_matrix, load_projection_matrix
# Pre-compute projection
projection_data = compute_projection_matrix(
    covariate_file="covariates.txt",
    sample_ids=sample_ids,
)
save_projection_matrix(projection_data, "myproject.proj.npz")
# Later: Load and use projection
projection_data = load_projection_matrix("myproject.proj.npz")
adjusted = ldcov.regress_out_covariates(
    standardized_genotypes,
    projection_matrix_Q=projection_data.Q,
)
Extracting LD from a Hail BlockMatrix (--ld-bm)
Read a submatrix of a Hail BlockMatrix LD store (e.g. gnomAD) directly from cloud storage
(no Hail/Spark) and export it as .bcor (plus a .variants.tsv and optional .npz).
One-time: build the variant index
The variant-to-matrix-index mapping lives in the matrix's companion variant_indices.ht. Convert it
once to a Parquet variant index on a machine with Hail installed, using the helper script from a
clone of the ldcov repository (it is not installed with the package):
python scripts/make_bm_variant_index.py \
--ht gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.variant_indices.ht \
--out gnomad_v2.nfe.b37.variant_index.parquet
The builder fails loudly on any multiallelic/monomorphic variant (LD matrices require split variants).
Pre-computed variant indexes (gnomAD and Pan-UKB)
To skip the one-time Hail step, pre-computed variant indexes for the gnomAD and Pan-UKB LD matrices
are hosted at gs://ldcov-requester-pays/. The bucket is requester-pays, so reads are billed to your
own project (pass --billing-project YOUR_PROJECT to gcloud storage). Both individual Parquet files
and per-dataset .tar.gz bundles are available:
# List what's available
gcloud storage ls -r gs://ldcov-requester-pays/ --billing-project YOUR_PROJECT
# Download a single variant index (named <dataset>.<pop>.<build>.variant_index.parquet)
gcloud storage cp gs://ldcov-requester-pays/gnomad_v2.nfe.b37.variant_index.parquet . \
--billing-project YOUR_PROJECT
# Or grab a whole dataset as a tar.gz bundle
gcloud storage cp gs://ldcov-requester-pays/bundles/gnomad_v2.b37.variant_index.tar.gz . \
--billing-project YOUR_PROJECT
tar -xzf gnomad_v2.b37.variant_index.tar.gz
Per-dataset bundles:
- gnomAD GRCh37: gs://ldcov-requester-pays/bundles/gnomad_v2.b37.variant_index.tar.gz
- gnomAD GRCh38: gs://ldcov-requester-pays/bundles/gnomad_v2.b38.variant_index.tar.gz
- Pan-UKB GRCh37: gs://ldcov-requester-pays/bundles/panukb.b37.variant_index.tar.gz
- Pan-UKB GRCh38: gs://ldcov-requester-pays/bundles/panukb.b38.variant_index.tar.gz
Indexes come in b37 (GRCh37) and b38 (GRCh38) builds; pick the one matching your z-file / region
coordinates. gnomAD populations: {afr, amr, asj, eas, est, fin, nfe, nwe, seu}. Pan-UKB populations:
{AFR, AMR, CSA, EAS, EUR, MID}. Point --variant-index at the downloaded Parquet; no Hail install
required.
Extract by region, z-file, or idx-range
# Genomic region
ldcov --ld-bm \
--bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
--variant-index gnomad_v2.nfe.b37.variant_index.parquet \
--region 1:55000000-55100000 \
--out region
# FINEMAP/SuSiE z-file (variants matched by locus+alleles; swapped alleles are sign-flipped).
ldcov --ld-bm \
--bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
--variant-index gnomad_v2.nfe.b37.variant_index.parquet \
--z mystudy.z --out study --output-format both
# Explicit BlockMatrix index range
ldcov --ld-bm \
--bm gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm \
--variant-index gnomad_v2.nfe.b37.variant_index.parquet \
--idx-range 5000:5500 --out slice
Outputs: PREFIX.bcor (+ .bcor.idx), PREFIX.variants.tsv (matrix-row order, with flipped /
matched columns), and PREFIX.npz when --output-format npz|both. Pairs outside the matrix's stored
band are filled with NaN (or 0 with --fill zero) and reported.
The needed blocks are fetched concurrently from cloud storage; tune with --fetch-workers N
(default 4) and --block-cache N (decoded-block LRU, default 4). For unmatched z-file variants,
--on-missing {warn,error,drop} controls the behavior. A ~10K-variant (3 Mb) region exports to
.bcor in a few seconds.
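To make the block fetching more concrete, here is a minimal sketch of how an index range maps onto the square blocks of a block-partitioned matrix. It is an illustration only, not ldcov's actual code: the function name blocks_for_range is hypothetical, the range is assumed to be half-open (start inclusive, stop exclusive), and the block size of 4096 is Hail's default, which a given store may override.

```python
def blocks_for_range(start, stop, block_size=4096):
    """Return the (block_row, block_col) pairs covering rows/cols [start, stop).

    Hypothetical helper: assumes a half-open index range and square blocks.
    """
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    first = start // block_size        # block holding the first index
    last = (stop - 1) // block_size    # block holding the last index
    return [(i, j)
            for i in range(first, last + 1)
            for j in range(first, last + 1)]


# A 500-variant slice that sits inside one block needs a single block read.
small = blocks_for_range(5000, 5500)

# A slice that crosses two block boundaries touches a 3 x 3 grid of blocks.
large = blocks_for_range(4000, 8200)
```

Under these assumptions the number of blocks grows with the square of the number of block boundaries crossed, which is why concurrent fetching and a decoded-block cache help for wider regions.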
Pan-UKB LD on AWS S3
The Pan-UKB LD matrices are public Hail BlockMatrices on S3
in the same format. Reading s3:// requires the S3 extra; pair it with a pre-computed Pan-UKB variant
index (above) and extract as usual:
pip install "ldcov[s3]"
# Extract (s3:// is read anonymously by default)
ldcov --ld-bm \
--bm s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.bm \
--variant-index panukb.EUR.b37.variant_index.parquet \
--region 1:55000000-55100000 \
--out panukb_eur
Public-bucket reads are anonymous by default. To use credentials, a custom endpoint, or
requester-pays, pass --storage-options as a JSON dict, e.g.
--storage-options '{}' to force the normal AWS credential chain, or
--storage-options '{"key": "AKIA...", "secret": "..."}'.
Python API (BlockMatrix)
from ldcov.ld_bm import extract_ld
matrix, variants = extract_ld(
    bm_path="gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm",
    variant_index_path="gnomad_v2.nfe.b37.variant_index.parquet",
    region="1:55000000-55100000",
    out="region",
)
Covariate File Format
Covariates should be provided as a text file with:
- A sample-ID column named IID by default (must match the BGEN sample IDs). Use a different column name via --covariate-id-col.
- Additional columns: Covariate values (numeric or categorical)
- Header row with column names
Example:
IID PC1 PC2 batch
SAMPLE1 0.032 -0.011 A
SAMPLE2 -0.021 0.043 B
Supported formats: CSV, TSV, or whitespace-delimited text files. Can be loaded from local filesystem or Google Cloud Storage (gs://).
Z-file Format
Z-files specify variants to include and their order:
rsid chromosome position allele1 allele2
rs123456 1 1000000 A G
rs789012 1 1000100 C T
Allele convention: allele1 = ref, allele2 = alt. This applies throughout ldcov, including
--ld-bm extraction, where z-file variants are matched to the LD matrix by locus and alleles and
swapped alleles are sign-flipped.
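The sign-flip rule can be illustrated with a small sketch. This is not ldcov's implementation: match_alleles is a hypothetical helper, the variant tuples and the 3 x 3 matrix are toy data, and unmatched variants are simply skipped here. The idea is that a correlation changes sign when exactly one of the two variants has its alleles swapped, so the extracted submatrix is multiplied by the outer product of a ±1 sign vector.

```python
import numpy as np


def match_alleles(z_variants, matrix_variants):
    """Match z-file variants to matrix rows by locus and alleles.

    Both inputs are lists of (chromosome, position, allele1, allele2) tuples.
    Returns matrix row indices in z-file order and a +1/-1 sign per variant
    (-1 when the alleles are swapped relative to the matrix).
    """
    lookup = {variant: row for row, variant in enumerate(matrix_variants)}
    rows, signs = [], []
    for chrom, pos, a1, a2 in z_variants:
        if (chrom, pos, a1, a2) in lookup:
            rows.append(lookup[(chrom, pos, a1, a2)])
            signs.append(1.0)
        elif (chrom, pos, a2, a1) in lookup:
            rows.append(lookup[(chrom, pos, a2, a1)])
            signs.append(-1.0)
        # Variants found in neither orientation are skipped in this sketch.
    return np.array(rows, dtype=int), np.array(signs)


# Toy LD matrix and its variants (hypothetical data).
matrix_variants = [("1", 1000000, "A", "G"),
                   ("1", 1000100, "C", "T"),
                   ("1", 1000200, "G", "A")]
ld = np.array([[1.0, 0.5, -0.2],
               [0.5, 1.0, 0.3],
               [-0.2, 0.3, 1.0]])

# The first z-file variant lists its alleles in the opposite order.
z_variants = [("1", 1000100, "T", "C"), ("1", 1000000, "A", "G")]

rows, signs = match_alleles(z_variants, matrix_variants)
# Reorder to z-file order, then flip the sign of pairs with one swapped variant.
sub = ld[np.ix_(rows, rows)] * np.outer(signs, signs)
```

The diagonal stays at 1 because each sign is multiplied by itself, while the off-diagonal entry between the swapped and unswapped variant changes sign.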
Technical Details
Genotype Standardization
Genotypes are standardized using L2 normalization:
- Center by subtracting the mean
- Scale by dividing by the L2 norm
This ensures that the dot product of standardized genotypes equals the Pearson correlation coefficient.
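A short NumPy sketch shows why this works. It is an illustration rather than ldcov's code: the standardize function is written here for the example, the samples-by-variants layout is an assumption, and the dosages are simulated.

```python
import numpy as np


def standardize(genotypes):
    """Center each variant (column) and scale it to unit L2 norm."""
    g = np.asarray(genotypes, dtype=float)
    means = g.mean(axis=0)
    centered = g - means
    norms = np.linalg.norm(centered, axis=0)  # one L2 norm per variant
    return centered / norms, means, norms


rng = np.random.default_rng(0)
# Simulated dosages: 200 samples (rows) x 5 variants (columns).
dosages = rng.binomial(2, 0.3, size=(200, 5)).astype(float)

z, means, norms = standardize(dosages)
# With unit-norm, centered columns, the Gram matrix is the correlation matrix.
ld = z.T @ z
```

Here ld agrees with np.corrcoef(dosages, rowvar=False) up to floating-point error. A monomorphic variant would have a zero norm and needs to be filtered or handled separately, which this sketch does not do.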
Covariate Adjustment
The package uses Frisch-Waugh-Lovell (FWL) projection to remove covariate effects:
- Standardize genotypes
- Compute QR decomposition of the covariate matrix (with intercept)
- Project out covariates using the orthogonal projection matrix Q
- The residuals represent genotypes adjusted for covariate effects
For efficiency, the QR decomposition can be pre-computed once and reused across multiple analyses, as the projection matrix Q depends only on the covariates, not the genotypes.
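The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not ldcov's implementation: the function names and simulated data are made up for the example, genotypes are laid out as samples by variants, and the final rescaling of residuals to unit norm (so that the Gram matrix is a correlation matrix again) is an assumption that the description above does not spell out.

```python
import numpy as np


def standardize(g):
    """Center each variant (column) and scale it to unit L2 norm."""
    centered = g - g.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=0)


def project_out(z, covariates):
    """Residualize columns of z on the covariates plus an intercept."""
    n_samples = z.shape[0]
    design = np.column_stack([np.ones(n_samples), covariates])
    # Reduced QR: q has orthonormal columns spanning the covariate space.
    q, _ = np.linalg.qr(design)
    # Subtract the component of each variant that lies in that space.
    return z - q @ (q.T @ z), q


rng = np.random.default_rng(1)
covariates = rng.normal(size=(300, 3))
# Simulated genotypes confounded by the first covariate.
genotypes = rng.binomial(2, 0.4, size=(300, 4)).astype(float) + covariates[:, [0]]

z = standardize(genotypes)
resid, q = project_out(z, covariates)

# Assumption: rescale residuals to unit norm before taking dot products.
resid_unit = resid / np.linalg.norm(resid, axis=0)
ld_adj = resid_unit.T @ resid_unit
```

Because q is built from the covariates alone, it can be computed once and applied to any number of genotype matrices that share the same samples in the same order, which is the motivation for reusing it across chromosomes.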
Dependencies
Installed automatically with the package:
- lazybgen >= 0.1 (BGEN reader; prebuilt binary wheel)
- numpy >= 1.19.0
- pandas >= 1.0.0
- gcsfs >= 0.7.0
- fsspec >= 2021.0.0
- lz4 >= 3.1.0
- pyarrow >= 6.0.0
Optional extra ldcov[s3] adds s3fs for reading BlockMatrix LD from AWS S3 (e.g. Pan-UKB).
License
MIT License
Citation
Kanai, M. et al. Population-scale multiome immune cell atlas reveals complex disease drivers. medRxiv (2025)
Contact
Masahiro Kanai (mkanai@broadinstitute.org)