Fast chain-based liftover for pandas DataFrames

sumstats-liftover

Fast chain-based liftover for pandas DataFrames using UCSC chain files.

A standalone, vectorized implementation that converts genomic coordinates in pandas DataFrames from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.

Features

  • Fast and vectorized: Optimized for large datasets with efficient numpy-based operations
  • Standalone: No external dependencies on UCSC tools or other liftover libraries
  • Flexible: Supports custom column names and coordinate systems (0-based or 1-based)
  • Robust: Handles chromosome name normalization, special chromosomes, and unmapped variants
  • Easy to use: Simple pandas DataFrame interface

Installation

pip install sumstats-liftover

Or install from source:

git clone https://github.com/yourusername/sumstats-liftover.git
cd sumstats-liftover
pip install -e .

Requirements

  • Python >= 3.8
  • numpy >= 1.20.0
  • pandas >= 1.3.0

Quick Start

import pandas as pd
from sumstats_liftover import liftover_df

# Create a dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})

# Perform liftover from hg19 to hg38
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

print(result)

Usage

Basic Usage

import pandas as pd
from sumstats_liftover import liftover_df

# Your dataframe with genomic coordinates
df = pd.DataFrame({
    'SNPID': ['1:725932_G_A', '1:725933_A_G', '1:737801_T_C'],
    'CHR': [1, 1, 1],
    'POS': [725932, 725933, 737801],  # hg19 positions
})

# Lift over coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

# Result includes original columns plus:
# - CHR_LIFT: Lifted chromosome
# - POS_LIFT: Lifted position
# - STRAND_LIFT: Strand information ("+" or "-")

Custom Column Names

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38",
    out_strand_col="STRAND_hg38"
)

Handling Unmapped Variants

By default, unmapped variants are kept with POS_LIFT = -1 and CHR_LIFT = None. To remove them:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    remove_unmapped=True
)
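If you keep unmapped variants, you can separate them afterwards with plain pandas. The sketch below does not call the library: it uses a hand-built frame that imitates the documented defaults (POS_LIFT = -1 and CHR_LIFT = None for unmapped rows), and the lifted values in it are placeholders, not real liftover results.

```python
import pandas as pd

# Hand-built stand-in for a liftover_df() result with default column names.
# The lifted values are placeholders for illustration only.
result = pd.DataFrame({
    "CHR": [1, 1, 2],
    "POS": [725932, 725933, 100000],
    "CHR_LIFT": [1, None, 2],       # None marks an unmapped variant
    "POS_LIFT": [1000, -1, 2000],   # -1 marks an unmapped variant
})

# A row counts as unmapped if either documented sentinel is present.
unmapped_mask = result["POS_LIFT"].eq(-1) | result["CHR_LIFT"].isna()

mapped = result.loc[~unmapped_mask]
unmapped = result.loc[unmapped_mask]

print(f"{len(unmapped)} of {len(result)} variants failed to map")
```

Keeping the unmapped rows in a separate frame makes it easy to log or inspect them before they are dropped.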

Coordinate Systems

The library supports both 0-based (BED format) and 1-based (GWAS standard) coordinates:

# For 0-based input and output coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    one_based_input=False,
    one_based_output=False
)
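To make the two conventions concrete: a variant at 1-based position p occupies the 0-based, half-open interval [p - 1, p), which is how BED files and UCSC chain files count. The helpers below are a small illustration with hypothetical names; they are not part of sumstats-liftover.

```python
from typing import Tuple


def one_to_zero_based(pos: int) -> int:
    """Convert a 1-based position to a 0-based position."""
    return pos - 1


def zero_to_one_based(pos: int) -> int:
    """Convert a 0-based position to a 1-based position."""
    return pos + 1


def snp_to_bed_interval(pos_1based: int) -> Tuple[int, int]:
    """Return the 0-based, half-open (start, end) interval of a single base."""
    start = one_to_zero_based(pos_1based)
    return (start, start + 1)


print(snp_to_bed_interval(725932))  # (725931, 725932)
```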

Chain Files

UCSC chain files can be downloaded from the UCSC Genome Browser. Common chain files include:

  • hg19ToHg38.over.chain.gz - Convert from hg19 to hg38
  • hg38ToHg19.over.chain.gz - Convert from hg38 to hg19
  • hg18ToHg19.over.chain.gz - Convert from hg18 to hg19
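The file names above follow the UCSC naming pattern, and the download server groups them by source assembly under goldenPath/<source>/liftOver/. A minimal sketch for building the URL and fetching a file with the standard library follows; the function names are hypothetical helpers, not part of this package.

```python
import os
import urllib.request


def ucsc_chain_url(source: str, target: str) -> str:
    """Build the UCSC download URL for a <source>To<Target> chain file."""
    # UCSC capitalizes the first letter of the target assembly in the file name.
    name = f"{source}To{target[0].upper()}{target[1:]}.over.chain.gz"
    return f"https://hgdownload.soe.ucsc.edu/goldenPath/{source}/liftOver/{name}"


def download_chain(source: str, target: str, dest_dir: str = ".") -> str:
    """Download a chain file into dest_dir and return the local path."""
    url = ucsc_chain_url(source, target)
    dest = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, dest)  # requires network access
    return dest


print(ucsc_chain_url("hg19", "hg38"))
```

The returned local path can then be passed as chain_path.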

API Reference

liftover_df()

Main function for lifting over genomic coordinates in a pandas DataFrame.

Parameters:

  • df (pd.DataFrame): DataFrame containing genomic coordinates
  • chain_path (str): Path to UCSC chain file (.chain or .chain.gz)
  • chrom_col (str, default="CHR"): Column name for chromosome
  • pos_col (str, default="POS"): Column name for position
  • out_chrom_col (str, default="CHR_LIFT"): Output column name for lifted chromosome
  • out_pos_col (str, default="POS_LIFT"): Output column name for lifted position
  • out_strand_col (str, default="STRAND_LIFT"): Output column name for lifted strand
  • one_based_input (bool, default=True): Whether input positions are 1-based
  • one_based_output (bool, default=True): Whether output positions should be 1-based
  • remove_unmapped (bool, default=False): Remove variants that fail to map
  • convert_special_chromosomes (bool, default=True): Convert X→23, Y→24, M/MT→25

Returns:

  • pd.DataFrame: DataFrame with lifted coordinates added as new columns

Example:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

Chromosome Name Handling

The library automatically handles various chromosome name formats:

  • Input formats: 1, chr1, X, chrX, 23 (for X), 24 (for Y), 25 (for M/MT)
  • Output format: By default, special chromosomes are converted to numeric values:
    • X → 23
    • Y → 24
    • M/MT → 25

To keep special chromosomes as strings:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    convert_special_chromosomes=False
)
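The normalization rules listed above can be pictured roughly as follows. This is an illustrative sketch under the stated mapping (X → 23, Y → 24, M/MT → 25), not the library's actual implementation; normalize_chrom is a hypothetical name.

```python
# Mapping taken from the list above.
SPECIAL_TO_NUMERIC = {"X": 23, "Y": 24, "M": 25, "MT": 25}


def normalize_chrom(value, convert_special=True):
    """Strip an optional 'chr' prefix and optionally map X/Y/M/MT to numbers."""
    name = str(value).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    if name.isdigit():
        # Autosomes, and already-numeric special chromosomes such as 23.
        return int(name)
    if convert_special and name in SPECIAL_TO_NUMERIC:
        return SPECIAL_TO_NUMERIC[name]
    # Left as a string when conversion is disabled or the name is unknown.
    return name


for raw in [1, "chr1", "X", "chrX", "MT", 23]:
    print(raw, "->", normalize_chrom(raw))
```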

Testing

Run the test suite:

pytest test_liftover_df.py -v

Example

See example.py for a complete example demonstrating liftover with a real dataset:

python example.py

How It Works

Building Disjoint Intervals

UCSC chain files often contain overlapping segments from different alignment chains. To enable fast and unambiguous coordinate lookup, this library builds a disjoint interval cover that selects the highest-scoring segment at each position.

The Problem:

  • Chain files contain multiple segments that can overlap at the same genomic positions
  • Each position needs to map to exactly one target coordinate
  • We need to choose which segment to use when overlaps occur

The Solution: The library uses a sweep-line algorithm to build non-overlapping (disjoint) intervals:

  1. Parse segments: Extract all alignment segments from the chain file
  2. Build disjoint cover: For overlapping regions, select the segment with the highest score
  3. Create index: Build a sorted array of disjoint intervals for O(log n) lookup
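As a rough illustration of step 1, the sketch below parses chain text into ungapped segments. It follows the published UCSC chain format (a header line beginning with "chain", then "size dt dq" lines, ending with a lone "size"), where the "t" fields describe the assembly being lifted from. The Segment type and parse_chain function are hypothetical names, not this library's internals, and query coordinates are left on the strand given in the header.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    score: int
    t_name: str
    t_start: int   # 0-based, inclusive
    t_end: int     # 0-based, exclusive
    q_name: str
    q_strand: str
    q_start: int   # 0-based, on q_strand


def parse_chain(lines: Iterable[str]) -> List[Segment]:
    """Collect every ungapped block, tagged with its chain's score."""
    segments: List[Segment] = []
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == "chain":
            # chain score tName tSize tStrand tStart tEnd
            #       qName qSize qStrand qStart qEnd id
            score = int(fields[1])
            t_name, t_pos = fields[2], int(fields[5])
            q_name, q_strand, q_pos = fields[7], fields[9], int(fields[10])
            continue
        size = int(fields[0])
        segments.append(
            Segment(score, t_name, t_pos, t_pos + size, q_name, q_strand, q_pos)
        )
        if len(fields) == 3:
            # Skip the gap that follows this block on each side.
            t_pos += size + int(fields[1])
            q_pos += size + int(fields[2])
    return segments


# Tiny made-up chain for illustration.
sample = """\
chain 1000 chr1 1000 + 100 200 chr1 1200 + 150 252 1
40 10 12
50
"""
for seg in parse_chain(sample.splitlines()):
    print(seg)
```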

Example: If we have overlapping segments:

  • Segment A: [100, 200) with score 1000
  • Segment B: [150, 250) with score 2000
  • Segment C: [300, 400) with score 500

The disjoint cover becomes:

  • [100, 150) → Segment A (only A covers this region)
  • [150, 250) → Segment B (B has higher score than A in overlap)
  • [300, 400) → Segment C (no overlap)

This ensures each position maps to exactly one target coordinate, enabling fast binary search lookup.
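The example above can be reproduced with a short sweep over segment boundaries. This is a minimal sketch of one way to build such a cover (a max-heap keyed on score plus binary search for lookup); it is not necessarily how the library implements it, and the function names are hypothetical.

```python
import bisect
import heapq
from typing import List, Optional, Tuple

# (start, end, score, label), half-open intervals
ScoredSegment = Tuple[int, int, int, str]


def build_disjoint_cover(segments: List[ScoredSegment]) -> List[Tuple[int, int, str]]:
    """Return sorted, non-overlapping (start, end, label) intervals,
    keeping the highest-scoring segment wherever segments overlap."""
    points = sorted({p for s, e, _, _ in segments for p in (s, e)})
    by_start = sorted(segments)
    heap: list = []  # entries are (-score, end, label); top is the best score
    cover: List[Tuple[int, int, str]] = []
    i = 0
    for left, right in zip(points, points[1:]):
        # Activate segments that start at or before this elementary interval.
        while i < len(by_start) and by_start[i][0] <= left:
            s, e, score, label = by_start[i]
            heapq.heappush(heap, (-score, e, label))
            i += 1
        # Discard best candidates that have already ended.
        while heap and heap[0][1] <= left:
            heapq.heappop(heap)
        if not heap:
            continue  # nothing covers [left, right)
        label = heap[0][2]
        if cover and cover[-1][2] == label and cover[-1][1] == left:
            cover[-1] = (cover[-1][0], right, label)  # extend the previous piece
        else:
            cover.append((left, right, label))
    return cover


def lookup(cover, starts: List[int], pos: int) -> Optional[str]:
    """Binary search for the interval containing pos, or None."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and pos < cover[i][1]:
        return cover[i][2]
    return None


cover = build_disjoint_cover([
    (100, 200, 1000, "A"),
    (150, 250, 2000, "B"),
    (300, 400, 500, "C"),
])
starts = [s for s, _, _ in cover]
print(cover)  # [(100, 150, 'A'), (150, 250, 'B'), (300, 400, 'C')]
print(lookup(cover, starts, 175))  # B
```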

Performance

This implementation is optimized for large datasets: coordinate conversion uses vectorized numpy operations, and the disjoint interval index gives O(log n) lookup per position, where n is the number of disjoint intervals searched. For batch processing of large DataFrames this is typically faster than the original UCSC liftOver tool.
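To show what a vectorized lookup over disjoint intervals can look like, here is a small sketch using np.searchsorted. It assumes forward-strand intervals on one chromosome, each with a constant offset (target start minus source start); the arrays and the lift_positions name are made up for illustration and do not reflect the library's internal data layout.

```python
import numpy as np

# Disjoint, sorted intervals on one chromosome (half-open), with made-up offsets.
starts = np.array([100, 150, 300])
ends = np.array([150, 250, 400])
offsets = np.array([1000, 5000, -50])  # target_start - source_start


def lift_positions(pos, starts, ends, offsets):
    """Lift an array of 0-based positions; unmapped positions become -1."""
    # Index of the last interval whose start is <= pos (or -1 if none).
    idx = np.searchsorted(starts, pos, side="right") - 1
    safe = np.clip(idx, 0, len(starts) - 1)  # avoid indexing with -1
    hit = (idx >= 0) & (pos < ends[safe])
    return np.where(hit, pos + offsets[safe], -1)


positions = np.array([120, 175, 260, 300, 99])
print(lift_positions(positions, starts, ends, offsets))
# [1120 5175   -1  250   -1]
```

A single searchsorted call handles the whole array, so the per-position Python overhead disappears.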

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use this library in your research, please cite:

@software{sumstats-liftover,
  title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
  author = {Yunye He},
  year = {2025},
  url = {https://github.com/yourusername/sumstats-liftover}
}
