Fast chain-based liftover for pandas DataFrames

sumstats-liftover

Fast chain-based liftover for pandas DataFrames using UCSC chain files.

A standalone, vectorized implementation for lifting over genomic coordinates in pandas DataFrames. This library provides a fast and efficient way to convert genomic coordinates from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.

Features

  • Fast and vectorized: Optimized for large datasets with efficient numpy-based operations
  • Standalone: No external dependencies on UCSC tools or other liftover libraries
  • Flexible: Supports custom column names and coordinate systems (0-based or 1-based)
  • Robust: Handles chromosome name normalization, special chromosomes, and unmapped variants
  • Easy to use: Simple pandas DataFrame interface

Installation

pip install sumstats-liftover

Or install from source:

git clone https://github.com/yourusername/sumstats-liftover.git
cd sumstats-liftover
pip install -e .

Requirements

  • Python >= 3.8
  • numpy >= 1.20.0
  • pandas >= 1.3.0

Quick Start

import pandas as pd
from sumstats_liftover import liftover_df

# Create a dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})

# Perform liftover from hg19 to hg38
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

print(result)

Usage

Basic Usage

import pandas as pd
from sumstats_liftover import liftover_df

# Your dataframe with genomic coordinates
df = pd.DataFrame({
    'SNPID': ['1:725932_G_A', '1:725933_A_G', '1:737801_T_C'],
    'CHR': [1, 1, 1],
    'POS': [725932, 725933, 737801],  # hg19 positions
})

# Lift over coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

# Result includes original columns plus:
# - CHR_LIFT: Lifted chromosome
# - POS_LIFT: Lifted position
# - STRAND_LIFT: Strand information ("+" or "-")

Custom Column Names

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38",
    out_strand_col="STRAND_hg38"
)

Handling Unmapped Variants

By default, unmapped variants are kept with POS_LIFT = -1 and CHR_LIFT = None. To remove them:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    remove_unmapped=True
)
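If you keep unmapped variants, you can split them off afterwards with ordinary pandas filtering. The sketch below does not call the library: it uses a hand-built stand-in for a result frame, and every value in it is a placeholder rather than a real liftover result. It assumes only the default column name POS_LIFT and the -1 sentinel described above.

```python
import pandas as pd

# Hand-built stand-in for a liftover_df() result (placeholder values only).
result = pd.DataFrame({
    "CHR": [1, 1, 2],
    "POS": [1000, 2000, 3000],
    "POS_LIFT": [1500, -1, 3500],  # -1 marks a variant that failed to map
})

# Boolean mask of rows that failed to map.
unmapped_mask = result["POS_LIFT"] == -1

mapped = result.loc[~unmapped_mask].reset_index(drop=True)
unmapped = result.loc[unmapped_mask]
n_unmapped = int(unmapped_mask.sum())

print(f"{n_unmapped} of {len(result)} variants did not map")
```

Keeping the unmapped rows in a separate frame makes it easy to report how many variants were lost before downstream analysis.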

Coordinate Systems

The library supports both 0-based (BED format) and 1-based (GWAS standard) coordinates:

# For 0-based input and output coordinates (e.g., BED-style positions)
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    one_based_input=False,
    one_based_output=False
)
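The two conventions differ by exactly one: a 1-based position p names the same base as the 0-based position p - 1. The helpers below are hypothetical (they are not part of the library's API) and only illustrate the arithmetic that flags such as one_based_input and one_based_output imply. They assume lookups are done on 0-based positions, which matches the 0-based, half-open coordinates used in UCSC chain files.

```python
import numpy as np

def to_zero_based(pos, one_based_input=True):
    """Hypothetical helper: shift 1-based positions to 0-based before lookup."""
    pos = np.asarray(pos, dtype=np.int64)
    return pos - 1 if one_based_input else pos

def from_zero_based(pos0, one_based_output=True):
    """Hypothetical helper: shift 0-based positions back to 1-based if requested."""
    pos0 = np.asarray(pos0, dtype=np.int64)
    return pos0 + 1 if one_based_output else pos0

gwas_pos = [725932, 725933]          # 1-based positions, as in the examples above
zero_based = to_zero_based(gwas_pos)  # the same bases in 0-based coordinates
round_trip = from_zero_based(zero_based)
```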

Chain Files

UCSC chain files can be downloaded from the UCSC Genome Browser. Common chain files include:

  • hg19ToHg38.over.chain.gz - Convert from hg19 to hg38
  • hg38ToHg19.over.chain.gz - Convert from hg38 to hg19
  • hg18ToHg19.over.chain.gz - Convert from hg18 to hg19
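For orientation, a chain file is a sequence of blocks. Each block starts with a header line (chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id) followed by lines of "size dt dq" that describe ungapped aligned blocks and the gaps between them; the last line of a block holds only a size. The parser below is a minimal sketch of that format, not the library's own reader. The function name parse_chain_segments and the toy chain text are made up for illustration.

```python
import io

def parse_chain_segments(handle):
    """Return one tuple per ungapped block:
    (score, src_chrom, src_start, src_end, dst_chrom, dst_start, dst_strand).
    Coordinates are 0-based, half-open. For "-" strand chains the destination
    coordinates are left relative to the reverse strand, as stored in the file.
    """
    segments = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # blank line between chains
        if fields[0] == "chain":
            score = int(fields[1])
            src_chrom, src_pos = fields[2], int(fields[5])
            dst_chrom, dst_strand, dst_pos = fields[7], fields[9], int(fields[10])
            continue
        size = int(fields[0])
        segments.append(
            (score, src_chrom, src_pos, src_pos + size, dst_chrom, dst_pos, dst_strand)
        )
        src_pos += size
        dst_pos += size
        if len(fields) == 3:  # gap sizes before the next block
            src_pos += int(fields[1])
            dst_pos += int(fields[2])
    return segments

# Toy chain text (not taken from a real chain file).
example = io.StringIO(
    "chain 1000 chr1 1000 + 100 160 chr1 1200 + 300 365 1\n"
    "20 10 15\n"
    "30\n"
)
segments = parse_chain_segments(example)
for seg in segments:
    print(seg)
```

A real file could be read the same way by passing a handle from gzip.open(path, "rt") instead of the in-memory example.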

API Reference

liftover_df()

Main function for lifting over genomic coordinates in a pandas DataFrame.

Parameters:

  • df (pd.DataFrame): DataFrame containing genomic coordinates
  • chain_path (str): Path to UCSC chain file (.chain or .chain.gz)
  • chrom_col (str, default="CHR"): Column name for chromosome
  • pos_col (str, default="POS"): Column name for position
  • out_chrom_col (str, default="CHR_LIFT"): Output column name for lifted chromosome
  • out_pos_col (str, default="POS_LIFT"): Output column name for lifted position
  • out_strand_col (str, default="STRAND_LIFT"): Output column name for lifted strand
  • one_based_input (bool, default=True): Whether input positions are 1-based
  • one_based_output (bool, default=True): Whether output positions should be 1-based
  • remove_unmapped (bool, default=False): Remove variants that fail to map
  • convert_special_chromosomes (bool, default=True): Convert X→23, Y→24, M/MT→25

Returns:

  • pd.DataFrame: DataFrame with lifted coordinates added as new columns

Example:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

Chromosome Name Handling

The library automatically handles various chromosome name formats:

  • Input formats: 1, chr1, X, chrX, 23 (for X), 24 (for Y), 25 (for M/MT)
  • Output format: By default, special chromosomes are converted to numeric values:
    • X → 23
    • Y → 24
    • M/MT → 25

To keep special chromosomes as strings:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    convert_special_chromosomes=False
)
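The normalization rules listed above can be pictured with a small function. This is an illustrative sketch only: normalize_chrom and SPECIAL_TO_NUMERIC are hypothetical names, and the library's internal handling may differ in detail.

```python
# Mapping taken from the list above: X -> 23, Y -> 24, M/MT -> 25.
SPECIAL_TO_NUMERIC = {"X": 23, "Y": 24, "M": 25, "MT": 25}

def normalize_chrom(value, convert_special=True):
    """Hypothetical helper: strip an optional 'chr' prefix and unify the label."""
    label = str(value).strip()
    if label[:3].lower() == "chr":
        label = label[3:]
    label = label.upper()
    if label.isdigit():
        return int(label)  # 1..22, or already-numeric 23/24/25
    if convert_special and label in SPECIAL_TO_NUMERIC:
        return SPECIAL_TO_NUMERIC[label]
    return label  # keep X / Y / M / MT (or unknown names) as strings

for raw in [1, "chr1", "X", "chrX", "MT", 23]:
    print(raw, "->", normalize_chrom(raw))
```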

Testing

Run the test suite:

pytest test_liftover_df.py -v

Example

See example.py for a complete example demonstrating liftover with a real dataset:

python example.py

How It Works

Building Disjoint Intervals

UCSC chain files often contain overlapping segments from different alignment chains. To enable fast and unambiguous coordinate lookup, this library builds a disjoint interval cover that selects the highest-scoring segment at each position.

The Problem:

  • Chain files contain multiple segments that can overlap at the same genomic positions
  • Each position needs to map to exactly one target coordinate
  • We need to choose which segment to use when overlaps occur

The Solution: The library uses a sweep-line algorithm to build non-overlapping (disjoint) intervals:

  1. Parse segments: Extract all alignment segments from the chain file
  2. Build disjoint cover: For overlapping regions, select the segment with the highest score
  3. Create index: Build a sorted array of disjoint intervals for O(log n) lookup

Example: If we have overlapping segments:

  • Segment A: [100, 200) with score 1000
  • Segment B: [150, 250) with score 2000
  • Segment C: [300, 400) with score 500

The disjoint cover becomes:

  • [100, 150) → Segment A (only A covers this region)
  • [150, 250) → Segment B (B has higher score than A in overlap)
  • [300, 400) → Segment C (no overlap)

This ensures each position maps to exactly one target coordinate, enabling fast binary search lookup.
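The example above can be reproduced in a few lines. The sketch below is a deliberately simple, quadratic version of the idea (it scans the elementary intervals between consecutive segment boundaries) rather than the library's sweep-line code, which is not shown in this document. The names build_disjoint_cover and lookup are hypothetical.

```python
import bisect

def build_disjoint_cover(segments):
    """segments: list of (start, end, score, label), half-open intervals.
    Returns a sorted list of non-overlapping (start, end, label) tuples in
    which every position is assigned to its highest-scoring segment.
    """
    bounds = sorted({b for start, end, _, _ in segments for b in (start, end)})
    cover = []
    for lo, hi in zip(bounds, bounds[1:]):
        covering = [seg for seg in segments if seg[0] <= lo and hi <= seg[1]]
        if not covering:
            continue  # gap that no segment covers
        best = max(covering, key=lambda seg: seg[2])
        if cover and cover[-1][2] == best[3] and cover[-1][1] == lo:
            cover[-1] = (cover[-1][0], hi, best[3])  # extend the previous piece
        else:
            cover.append((lo, hi, best[3]))
    return cover

segments = [
    (100, 200, 1000, "A"),
    (150, 250, 2000, "B"),
    (300, 400, 500, "C"),
]
cover = build_disjoint_cover(segments)
starts = [start for start, _, _ in cover]

def lookup(pos):
    """Binary search for the interval containing pos; None if unmapped."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and pos < cover[i][1]:
        return cover[i][2]
    return None

print(cover)
print(lookup(120), lookup(175), lookup(275))
```

Because the intervals are disjoint and sorted by start, a single bisect call is enough to decide which segment, if any, owns a position. A heap-based sweep over sorted boundaries would avoid the quadratic scan for large chain files.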

Performance

This implementation is optimized for large datasets and uses vectorized numpy operations for coordinate conversion. The disjoint interval index lets each position be located by binary search in O(log n) time, where n is the number of disjoint intervals, which typically makes it faster than the original UCSC liftOver tool for batch processing of large DataFrames.
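To show what a vectorized lookup over a disjoint interval index can look like, here is a minimal numpy sketch. It is not the library's code: the arrays are toy values, the per-interval offsets are hypothetical, and only plus-strand mapping is handled. Unmapped positions get -1, matching the POS_LIFT convention described earlier.

```python
import numpy as np

# Toy disjoint interval index, sorted by start (half-open intervals).
starts = np.array([100, 150, 300])
ends = np.array([150, 250, 400])
# Hypothetical per-interval shift: target_start - source_start.
offsets = np.array([1000, 2000, -50])

positions = np.array([120, 175, 275, 399, 50])

# One binary search per position, done for the whole array at once.
idx = np.searchsorted(starts, positions, side="right") - 1
idx_safe = np.clip(idx, 0, None)  # avoid indexing with -1 for positions before the first start

# A position is mapped only if it falls inside the interval found for it.
hit = (idx >= 0) & (positions < ends[idx_safe])
lifted = np.where(hit, positions + offsets[idx_safe], -1)

print(lifted)
```

The whole batch is resolved without a Python-level loop, which is where most of the speedup over per-row lookups would come from.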

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use this library in your research, please cite:

@software{sumstats-liftover,
  title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/sumstats-liftover}
}
