Fast chain-based liftover for pandas DataFrames

sumstats-liftover

Fast chain-based liftover for pandas DataFrames using UCSC chain files.

A standalone, vectorized implementation that converts genomic coordinates in pandas DataFrames from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.

Features

Fast and vectorized: Optimized for large datasets with efficient numpy-based operations
Standalone: No external dependencies on UCSC tools or other liftover libraries
Flexible: Supports custom column names and coordinate systems (0-based or 1-based)
Robust: Handles chromosome name normalization, special chromosomes, and unmapped variants
Easy to use: Simple pandas DataFrame interface

Installation

pip install sumstats-liftover

Or install from source:

git clone https://github.com/yourusername/sumstats-liftover.git
cd sumstats-liftover
pip install -e .

Requirements

Python >= 3.8
numpy >= 1.20.0
pandas >= 1.3.0

Quick Start

import pandas as pd
from sumstats_liftover import liftover_df

# Create a dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})

# Perform liftover from hg19 to hg38
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

print(result)

Usage

Basic Usage

import pandas as pd
from sumstats_liftover import liftover_df

# Your dataframe with genomic coordinates
df = pd.DataFrame({
    'SNPID': ['1:725932_G_A', '1:725933_A_G', '1:737801_T_C'],
    'CHR': [1, 1, 1],
    'POS': [725932, 725933, 737801],  # hg19 positions
})

# Lift over coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

# Result includes original columns plus:
# - CHR_LIFT: Lifted chromosome
# - POS_LIFT: Lifted position
# - STRAND_LIFT: Strand information ("+" or "-")

Custom Column Names

# Assumes df stores chromosome and position in columns named "Chromosome" and "BP"
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38",
    out_strand_col="STRAND_hg38"
)

Handling Unmapped Variants

By default, unmapped variants are kept with POS_LIFT = -1 and CHR_LIFT = None. To remove them:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    remove_unmapped=True
)
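If unmapped rows are kept, they can also be separated afterwards with plain pandas. The sketch below assumes only the documented defaults (POS_LIFT = -1 and CHR_LIFT = None for unmapped rows); the result DataFrame is built by hand with placeholder values and is not real liftover output.

```python
import pandas as pd

# Hand-built stand-in for a liftover result (placeholder values, not real output).
# Unmapped rows follow the documented defaults: POS_LIFT = -1, CHR_LIFT = None.
result = pd.DataFrame({
    "CHR": [1, 1, 2],
    "POS": [725932, 725933, 100000],
    "CHR_LIFT": pd.Series([1, None, 2], dtype="object"),
    "POS_LIFT": [1000001, -1, 2000001],
})

# Boolean mask of rows that failed to map
unmapped = result["POS_LIFT"].eq(-1)

# Keep the failures for inspection and continue with the mapped rows
failed = result.loc[unmapped]
mapped = result.loc[~unmapped].reset_index(drop=True)

print(f"{int(unmapped.sum())} of {len(result)} variants unmapped")
```

Keeping the failed rows in a separate DataFrame makes it easy to report which variants were lost during liftover.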

Coordinate Systems

The library supports both 0-based (BED format) and 1-based (GWAS standard) coordinates:

# For 0-based input coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    one_based_input=False,
    one_based_output=False
)
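As a reminder of how the two conventions relate, a single base at 1-based position p is the BED interval [p - 1, p). The sketch below converts BED-style starts to 1-based positions with plain pandas; the column names chrom, start, and end are assumptions for illustration and are not required by the library.

```python
import pandas as pd

# BED-style single-base intervals: 0-based start, exclusive end (illustrative names)
bed = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [725931, 725932],
    "end": [725932, 725933],
})

# The 1-based position of a single-base interval is start + 1, which equals end
bed["POS"] = bed["start"] + 1
```

With a POS column like this, the default one_based_input=True applies; passing the start column directly would call for one_based_input=False.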

Chain Files

UCSC chain files can be downloaded from the UCSC Genome Browser. Common chain files include:

hg19ToHg38.over.chain.gz - Convert from hg19 to hg38
hg38ToHg19.over.chain.gz - Convert from hg38 to hg19
hg18ToHg19.over.chain.gz - Convert from hg18 to hg19

API Reference

`liftover_df()`

Main function for lifting over genomic coordinates in a pandas DataFrame.

Parameters:

df (pd.DataFrame): DataFrame containing genomic coordinates
chain_path (str): Path to UCSC chain file (.chain or .chain.gz)
chrom_col (str, default="CHR"): Column name for chromosome
pos_col (str, default="POS"): Column name for position
out_chrom_col (str, default="CHR_LIFT"): Output column name for lifted chromosome
out_pos_col (str, default="POS_LIFT"): Output column name for lifted position
out_strand_col (str, default="STRAND_LIFT"): Output column name for lifted strand
one_based_input (bool, default=True): Whether input positions are 1-based
one_based_output (bool, default=True): Whether output positions should be 1-based
remove_unmapped (bool, default=False): Remove variants that fail to map
convert_special_chromosomes (bool, default=True): Convert X→23, Y→24, M/MT→25

Returns:

pd.DataFrame: DataFrame with lifted coordinates added as new columns

Example:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)

Chromosome Name Handling

The library automatically handles various chromosome name formats:

Input formats: 1, chr1, X, chrX, 23 (for X), 24 (for Y), 25 (for M/MT)
Output format: By default, special chromosomes are converted to numeric values:
- X → 23
- Y → 24
- M/MT → 25

To keep special chromosomes as strings:

result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    convert_special_chromosomes=False
)
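The normalization described above can be pictured with a small helper. This is a hypothetical sketch, not the library's internal code: normalize_chrom and its handling of unrecognized names (returning None) are assumptions made for illustration.

```python
# Mapping of special chromosome names to the numeric codes listed above
SPECIAL_TO_NUM = {"X": 23, "Y": 24, "M": 25, "MT": 25}

def normalize_chrom(value):
    """Return a numeric chromosome code, or None if the name is not recognized.

    Hypothetical helper for illustration only.
    """
    name = str(value).strip()
    # Drop an optional "chr" prefix (case-insensitive)
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    if name in SPECIAL_TO_NUM:
        return SPECIAL_TO_NUM[name]
    if name.isdigit():
        return int(name)
    return None

for raw in [1, "chr1", "X", "chrX", 23, "chrMT", "chrUn_gl000220"]:
    print(raw, "->", normalize_chrom(raw))
```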

Testing

Run the test suite:

pytest test_liftover_df.py -v

Example

See example.py for a complete example demonstrating liftover with a real dataset:

python example.py

How It Works

Building Disjoint Intervals

UCSC chain files often contain overlapping segments from different alignment chains. To enable fast and unambiguous coordinate lookup, this library builds a disjoint interval cover that selects the highest-scoring segment at each position.

The Problem:

Chain files contain multiple segments that can overlap at the same genomic positions
Each position needs to map to exactly one target coordinate
We need to choose which segment to use when overlaps occur

The Solution: The library uses a sweep-line algorithm to build non-overlapping (disjoint) intervals:

Parse segments: Extract all alignment segments from the chain file
Build disjoint cover: For overlapping regions, select the segment with the highest score
Create index: Build a sorted array of disjoint intervals for O(log n) lookup
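The first step, parsing segments, can be sketched as follows. This is a minimal illustration based on the public UCSC chain format (a header line starting with "chain", followed by "size dt dq" lines and a final "size" line), not the library's own parser; parse_chain_segments and the tuple layout are assumptions. The demo chain is invented.

```python
import io

def parse_chain_segments(handle):
    """Collect ungapped blocks from a UCSC chain file.

    Each block is returned as
    (t_name, t_start, t_end, q_name, q_start, q_strand, q_size, score),
    with 0-based, half-open coordinates as stored in the chain file.
    Hypothetical helper for illustration only.
    """
    segments = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # a blank line separates chains
        if fields[0] == "chain":
            # chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
            score = int(fields[1])
            t_name, t_pos = fields[2], int(fields[5])
            q_name, q_size = fields[7], int(fields[8])
            q_strand, q_pos = fields[9], int(fields[10])
            continue
        size = int(fields[0])
        segments.append(
            (t_name, t_pos, t_pos + size, q_name, q_pos, q_strand, q_size, score)
        )
        if len(fields) == 3:
            # Skip the gaps that follow this block on each side
            t_pos += size + int(fields[1])
            q_pos += size + int(fields[2])
    return segments

# Invented two-block chain for demonstration
demo = """chain 1000 chr1 1000 + 100 250 chr1 2000 + 500 660 1
50 10 20
90

"""
segments = parse_chain_segments(io.StringIO(demo))
for seg in segments:
    print(seg)
```

For a real file, the same function could be fed a handle from gzip.open(path, "rt"). Note that when the query strand is "-", the query coordinates in the chain file refer to the reverse strand, so a full implementation has to convert them using the query size.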

Example: If we have overlapping segments:

Segment A: [100, 200) with score 1000
Segment B: [150, 250) with score 2000
Segment C: [300, 400) with score 500

The disjoint cover becomes:

[100, 150) → Segment A (only A covers this region)
[150, 250) → Segment B (B has higher score than A in overlap)
[300, 400) → Segment C (no overlap)

This ensures each position maps to exactly one target coordinate, enabling fast binary search lookup.
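The example above can be reproduced with a short sweep over the segment boundaries. This is a minimal sketch of one way to build such a cover with a max-heap keyed on score, not the library's actual code; disjoint_cover and the (start, end, score, label) tuple layout are assumptions for illustration.

```python
import heapq

def disjoint_cover(segments):
    """Build non-overlapping intervals, keeping the highest-scoring segment.

    segments: list of (start, end, score, label) with half-open intervals.
    Returns a list of (start, end, label) sorted by start.
    Hypothetical helper for illustration only.
    """
    # Every start and end is a boundary; between two neighbouring boundaries
    # the set of covering segments does not change.
    bounds = sorted({p for s, e, _, _ in segments for p in (s, e)})
    by_start = sorted(segments, key=lambda seg: seg[0])
    heap = []  # entries are (-score, end, label), so the best score is on top
    cover = []
    i = 0
    for left, right in zip(bounds, bounds[1:]):
        # Activate segments that have started by `left`
        while i < len(by_start) and by_start[i][0] <= left:
            _, end, score, label = by_start[i]
            heapq.heappush(heap, (-score, end, label))
            i += 1
        # Lazily discard top entries that ended at or before `left`
        while heap and heap[0][1] <= left:
            heapq.heappop(heap)
        if not heap:
            continue  # nothing covers [left, right)
        label = heap[0][2]
        if cover and cover[-1][2] == label and cover[-1][1] == left:
            # Extend the previous piece when the same segment still wins
            cover[-1] = (cover[-1][0], right, label)
        else:
            cover.append((left, right, label))
    return cover

cover = disjoint_cover([
    (100, 200, 1000, "A"),
    (150, 250, 2000, "B"),
    (300, 400, 500, "C"),
])
print(cover)  # [(100, 150, 'A'), (150, 250, 'B'), (300, 400, 'C')]
```

Expired segments are removed only when they reach the top of the heap, which keeps the sweep simple while still guaranteeing that the chosen segment covers the current piece.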

Performance

This implementation is optimized for large datasets and uses vectorized numpy operations for coordinate conversion. The disjoint interval index allows each position to be looked up in O(log n) time, where n is the number of disjoint intervals searched, which typically makes batch processing of large DataFrames faster than with the original UCSC liftOver tool.
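A vectorized binary-search lookup over disjoint intervals can be sketched with numpy.searchsorted. This is an illustration under stated assumptions, not the library's code: the arrays starts, ends, and target_starts hold invented values for a single chromosome on the "+" strand, and lookup is a hypothetical helper. Unmapped positions get -1, matching the documented POS_LIFT default.

```python
import numpy as np

# Disjoint source intervals sorted by start (0-based, half-open), with the
# target coordinate of each interval's first base. Invented values.
starts = np.array([100, 150, 300])
ends = np.array([150, 250, 400])
target_starts = np.array([1100, 5150, 9300])

def lookup(positions):
    """Map an array of source positions to target positions (-1 if unmapped)."""
    positions = np.asarray(positions)
    # Index of the last interval whose start is <= position (-1 if none)
    idx = np.searchsorted(starts, positions, side="right") - 1
    safe = np.clip(idx, 0, len(starts) - 1)
    # A hit requires a candidate interval that also ends after the position
    hit = (idx >= 0) & (positions < ends[safe])
    offset = positions - starts[safe]
    return np.where(hit, target_starts[safe] + offset, -1)

lifted = lookup([100, 149, 150, 260, 399, 50])
print(lifted.tolist())  # [1100, 1149, 5150, -1, 9399, -1]
```

All positions are resolved in one call, so the Python-level cost does not grow with the number of rows; a full implementation would also handle the "-" strand and per-chromosome indexes.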

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use this library in your research, please cite:

@software{sumstats-liftover,
  title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
  author = {Yunye He},
  year = {2025},
  url = {https://github.com/yourusername/sumstats-liftover}
}

Project details

Release history

1.1.0 - Dec 25, 2025
1.0.0 - Dec 23, 2025
0.2.0 - Dec 22, 2025 (this version)
0.1.0 - Dec 22, 2025

Download files

Source Distribution

sumstats_liftover-0.2.0.tar.gz (14.7 kB)

Uploaded Dec 22, 2025 Source

Built Distribution

sumstats_liftover-0.2.0-py3-none-any.whl (13.2 kB)

Uploaded Dec 22, 2025 Python 3

File details

Details for the file sumstats_liftover-0.2.0.tar.gz.

File metadata

Download URL: sumstats_liftover-0.2.0.tar.gz
Upload date: Dec 22, 2025
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.0

File hashes

Hashes for sumstats_liftover-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c98439d4ab73be4c28ceebcce9d32936da08505a74291e91496d978547454501`
MD5	`8990bc0d3fdc357c396a9fe200e4bb9d`
BLAKE2b-256	`90d686c86c969cab343d8613de562643aebb985e89cb9cd44bc0f8789bdbc0bb`

File details

Details for the file sumstats_liftover-0.2.0-py3-none-any.whl.

File metadata

Download URL: sumstats_liftover-0.2.0-py3-none-any.whl
Upload date: Dec 22, 2025
Size: 13.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.0

File hashes

Hashes for sumstats_liftover-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ee74475ef32bb9abc19f1e8207ed6dafc284bf59c1af2ce449f9f4ab9bbea1b`
MD5	`5d8fafb96f9635f642abfe1b49a33783`
BLAKE2b-256	`fb5c133b5da8a6b8127d4e2e4b51b0784a19d2147ee00640d3c2f3b02c004066`
