Fast chain-based liftover for pandas DataFrames
sumstats-liftover
Fast chain-based liftover for pandas DataFrames using UCSC chain files.
sumstats-liftover is a standalone, vectorized implementation of liftover for genomic coordinates stored in pandas DataFrames. It converts coordinates from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.
Features
- Fast and vectorized: Optimized for large datasets with efficient numpy-based operations
- Standalone: No external dependencies on UCSC tools or other liftover libraries
- Flexible: Supports custom column names and coordinate systems (0-based or 1-based)
- Robust: Handles chromosome name normalization, special chromosomes, and unmapped variants
- Easy to use: Simple pandas DataFrame interface
Installation
pip install sumstats-liftover
Or install from source:
git clone https://github.com/yourusername/sumstats-liftover.git
cd sumstats-liftover
pip install -e .
Requirements
- Python >= 3.8
- numpy >= 1.20.0
- pandas >= 1.3.0
Quick Start
import pandas as pd
from sumstats_liftover import liftover_df
# Create a dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})
# Perform liftover from hg19 to hg38
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)
print(result)
Usage
Basic Usage
import pandas as pd
from sumstats_liftover import liftover_df
# Your dataframe with genomic coordinates
df = pd.DataFrame({
    'SNPID': ['1:725932_G_A', '1:725933_A_G', '1:737801_T_C'],
    'CHR': [1, 1, 1],
    'POS': [725932, 725933, 737801],  # hg19 positions
})
# Lift over coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)
# Result includes original columns plus:
# - CHR_LIFT: Lifted chromosome
# - POS_LIFT: Lifted position
# - STRAND_LIFT: Strand information ("+" or "-")
Custom Column Names
# Assumes df stores chromosomes in "Chromosome" and positions in "BP"
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38",
    out_strand_col="STRAND_hg38"
)
Handling Unmapped Variants
By default, unmapped variants are kept with POS_LIFT = -1 and CHR_LIFT = None. To remove them:
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    remove_unmapped=True
)
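When unmapped variants are kept, they can also be separated afterwards with ordinary pandas filtering. The sketch below does not call the library: it builds a small stand-in for the output, using the default column names and the documented -1 sentinel, with made-up lifted values.

```python
import pandas as pd

# Stand-in for a liftover result; the lifted values are invented for illustration.
result = pd.DataFrame({
    "CHR": [1, 1, 2],
    "POS": [1000, 2000, 3000],
    "CHR_LIFT": [1, None, 2],
    "POS_LIFT": [1001, -1, 3003],  # -1 marks a variant that failed to map
})

# Boolean mask of rows that failed to map
unmapped = result["POS_LIFT"] == -1

# Keep the failures for inspection and continue with the mapped rows
unmapped_rows = result.loc[unmapped]
mapped_only = result.loc[~unmapped].reset_index(drop=True)

print(f"{int(unmapped.sum())} of {len(result)} variants failed to map")
```

Keeping the unmapped rows in a separate frame makes it easy to report how many variants were lost before dropping them.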
Coordinate Systems
The library supports both 0-based (BED format) and 1-based (GWAS standard) coordinates:
# For 0-based input coordinates
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    one_based_input=False,
    one_based_output=False
)
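The two flags amount to an off-by-one shift on the way in and on the way out. A minimal sketch of that arithmetic, assuming the conversion works on 0-based positions internally (as chain files do); the helper names `to_zero_based` and `from_zero_based` are hypothetical and not part of the library:

```python
def to_zero_based(pos, one_based_input=True):
    """Shift an input position to 0-based if it was given as 1-based."""
    return pos - 1 if one_based_input else pos


def from_zero_based(pos0, one_based_output=True):
    """Shift a 0-based position back to 1-based if requested."""
    return pos0 + 1 if one_based_output else pos0


# The 1-based GWAS position 725932 and the BED start 725931 name the same base.
gwas_pos = 725932
bed_start = 725931
same_base = to_zero_based(gwas_pos, one_based_input=True) == to_zero_based(
    bed_start, one_based_input=False
)
print(same_base)
```

Mixing the flags (for example 1-based input with 0-based output) simply applies one shift and not the other.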
Chain Files
UCSC chain files can be downloaded from the UCSC Genome Browser. Common chain files include:
- hg19ToHg38.over.chain.gz: Convert from hg19 to hg38
- hg38ToHg19.over.chain.gz: Convert from hg38 to hg19
- hg18ToHg19.over.chain.gz: Convert from hg18 to hg19
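For orientation, a chain file is plain text: each chain starts with a header line (`chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`), followed by `size dt dq` lines describing ungapped blocks and the gaps after them, and a final line holding only `size`. In a liftover chain the "t" side is the source build and the "q" side is the destination build. The sketch below parses a made-up toy chain and lifts single positions; `Segment`, `parse_chain_text`, and `lift_position` are hypothetical names, not the library's internals, and the source strand is assumed to be "+".

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    src_chrom: str
    src_start: int  # 0-based, inclusive
    src_end: int    # 0-based, exclusive
    dst_chrom: str
    dst_start: int  # 0-based, measured along dst_strand
    dst_strand: str
    dst_size: int   # full length of the destination chromosome
    score: int


def parse_chain_text(text: str) -> List[Segment]:
    """Turn chain-format text into a flat list of ungapped segments."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # a blank line separates chains
        if fields[0] == "chain":
            score = int(fields[1])
            src_chrom, src_pos = fields[2], int(fields[5])
            dst_chrom, dst_size = fields[7], int(fields[8])
            dst_strand, dst_pos = fields[9], int(fields[10])
            continue
        size = int(fields[0])
        segments.append(Segment(src_chrom, src_pos, src_pos + size,
                                dst_chrom, dst_pos, dst_strand, dst_size, score))
        if len(fields) == 3:  # "size dt dq": skip the gap on each side
            src_pos += size + int(fields[1])
            dst_pos += size + int(fields[2])
    return segments


def lift_position(segments: List[Segment], chrom: str,
                  pos0: int) -> Optional[Tuple[str, int, str]]:
    """Lift one 0-based position; returns None when it falls in a gap."""
    for seg in segments:  # linear scan keeps the sketch short
        if seg.src_chrom == chrom and seg.src_start <= pos0 < seg.src_end:
            dst = seg.dst_start + (pos0 - seg.src_start)
            if seg.dst_strand == "-":
                # Minus-strand coordinates count from the other end
                dst = seg.dst_size - 1 - dst
            return seg.dst_chrom, dst, seg.dst_strand
    return None


# Toy chains, invented for illustration (not taken from a real chain file)
TOY_CHAIN = """\
chain 1000 chr1 1000 + 100 250 chr1 2000 + 500 640 1
50 10 0
90

chain 500 chr2 1000 + 0 100 chr2 1000 - 200 300 2
100
"""

segments = parse_chain_text(TOY_CHAIN)
print(lift_position(segments, "chr1", 120))
print(lift_position(segments, "chr1", 155))  # falls in the 10-base gap
print(lift_position(segments, "chr2", 10))
```

A real `.chain.gz` file could be read with `gzip.open(path, "rt")` and passed through the same parser.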
API Reference
liftover_df()
Main function for lifting over genomic coordinates in a pandas DataFrame.
Parameters:
- df (pd.DataFrame): DataFrame containing genomic coordinates
- chain_path (str): Path to UCSC chain file (.chain or .chain.gz)
- chrom_col (str, default="CHR"): Column name for chromosome
- pos_col (str, default="POS"): Column name for position
- out_chrom_col (str, default="CHR_LIFT"): Output column name for lifted chromosome
- out_pos_col (str, default="POS_LIFT"): Output column name for lifted position
- out_strand_col (str, default="STRAND_LIFT"): Output column name for lifted strand
- one_based_input (bool, default=True): Whether input positions are 1-based
- one_based_output (bool, default=True): Whether output positions should be 1-based
- remove_unmapped (bool, default=False): Remove variants that fail to map
- convert_special_chromosomes (bool, default=True): Convert X→23, Y→24, M/MT→25
Returns:
pd.DataFrame: DataFrame with lifted coordinates added as new columns
Example:
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    chrom_col="CHR",
    pos_col="POS"
)
Chromosome Name Handling
The library automatically handles various chromosome name formats:
- Input formats: 1, chr1, X, chrX, 23 (for X), 24 (for Y), 25 (for M/MT)
- Output format: By default, special chromosomes are converted to numeric values:
  - X → 23
  - Y → 24
  - M/MT → 25
To keep special chromosomes as strings:
result = liftover_df(
    df,
    chain_path="hg19ToHg38.over.chain.gz",
    convert_special_chromosomes=False
)
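The normalization rules above can be pictured with a small stand-alone function. This is only a sketch of the documented mapping, not the library's code: `normalize_chrom` is a hypothetical name, and leaving numeric inputs such as 23 untouched when conversion is disabled is an assumption.

```python
SPECIAL_TO_NUMERIC = {"X": 23, "Y": 24, "M": 25, "MT": 25}


def normalize_chrom(value, convert_special=True):
    """Strip an optional 'chr' prefix and map X/Y/M/MT to 23/24/25."""
    text = str(value).strip()
    if text.lower().startswith("chr"):
        text = text[3:]
    text = text.upper()
    if text.isdigit():
        return int(text)  # assumption: numeric names pass through unchanged
    if convert_special and text in SPECIAL_TO_NUMERIC:
        return SPECIAL_TO_NUMERIC[text]
    return text  # special or unrecognized names are kept as strings


for raw in [1, "chr1", "X", "chrX", "MT", 23]:
    print(raw, "->", normalize_chrom(raw))
```

With `convert_special=False`, the same call returns "X" for both "X" and "chrX".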
Testing
Run the test suite:
pytest test_liftover_df.py -v
Example
See example.py for a complete example demonstrating liftover with a real dataset:
python example.py
How It Works
Building Disjoint Intervals
UCSC chain files often contain overlapping segments from different alignment chains. To enable fast and unambiguous coordinate lookup, this library builds a disjoint interval cover that selects the highest-scoring segment at each position.
The Problem:
- Chain files contain multiple segments that can overlap at the same genomic positions
- Each position needs to map to exactly one target coordinate
- We need to choose which segment to use when overlaps occur
The Solution: The library uses a sweep-line algorithm to build non-overlapping (disjoint) intervals:
- Parse segments: Extract all alignment segments from the chain file
- Build disjoint cover: For overlapping regions, select the segment with the highest score
- Create index: Build a sorted array of disjoint intervals for O(log n) lookup
Example: If we have overlapping segments:
- Segment A: [100, 200) with score 1000
- Segment B: [150, 250) with score 2000
- Segment C: [300, 400) with score 500
The disjoint cover becomes:
- [100, 150) → Segment A (only A covers this region)
- [150, 250) → Segment B (B has higher score than A in overlap)
- [300, 400) → Segment C (no overlap)
This ensures each position maps to exactly one target coordinate, enabling fast binary search lookup.
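The worked example above can be reproduced with a short sweep over segment boundaries. This is a pure-Python sketch under simplifying assumptions (one chromosome, segments given as `(start, end, score, name)` tuples, score ties broken arbitrarily); `build_disjoint_cover` and `lookup` are hypothetical names, and the library's own implementation may differ.

```python
import heapq
from bisect import bisect_right


def build_disjoint_cover(segments):
    """Return sorted, non-overlapping (start, end, name) intervals.

    Where segments overlap, the one with the highest score wins.
    """
    # Every segment start or end is a point where the winner may change.
    cuts = sorted({p for s, e, _, _ in segments for p in (s, e)})
    by_start = sorted(segments, key=lambda seg: seg[0])
    active = []  # max-heap on score, stored as (-score, end, name)
    cover = []
    i = 0
    for lo, hi in zip(cuts, cuts[1:]):
        # Activate segments that start at or before this piece
        while i < len(by_start) and by_start[i][0] <= lo:
            _, end, score, name = by_start[i]
            heapq.heappush(active, (-score, end, name))
            i += 1
        # Lazily discard top entries that ended before this piece
        while active and active[0][1] <= lo:
            heapq.heappop(active)
        if not active:
            continue  # nothing covers [lo, hi)
        name = active[0][2]
        if cover and cover[-1][2] == name and cover[-1][1] == lo:
            cover[-1] = (cover[-1][0], hi, name)  # extend the previous interval
        else:
            cover.append((lo, hi, name))
    return cover


def lookup(cover, starts, pos):
    """Binary search for the interval containing pos; None if uncovered."""
    idx = bisect_right(starts, pos) - 1
    if idx >= 0 and pos < cover[idx][1]:
        return cover[idx][2]
    return None


segments = [(100, 200, 1000, "A"), (150, 250, 2000, "B"), (300, 400, 500, "C")]
cover = build_disjoint_cover(segments)
starts = [interval[0] for interval in cover]

print(cover)
print(lookup(cover, starts, 175), lookup(cover, starts, 260))
```

Because the intervals are disjoint and sorted by start, a single `bisect_right` call is enough to find the only candidate interval for a position.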
Performance
This implementation is optimized for large datasets and uses vectorized numpy operations for coordinate conversion. The disjoint interval index supports binary-search lookup in O(log n) time per position, where n is the number of disjoint intervals searched. For batch processing of large DataFrames, this is typically faster than the original UCSC liftOver tool.
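To illustrate what a vectorized lookup of this kind can look like, the sketch below lifts a whole array of positions with one `np.searchsorted` call. The index arrays are a made-up toy for a single chromosome on the plus strand, and `lift_positions` is a hypothetical name; this is not the library's internal code.

```python
import numpy as np

# Toy disjoint index (0-based, half-open intervals), invented for illustration
starts = np.array([100, 150, 300])
ends = np.array([150, 250, 400])
offsets = np.array([1000, 2000, -50])  # destination start minus source start


def lift_positions(pos):
    """Lift an array of positions at once; unmapped positions become -1."""
    pos = np.asarray(pos)
    # Index of the last interval whose start is <= pos (-1 if there is none)
    idx = np.searchsorted(starts, pos, side="right") - 1
    safe = np.maximum(idx, 0)  # avoid indexing with -1
    hit = (idx >= 0) & (pos < ends[safe])
    return np.where(hit, pos + offsets[safe], -1)


lifted = lift_positions([120, 175, 260, 50, 399, 400])
print(lifted)
```

The cost per position is one binary search plus a few array operations, with no Python-level loop over rows.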
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Citation
If you use this library in your research, please cite:
@software{sumstats-liftover,
  title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
  author = {Yunye He},
  year = {2025},
  url = {https://github.com/yourusername/sumstats-liftover}
}
File details
Details for the file sumstats_liftover-0.2.0.tar.gz.
File metadata
- Download URL: sumstats_liftover-0.2.0.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c98439d4ab73be4c28ceebcce9d32936da08505a74291e91496d978547454501 |
| MD5 | 8990bc0d3fdc357c396a9fe200e4bb9d |
| BLAKE2b-256 | 90d686c86c969cab343d8613de562643aebb985e89cb9cd44bc0f8789bdbc0bb |
File details
Details for the file sumstats_liftover-0.2.0-py3-none-any.whl.
File metadata
- Download URL: sumstats_liftover-0.2.0-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ee74475ef32bb9abc19f1e8207ed6dafc284bf59c1af2ce449f9f4ab9bbea1b |
| MD5 | 5d8fafb96f9635f642abfe1b49a33783 |
| BLAKE2b-256 | fb5c133b5da8a6b8127d4e2e4b51b0784a19d2147ee00640d3c2f3b02c004066 |