Fast chain-based liftover for pandas DataFrames
sumstats-liftover
Fast chain-based liftover for pandas DataFrames using UCSC chain files.
A standalone, vectorized implementation for lifting over genomic coordinates in pandas DataFrames. This library provides a fast and efficient way to convert genomic coordinates from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.
Note: This module is part of GWASLab, a comprehensive Python package for processing and visualizing GWAS summary statistics.
Features
- Fast: ~1.2M rows/second throughput on large inputs (1M rows and above), 24-25x faster than UCSC liftOver
- Built-in chain files: Includes commonly used chain files (hg19↔hg38, hg18→hg19)
- Standalone: No external dependencies on UCSC tools
- Flexible: Custom column names, 0-based/1-based coordinates
- Robust: Handles chromosome normalization, special chromosomes, unmapped variants
- Accurate: 100% agreement with UCSC liftOver for standard chromosomes
Installation
pip install sumstats-liftover
Requirements: Python >= 3.8, numpy >= 1.20.0, pandas >= 1.3.0
Quick Start
import pandas as pd
from sumstats_liftover import liftover_df, get_chain_path
# Create dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})
# Perform liftover using built-in chain file
result = liftover_df(
    df,
    chain_path=get_chain_path("hg19ToHg38"),
    chrom_col="CHR",
    pos_col="POS"
)
print(result[['CHR', 'POS', 'CHR_LIFT', 'POS_LIFT', 'STRAND_LIFT']])
Usage
Built-in Chain Files
The package includes commonly used chain files:
from sumstats_liftover import get_chain_path, list_chain_files
# List available chain files
list_chain_files()
# {'hg19ToHg38': 'Convert from hg19/GRCh37 to hg38/GRCh38',
# 'hg38ToHg19': 'Convert from hg38/GRCh38 to hg19/GRCh37',
# 'hg18ToHg19': 'Convert from hg18 to hg19/GRCh37'}
# Use built-in chain file
result = liftover_df(df, chain_path=get_chain_path("hg19ToHg38"))
Custom Chain Files
Use your own chain files by providing the path:
result = liftover_df(df, chain_path="/path/to/custom.chain.gz")
UCSC chain files can be downloaded from the UCSC Genome Browser download server.
Note: The parser supports both space-separated and tab-separated chain files, and automatically handles comment headers (lines starting with #) at the beginning of chain files.
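For readers unfamiliar with the chain format, the following is a minimal sketch of a parser that tolerates both separators and leading comments. It is not the package's parser: `parse_chain_text` and `Block` are hypothetical names, and the sketch assumes the standard UCSC layout, in which a `chain` header line (score, tName, tSize, tStrand, tStart, tEnd, qName, qSize, qStrand, qStart, qEnd, id) is followed by `size dt dq` lines and a final `size` line.

```python
from typing import Iterable, List, NamedTuple


class Block(NamedTuple):
    """One ungapped aligned block (0-based, half-open, as in the chain format)."""
    score: int
    t_name: str
    t_start: int
    t_end: int
    q_name: str
    q_strand: str
    q_start: int  # start on the query strand given by q_strand


def parse_chain_text(lines: Iterable[str]) -> List[Block]:
    """Hypothetical helper: turn chain-file lines into ungapped blocks."""
    blocks: List[Block] = []
    header = None
    t_pos = q_pos = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split()  # splits on spaces and tabs alike
        if fields[0] == "chain":
            header = (int(fields[1]), fields[2], fields[7], fields[9])
            t_pos, q_pos = int(fields[5]), int(fields[10])
            continue
        if header is None:
            raise ValueError("alignment line before any chain header")
        size = int(fields[0])
        score, t_name, q_name, q_strand = header
        blocks.append(
            Block(score, t_name, t_pos, t_pos + size, q_name, q_strand, q_pos)
        )
        if len(fields) == 3:  # "size dt dq": step over the block and both gaps
            t_pos += size + int(fields[1])
            q_pos += size + int(fields[2])
    return blocks


# Usage: a space-separated header, a tab-separated data line, and a comment
example = [
    "# produced by some pipeline",
    "chain 1000 chr1 5000 + 100 160 chr1 6000 + 200 265 1",
    "20\t10\t15",
    "30",
]
example_blocks = parse_chain_text(example)
```

Calling `str.split()` without arguments is what makes the separator irrelevant, since it splits on any run of whitespace.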
Filtering Options
Default behavior matches UCSC liftOver (allows non-standard chromosomes, alternate contigs, inter-chromosomal mappings).
Filter problematic mappings:
# Remove all problematic mappings with one parameter
result = liftover_df(df, chain_path=chain_path, remove=True)
# Or control individually
result = liftover_df(
    df,
    chain_path=chain_path,
    remove_unmapped=True,                 # Remove unmapped variants
    remove_nonstandard_chromosomes=True,  # Filter non-standard chromosomes
    remove_alternative_chromosomes=True,  # Filter alternative contigs
    remove_different_chromosomes=True     # Filter inter-chromosomal mappings
)
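The exact rules the package uses to decide what counts as non-standard or alternative are not spelled out here. As a rough illustration only, the sketch below classifies UCSC-style contig names under two assumptions: that "standard" means chr1-22, X, Y, and M, and that alternate contigs carry the `_alt` suffix used in hg38 naming. `classify_contig` and `is_inter_chromosomal` are hypothetical helpers, not part of the package.

```python
import re

# Assumption: standard chromosomes are 1-22, X, Y, M/MT, with or without "chr"
_STANDARD = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def classify_contig(name: str) -> str:
    """Return 'standard', 'alternative', or 'nonstandard' for a contig name."""
    if _STANDARD.match(name):
        return "standard"
    if name.endswith("_alt"):
        return "alternative"
    return "nonstandard"  # e.g. chrUn_*, *_random, *_fix


def is_inter_chromosomal(source: str, lifted: str) -> bool:
    """True when a variant lands on a different contig than it started on."""
    return source != lifted
```

Under these assumptions, `chr1_KI270762v1_alt` is reported as alternative and `chrUn_GL000220v1` as non-standard, which corresponds to the two separate filter flags above.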
Coordinate Systems
# 0-based input/output (BED format)
result = liftover_df(df, chain_path=chain_path,
                     one_based_input=False, one_based_output=False)
# 1-based input/output (GWAS standard, default)
result = liftover_df(df, chain_path=chain_path,
                     one_based_input=True, one_based_output=True)
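Chain files store coordinates as 0-based, half-open intervals, so a 1-based position has to be shifted before lookup and shifted back afterwards. The sketch below shows one plausible way the two flags could be interpreted; `to_internal` and `from_internal` are hypothetical helpers, and the package's internals may differ.

```python
def to_internal(pos: int, one_based_input: bool = True) -> int:
    """Convert an input position to the 0-based convention used by chain files."""
    return pos - 1 if one_based_input else pos


def from_internal(pos0: int, one_based_output: bool = True) -> int:
    """Convert a 0-based lifted position back to the requested convention."""
    return pos0 + 1 if one_based_output else pos0


# The 1-based position 725932 and the 0-based position 725931 name the same base
same_base = to_internal(725932, one_based_input=True) == to_internal(
    725931, one_based_input=False
)
```

Mixing the flags (for example 1-based input with 0-based output) then amounts to applying only one of the two shifts.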
Custom Column Names
result = liftover_df(
    df,
    chain_path=chain_path,
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38"
)
Special Chromosomes
By default, special chromosomes (X, Y, M) are kept as strings. To convert them to numeric codes, set convert_special_chromosomes=True:
result = liftover_df(df, chain_path=chain_path,
                     convert_special_chromosomes=True)  # X→23, Y→24, M→25
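To make the mapping concrete, here is a minimal sketch of chromosome normalization with optional numeric conversion. Only the X→23, Y→24, M→25 codes come from this README; stripping a `chr` prefix, treating `MT` as `M`, and the name `normalize_chrom` are assumptions made for illustration.

```python
SPECIAL_TO_NUMERIC = {"X": 23, "Y": 24, "M": 25}


def normalize_chrom(value, convert_special: bool = False):
    """Hypothetical helper: normalize a chromosome label.

    Autosomes become ints; X, Y, M stay strings unless convert_special is set.
    Anything else (e.g. alternate contigs) is returned without the "chr" prefix.
    """
    name = str(value).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    upper = name.upper()
    if upper == "MT":  # assumption: MT is treated as a synonym of M
        upper = "M"
    if upper in SPECIAL_TO_NUMERIC:
        return SPECIAL_TO_NUMERIC[upper] if convert_special else upper
    if name.isdigit():
        return int(name)
    return name
```

With `convert_special=True`, `normalize_chrom("chrX")` gives 23; with the default it gives the string "X".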
API Reference
liftover_df()
Main function for lifting over genomic coordinates.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pd.DataFrame | - | DataFrame with genomic coordinates |
| chain_path | str | - | Path to UCSC chain file |
| chrom_col | str | "CHR" | Input chromosome column name |
| pos_col | str | "POS" | Input position column name |
| out_chrom_col | str | "CHR_LIFT" | Output chromosome column name |
| out_pos_col | str | "POS_LIFT" | Output position column name |
| out_strand_col | str | "STRAND_LIFT" | Output strand column name |
| one_based_input | bool | True | Whether input is 1-based |
| one_based_output | bool | True | Whether output should be 1-based |
| remove | bool | False | Remove all problematic mappings (convenience option) |
| remove_unmapped | bool | False | Remove unmapped variants |
| remove_nonstandard_chromosomes | bool | False | Filter non-standard chromosomes |
| remove_alternative_chromosomes | bool | False | Filter alternative contigs |
| remove_different_chromosomes | bool | False | Filter inter-chromosomal mappings |
| convert_special_chromosomes | bool | False | Convert X→23, Y→24, M→25 |
| ucsc_compatible | bool | False | Explicit UCSC-compatible mode (redundant with defaults) |
Returns: pd.DataFrame with lifted coordinates added as new columns.
Chain File Functions
- get_chain_path(name) - Get path to built-in chain file
- list_chain_files() - List all available built-in chain files
- get_chain_info(name) - Get information about a chain file
Performance
Benchmarks
| Dataset Size | Time | Throughput | Memory |
|---|---|---|---|
| 1,000 rows | ~0.19s | ~5,200 rows/s | < 10 MB |
| 10,000 rows | ~0.19s | ~54,000 rows/s | < 20 MB |
| 1,000,000 rows | ~0.84s | ~1,190,000 rows/s | ~200 MB |
| 30,000,000 rows | ~24s | ~1,250,000 rows/s | ~2 GB |
Key characteristics:
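- ~1.2M rows/second throughput on large inputs (1M rows and above); on small inputs a fixed cost of about 0.19 s dominates, so per-row throughput is much lower
- Near-linear scaling with dataset size once that fixed cost is amortized
- Memory efficient: roughly 70-200 bytes per row (~2 GB for 30M rows, ~200 MB for 1M rows)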
Comparison with UCSC liftOver
| Tool | Throughput | Time (1M) | Time (30M) | Speed |
|---|---|---|---|---|
| sumstats-liftover | ~1.2M rows/s | 0.84s | ~24s | 24-25x faster |
| UCSC liftOver | ~48.6K rows/s | 20.58s | ~617s | Baseline |
Accuracy: 100% agreement with UCSC liftOver for standard chromosome mappings (tested on 1M variants).
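As a sanity check, the derived figures can be recomputed from the raw numbers in the two tables above. The snippet uses only values reported in this README (taking 1 MB as 10^6 bytes and 1 GB as 10^9 bytes) and adds no new measurements.

```python
# (rows, seconds) pairs copied from the benchmark table
timings = {1_000: 0.19, 10_000: 0.19, 1_000_000: 0.84, 30_000_000: 24.0}

# Rows per second at each dataset size
throughput = {rows: rows / secs for rows, secs in timings.items()}

# Approximate memory per row, from the Memory column
bytes_per_row_1m = 200e6 / 1_000_000   # ~200 MB for 1M rows
bytes_per_row_30m = 2e9 / 30_000_000   # ~2 GB for 30M rows

# Speedup over UCSC liftOver on the 1M-row benchmark
speedup_1m = 20.58 / 0.84

for rows, rate in throughput.items():
    print(f"{rows:>12,} rows: {rate:,.0f} rows/s")
print(f"bytes/row at 1M: {bytes_per_row_1m:.0f}, at 30M: {bytes_per_row_30m:.0f}")
print(f"speedup at 1M rows: {speedup_1m:.1f}x")
```

The 1,000-row and 10,000-row runs take the same 0.19 s, which is why throughput only approaches ~1.2M rows/second once the input is large enough to amortize that fixed cost.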
How It Works
The library builds a disjoint interval cover from UCSC chain files by selecting the highest-scoring segment at each position when overlaps occur. This enables O(log n) coordinate lookup using binary search.
Algorithm:
- Parse all alignment segments from chain file
- Build disjoint cover: for overlaps, select highest-scoring segment
- Create sorted index for fast binary search lookup
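The three steps can be sketched in plain Python as follows. This is an illustrative reconstruction under simplifying assumptions, not the package's implementation: it handles a single chromosome, forward-strand segments only, represents each segment as a hypothetical `(start, end, score, offset)` tuple with half-open coordinates, and uses a quadratic construction for clarity where a vectorized or heap-based sweep would be used in practice.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (start, end, score, offset): half-open [start, end); lifted = pos + offset
Segment = Tuple[int, int, int, int]


def build_disjoint_cover(segments: List[Segment]) -> List[Segment]:
    """Split segments at every boundary and keep the highest-scoring one."""
    bounds = sorted({p for s, e, _, _ in segments for p in (s, e)})
    cover: List[Segment] = []
    for lo, hi in zip(bounds, bounds[1:]):
        covering = [seg for seg in segments if seg[0] <= lo and hi <= seg[1]]
        if not covering:
            continue  # gap: positions in [lo, hi) are unmapped
        _, _, score, offset = max(covering, key=lambda seg: seg[2])
        if cover and cover[-1][1] == lo and cover[-1][2:] == (score, offset):
            # Same winning segment as the previous piece: extend it
            cover[-1] = (cover[-1][0], hi, score, offset)
        else:
            cover.append((lo, hi, score, offset))
    return cover


def lift(cover: List[Segment], starts: List[int], pos: int) -> Optional[int]:
    """Binary-search the sorted cover; return None when pos is unmapped."""
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    start, end, _, offset = cover[i]
    return pos + offset if start <= pos < end else None


# Two overlapping segments plus one isolated segment
segments = [(100, 200, 50, 1000), (150, 250, 90, 5000), (300, 350, 10, -20)]
cover = build_disjoint_cover(segments)
starts = [seg[0] for seg in cover]  # sorted index used by the binary search
```

In the overlap 150-200 the segment with score 90 wins, so `lift(cover, starts, 150)` uses offset 5000, while `lift(cover, starts, 120)` still uses the lower-scoring segment because nothing better covers it. Because the cover is disjoint and sorted, each lookup needs a single `bisect_right` call.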
Testing
# Run all tests
pytest tests/ -v
# Run performance tests
pytest tests/test_performance.py -v -s
# Run accuracy tests
pytest tests/test_variant_types.py -v
See example.py for usage examples.
License
Package: MIT License (see LICENSE file)
UCSC Chain Files: Built-in chain files are proprietary to The Regents of the University of California:
- Free for Independent Researchers and Nonprofit Organizations (non-commercial use)
- Commercial use requires UCSC license
- EULA | Licensing
Users are responsible for ensuring compliance with UCSC EULA.
Citation
GWASLab (main package):
@article{he2023gwaslab,
title = {GWASLab: a Python package for processing and visualizing GWAS summary statistics},
author = {He, Yunye and Koido, Masaru and Shimmori, Yoichi and Kamatani, Yoichiro},
year = {2023},
journal = {Jxiv},
doi = {10.51094/jxiv.370}
}
sumstats-liftover:
@software{sumstats-liftover,
title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
author = {He, Yunye},
year = {2024},
url = {https://github.com/yourusername/sumstats-liftover},
note = {Module of GWASLab}
}
Links
- GWASLab - Main package
- GitHub Repository
- UCSC Genome Browser