Fast chain-based liftover for pandas DataFrames

sumstats-liftover

Fast chain-based liftover for pandas DataFrames using UCSC chain files.

A standalone, vectorized implementation that converts genomic coordinates in pandas DataFrames from one genome build (e.g., hg19/GRCh37) to another (e.g., hg38/GRCh38) using UCSC chain files.

Note: This module is part of GWASLab, a comprehensive Python package for processing and visualizing GWAS summary statistics.

Features

  • Fast: ~1.2M rows/second throughput, 24-25x faster than UCSC liftOver
  • Built-in chain files: Includes commonly used chain files (hg19↔hg38, hg18→hg19)
  • Standalone: No external dependencies on UCSC tools
  • Flexible: Custom column names, 0-based/1-based coordinates
  • Robust: Handles chromosome normalization, special chromosomes, unmapped variants
  • Accurate: 100% agreement with UCSC liftOver for standard chromosomes

Installation

pip install sumstats-liftover

Requirements: Python >= 3.8, numpy >= 1.20.0, pandas >= 1.3.0

Quick Start

import pandas as pd
from sumstats_liftover import liftover_df, get_chain_path

# Create dataframe with genomic coordinates
df = pd.DataFrame({
    'CHR': [1, 1, 2],
    'POS': [725932, 725933, 100000],  # hg19 positions
    'EA': ['G', 'A', 'C'],
    'NEA': ['A', 'G', 'T']
})

# Perform liftover using built-in chain file
result = liftover_df(
    df,
    chain_path=get_chain_path("hg19ToHg38"),
    chrom_col="CHR",
    pos_col="POS"
)

print(result[['CHR', 'POS', 'CHR_LIFT', 'POS_LIFT', 'STRAND_LIFT']])

Usage

Built-in Chain Files

The package includes commonly used chain files:

from sumstats_liftover import liftover_df, get_chain_path, list_chain_files

# List available chain files
list_chain_files()
# {'hg19ToHg38': 'Convert from hg19/GRCh37 to hg38/GRCh38',
#  'hg38ToHg19': 'Convert from hg38/GRCh38 to hg19/GRCh37',
#  'hg18ToHg19': 'Convert from hg18 to hg19/GRCh37'}

# Use built-in chain file
result = liftover_df(df, chain_path=get_chain_path("hg19ToHg38"))

Custom Chain Files

Use your own chain files by providing the path:

result = liftover_df(df, chain_path="/path/to/custom.chain.gz")

UCSC chain files can be downloaded from the UCSC Genome Browser downloads site.

Note: The parser supports both space-separated and tab-separated chain files, and automatically handles comment headers (lines starting with #) at the beginning of chain files.
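To make the note above concrete, here is a minimal sketch of a whitespace-tolerant chain reader. It is not the package's parser; the function name `parse_chain_lines` and the dictionary keys are hypothetical. It relies only on the published UCSC chain layout: a `chain` header line with 12 fields after the keyword, followed by `size dt dq` block lines, with the last block carrying `size` only.

```python
import io

def parse_chain_lines(lines):
    """Yield (header, blocks) for each chain in an iterable of text lines.

    Illustrative only: names and return shape are assumptions, not the
    sumstats-liftover API.
    """
    header, blocks = None, []
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and '#' comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # splits on any run of spaces or tabs
        if fields[0] == "chain":
            if header is not None:
                yield header, blocks
            header = {
                "score": int(fields[1]),
                "t_name": fields[2], "t_size": int(fields[3]),
                "t_strand": fields[4],
                "t_start": int(fields[5]), "t_end": int(fields[6]),
                "q_name": fields[7], "q_size": int(fields[8]),
                "q_strand": fields[9],
                "q_start": int(fields[10]), "q_end": int(fields[11]),
                "id": fields[12],
            }
            blocks = []
        else:
            # Block line: "size dt dq", or just "size" for the final block
            size = int(fields[0])
            dt = int(fields[1]) if len(fields) == 3 else 0
            dq = int(fields[2]) if len(fields) == 3 else 0
            blocks.append((size, dt, dq))
    if header is not None:
        yield header, blocks

# One space-separated chain and one tab-separated chain, after a comment header
example = (
    "# comment header\n"
    "chain 1000 chr1 249250621 + 10000 10300 chr1 248956422 + 10000 10310 1\n"
    "100 0 10\n"
    "200\n"
    "\n"
    "chain\t500\tchr2\t243199373\t+\t500\t600\tchr2\t242193529\t+\t700\t800\t2\n"
    "100\n"
)
chains = list(parse_chain_lines(io.StringIO(example)))
```

Splitting with `str.split()` and no argument is what lets the same loop handle both separators.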

Filtering Options

Default behavior matches UCSC liftOver (allows non-standard chromosomes, alternate contigs, inter-chromosomal mappings).

Filter problematic mappings:

# Remove all problematic mappings with one parameter
result = liftover_df(df, chain_path=chain_path, remove=True)

# Or control individually
result = liftover_df(
    df,
    chain_path=chain_path,
    remove_unmapped=True,                    # Remove unmapped variants
    remove_nonstandard_chromosomes=True,    # Filter non-standard chromosomes
    remove_alternative_chromosomes=True,     # Filter alternative contigs
    remove_different_chromosomes=True        # Filter inter-chromosomal mappings
)
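For readers who prefer to filter after the fact, the following sketch shows what equivalent pandas masks might look like. The DataFrame here is built by hand, not produced by `liftover_df`, and several details are assumptions made for illustration: that unmapped rows carry missing values in the lifted columns, that lifted chromosome names have no "chr" prefix, and that "standard" means 1-22, X, Y and M.

```python
import pandas as pd

# Hand-built stand-in for a liftover result (assumed layout, see lead-in)
lifted = pd.DataFrame({
    "CHR": ["1", "1", "2", "3"],
    "POS": [100, 200, 300, 400],
    "CHR_LIFT": ["1", None, "2_KI270894v1_alt", "4"],
    "POS_LIFT": [150.0, float("nan"), 900.0, 450.0],
})

# Assumed definition of "standard" chromosomes for this sketch
STANDARD = {str(i) for i in range(1, 23)} | {"X", "Y", "M"}

mapped = lifted["POS_LIFT"].notna()                 # drop unmapped rows
standard = lifted["CHR_LIFT"].isin(STANDARD)        # drop alt/non-standard contigs
same_chrom = lifted["CHR"] == lifted["CHR_LIFT"]    # drop inter-chromosomal mappings

clean = lifted[mapped & standard & same_chrom]
```

Keeping the masks separate makes it easy to count how many rows each criterion removes before committing to the filter.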

Coordinate Systems

# 0-based input/output (BED format)
result = liftover_df(df, chain_path=chain_path, 
                     one_based_input=False, one_based_output=False)

# 1-based input/output (GWAS standard, default)
result = liftover_df(df, chain_path=chain_path,
                     one_based_input=True, one_based_output=True)
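The relationship between the two conventions is a one-position shift: a 1-based position p corresponds to the 0-based half-open interval [p-1, p) used by BED. The sketch below shows that conversion in plain pandas, independent of the library; the column names `chrom`, `start` and `end` are just the usual BED names.

```python
import pandas as pd

# 1-based positions, as typically found in GWAS summary statistics
df = pd.DataFrame({"CHR": [1, 1, 2], "POS": [725932, 725933, 100000]})

# Convert to 0-based, half-open BED-style intervals: [POS - 1, POS)
bed = pd.DataFrame({
    "chrom": "chr" + df["CHR"].astype(str),
    "start": df["POS"] - 1,
    "end": df["POS"],
})

# Converting back to 1-based recovers the original positions
back = bed["start"] + 1
```

If your positions come from a BED file, pass `one_based_input=False` rather than shifting them by hand, so the shift is applied exactly once.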

Custom Column Names

result = liftover_df(
    df,
    chain_path=chain_path,
    chrom_col="Chromosome",
    pos_col="BP",
    out_chrom_col="CHR_hg38",
    out_pos_col="POS_hg38"
)

Special Chromosomes

By default, special chromosomes (X, Y, M) are kept as strings. Convert to numeric:

result = liftover_df(df, chain_path=chain_path, 
                     convert_special_chromosomes=True)  # X→23, Y→24, M→25
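As an illustration of what chromosome normalization and the X→23, Y→24, M→25 conversion involve, here is a small standalone sketch. The helper `normalize_chrom`, the `SPECIAL` table and the treatment of "MT" as a synonym for "M" are assumptions made for this example and may differ from the package's internal rules.

```python
import pandas as pd

# Mapping stated in this README, plus "MT" as an assumed alias for "M"
SPECIAL = {"X": 23, "Y": 24, "M": 25, "MT": 25}

def normalize_chrom(series, convert_special=False):
    """Strip a leading 'chr' and optionally map X/Y/M to 23/24/25."""
    s = series.astype(str).str.strip()
    s = s.str.replace(r"^chr", "", case=False, regex=True).str.upper()
    if not convert_special:
        return s
    # Autosomes parse as numbers; everything else becomes NaN
    numeric = pd.to_numeric(s, errors="coerce")
    # Special chromosomes get their numeric code; other labels stay NaN
    special = pd.to_numeric(s.map(SPECIAL), errors="coerce")
    # Nullable integer dtype keeps unrecognized labels as <NA>
    return numeric.fillna(special).astype("Int64")

chroms = pd.Series(["chr1", "X", "chrY", "chrM", 22, "MT"])
as_strings = normalize_chrom(chroms)
as_numbers = normalize_chrom(chroms, convert_special=True)
```

The nullable `Int64` dtype is used so that labels with no numeric code do not force the whole column to float.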

API Reference

liftover_df()

Main function for lifting over genomic coordinates.

Parameters:

Parameter | Type | Default | Description
--- | --- | --- | ---
df | pd.DataFrame | - | DataFrame with genomic coordinates
chain_path | str | - | Path to UCSC chain file
chrom_col | str | "CHR" | Input chromosome column name
pos_col | str | "POS" | Input position column name
out_chrom_col | str | "CHR_LIFT" | Output chromosome column name
out_pos_col | str | "POS_LIFT" | Output position column name
out_strand_col | str | "STRAND_LIFT" | Output strand column name
one_based_input | bool | True | Whether input is 1-based
one_based_output | bool | True | Whether output should be 1-based
remove | bool | False | Remove all problematic mappings (convenience option)
remove_unmapped | bool | False | Remove unmapped variants
remove_nonstandard_chromosomes | bool | False | Filter non-standard chromosomes
remove_alternative_chromosomes | bool | False | Filter alternative contigs
remove_different_chromosomes | bool | False | Filter inter-chromosomal mappings
convert_special_chromosomes | bool | False | Convert X→23, Y→24, M→25
ucsc_compatible | bool | False | Explicit UCSC-compatible mode (redundant with defaults)

Returns: pd.DataFrame with lifted coordinates added as new columns.

Chain File Functions

  • get_chain_path(name) - Get path to built-in chain file
  • list_chain_files() - List all available built-in chain files
  • get_chain_info(name) - Get information about a chain file

Performance

Benchmarks

Dataset Size | Time | Throughput | Memory
--- | --- | --- | ---
1,000 rows | ~0.19s | ~5,200 rows/s | < 10 MB
10,000 rows | ~0.19s | ~54,000 rows/s | < 20 MB
1,000,000 rows | ~0.84s | ~1,190,000 rows/s | ~200 MB
30,000,000 rows | ~24s | ~1,250,000 rows/s | ~2 GB

Key characteristics:

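  • ~1.2M rows/second throughput on large inputs (1M rows and above); smaller inputs are dominated by a fixed cost of about 0.19s
  • Approximately linear scaling with dataset size beyond that fixed cost
  • Memory efficient: roughly 70-200 bytes per row at the 1M-30M scale (~200 MB for 1M rows, ~2 GB for 30M rows)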

Comparison with UCSC liftOver

Tool | Throughput | Time (1M) | Time (30M) | Speed
--- | --- | --- | --- | ---
sumstats-liftover | ~1.2M rows/s | 0.84s | ~24s | 24-25x faster
UCSC liftOver | ~48.6K rows/s | 20.58s | ~617s | Baseline

Accuracy: 100% agreement with UCSC liftOver for standard chromosome mappings (tested on 1M variants).
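The speedup and throughput figures can be checked from the reported timings alone. The short calculation below uses only the numbers in the comparison table; the variable names are arbitrary.

```python
# (sumstats-liftover seconds, UCSC liftOver seconds) from the comparison table
timings = {"1M": (0.84, 20.58), "30M": (24.0, 617.0)}
rows = {"1M": 1_000_000, "30M": 30_000_000}

# Speedup = baseline time / our time
speedups = {k: ucsc / ours for k, (ours, ucsc) in timings.items()}

# Throughput = rows / seconds for sumstats-liftover
throughput = {k: rows[k] / timings[k][0] for k in timings}

for k in timings:
    print(f"{k}: {speedups[k]:.1f}x faster, {throughput[k]:,.0f} rows/s")
```

Because the 30M timings are approximate, the ratio for that row should be read as approximate too.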

How It Works

The library builds a disjoint interval cover from UCSC chain files by selecting the highest-scoring segment at each position when overlaps occur. This enables O(log n) coordinate lookup using binary search.

Algorithm:

  1. Parse all alignment segments from chain file
  2. Build disjoint cover: for overlaps, select highest-scoring segment
  3. Create sorted index for fast binary search lookup
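The three steps above can be sketched for a single chromosome as follows. This is a simplified illustration, not the library's implementation: it assumes plus-strand segments only, uses 0-based half-open coordinates, builds the cover with a quadratic scan for clarity, and returns -1 for unmapped positions. The function names and the segment tuple layout `(t_start, t_end, q_start, score)` are hypothetical.

```python
import numpy as np

def build_disjoint_cover(segments):
    """Split overlapping segments at every boundary and keep, for each
    elementary interval, the highest-scoring segment that covers it."""
    points = sorted({p for s in segments for p in (s[0], s[1])})
    starts, ends, offsets = [], [], []
    for lo, hi in zip(points[:-1], points[1:]):
        covering = [s for s in segments if s[0] <= lo and hi <= s[1]]
        if not covering:
            continue  # gap: positions here are unmapped
        best = max(covering, key=lambda s: s[3])
        starts.append(lo)
        ends.append(hi)
        offsets.append(best[2] - best[0])  # plus strand: constant shift
    return np.array(starts), np.array(ends), np.array(offsets)

def lift_positions(pos, starts, ends, offsets):
    """Vectorized lookup: binary-search the interval, then apply its shift."""
    pos = np.asarray(pos)
    # Index of the last interval whose start is <= pos (-1 if none)
    idx = np.searchsorted(starts, pos, side="right") - 1
    safe = np.clip(idx, 0, len(starts) - 1)
    hit = (idx >= 0) & (pos < ends[safe])
    return np.where(hit, pos + offsets[safe], -1)

# Two overlapping segments (scores 500 and 900) and one isolated segment
segments = [
    (100, 200, 1100, 500),
    (150, 300, 5150, 900),
    (400, 450, 400, 100),
]
starts, ends, offsets = build_disjoint_cover(segments)
lifted = lift_positions([120, 160, 250, 350, 420, 50, 450], starts, ends, offsets)
```

In the overlap [150, 200) the higher-scoring second segment wins, so position 160 is shifted by 5000 and not by 1000. Because the cover is sorted and disjoint, a single `np.searchsorted` call resolves the whole array of positions at once.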

Testing

# Run all tests
pytest tests/ -v

# Run performance tests
pytest tests/test_performance.py -v -s

# Run accuracy tests
pytest tests/test_variant_types.py -v

See example.py for usage examples.

License

Package: MIT License (see LICENSE file)

UCSC Chain Files: Built-in chain files are proprietary to The Regents of the University of California:

  • Free for Independent Researchers and Nonprofit Organizations (non-commercial use)
  • Commercial use requires UCSC license
  • EULA | Licensing

Users are responsible for ensuring compliance with the UCSC EULA.

Citation

GWASLab (main package):

@article{he2023gwaslab,
  title = {GWASLab: a Python package for processing and visualizing GWAS summary statistics},
  author = {He, Yunye and Koido, Masaru and Shimmori, Yoichi and Kamatani, Yoichiro},
  year = {2023},
  journal = {Jxiv},
  doi = {10.51094/jxiv.370}
}

sumstats-liftover:

@software{sumstats-liftover,
  title = {sumstats-liftover: Fast chain-based liftover for pandas DataFrames},
  author = {He, Yunye},
  year = {2024},
  url = {https://github.com/yourusername/sumstats-liftover},
  note = {Module of GWASLab}
}
