Skip to main content

An alternative to pyBigWig for bedgraph files

Project description

pyBedGraph

pyBedGraph is an alternative to pyBigWig for bedGraph files.

Features:

  • Finds the mean, approximate mean, median, max, min, coverage, or standard deviation for an interval

Improvements:

  • Much faster (>100x)

Downsides:

  • Uses much more memory
    • 16 bytes per line in bedGraph file
    • 8 bytes per basePair in chromosome loaded
  • Loading the bedGraph file takes a few minutes if it is large

Usage:

Create the object:

from pyBedGraph import BedGraph

# arg1 - chromosome sizes file
# arg2 - bedgraph file
# arg3 - (optional) chromosome_name
# Just load chromosome 'chr1' (uses less memory and takes less time)
bedGraph = BedGraph('hg38.chrom.sizes', 'ENCFF321FZQ.bedGraph', 'chr1')

# Load the whole bedGraph file
bedGraph = BedGraph('hg38.chrom.sizes', 'ENCFF321FZQ.bedGraph')

# Option to not ignore missing basePairs when calculating statistics
bedGraph = BedGraph('hg38.chrom.sizes', 'ENCFF321FZQ.bedGraph', ignore_missing_bp=False)

Choose a specific statistic:

  • 'mean'
  • 'approx_mean' - an approximate mean that is around 5x faster for a 0-1% error
  • 'median' - Much slower (>>10x) due to not being implemented in Cython
  • 'max'
  • 'min'
  • 'coverage'
  • 'std' - 10x slower than the rest of the stats

Choose and load a chromosome to search for:

bedGraph.load_chrom_data('chr1')

Load bins for finding mean:

For mean:

  • sqrt(interval size)

For approx_mean:

  1. Smaller bin size -> more accurate but slower
  2. Larger bin size -> less accurate but faster
bedGraph.load_chrom_bins('chr1', 100)

Search from a file:

# arg1 - interval file
# arg2 - (optional) output_to_file (default is True and outputs to 'chr1_out.txt'
# arg3 - (optional) stat (default is 'mean')
# returns a dictionary; keys are chromosome names, values are numpy arrays
result = bedGraph.stats_from_file('intervals_to_search_for.txt', output_to_file=False, stat='mean')

Search from a list of intervals:

import numpy as np

# Option 1
intervals = [
    ['chr1', 0, 1000],
    ['chr1', 1001, 1500],
    ['chr1', 2000, 2200],
    ['chr1', 3000, 5000],
    ['chr1', 5001, 10000],
    ['chr1', 100000, 101000]
]

# Option 2
start_list = np.array([0, 1001, 2000, 3000, 5001, 100000], dtype=np.int32)
end_list = np.array([1000, 1500, 2200, 5000, 10000, 101000], dtype=np.int32)
chrom_name = 'chr1'

# arg1 - (optional) stat (default is 'mean')
# arg2 - intervals
# arg3 - start_list
# arg4 - end_list
# arg5 - chrom_name
# must have either intervals or start_list, end_list, chrom_name
# returns a numpy array of values
values = bedGraph.stats(intervals=intervals)

values = bedGraph.stats(start_list=start_list, end_list=end_list, chrom_name=chrom_name)

# Output is [0.         0.         0.         0.         0.00207475 0.05981362]
print(values)

Benchmark:

Actual values are found from the stats function in pyBigWig with the exact argument being True. The error for exact stats will be ~1e-8 due to rounding error of conversion of bigWig and bedGraph files.

Alternatively, one can make actual values be from pyBedGraph.

from pyBedGraph import Benchmark

# arg1 - BedGraph object
# arg2 - bigwig file
bedGraph = BedGraph('P2MC7N8HCE3K.bedgraph')
bench = Benchmark(bedGraph, 'P2MC7N8HCE3K.bw')

# arg1 - num_tests
# arg2 - interval_size
# arg3 - chrom_nam
# arg4 - bin_size
# arg5 - stats (optional) (Default is all stats)
# arg6 - only_runtime (optional) (Default is False)
# arg6 - bench_pyBigWig (optional) (Default is True)
# arg6 - pyBigWig_baseline (optional) (Default is True)
result = bench.benchmark(num_test, interval_size, chrom_name, bin_size, stats='mean')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyBedGraph-0.4.1.tar.gz (345.8 kB view details)

Uploaded Source

File details

Details for the file pyBedGraph-0.4.1.tar.gz.

File metadata

  • Download URL: pyBedGraph-0.4.1.tar.gz
  • Upload date:
  • Size: 345.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for pyBedGraph-0.4.1.tar.gz
Algorithm Hash digest
SHA256 67ebec86f46835afe09dbd62f9616735a936f03b90824a8098d3220b08fc6e5d
MD5 63fbdab3338d3f40edd903f475fce9ab
BLAKE2b-256 5af06aea1ea89e9385bcac8caa8e25f01e5ee2c420ea336759e7630cc127d76f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page