Project description

Allows very fast searching of a bed file of any size by gene/SNP location.

For example:

from bed_lookup import BedFile
b = BedFile('my_bed.bed')         # open the bed file (a database is built for large files)
gene = b.lookup('chr3', 1000104)  # look up a single chromosome and position

This module requires Cython, and should work with recent versions of Python 2 and Python 3.

It can also be used with a pandas dataframe directly:

df['new_col'] = b.lookup_df(df, 'chrom', 'pos')
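For instance, a minimal end-to-end sketch (the dataframe and its 'chrom' and 'pos' column names here are assumptions; use whatever your table provides):

import pandas as pd
from bed_lookup import BedFile

b  = BedFile('my_bed.bed')
df = pd.DataFrame({'chrom': ['chr3', 'chr3'],
                   'pos':   [1000104, 1000200]})
# lookup_df returns one result per row, aligned with df's index.
df['new_col'] = b.lookup_df(df, 'chrom', 'pos')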

Note: with large dataframes, this function can be very slow, but there is a nice trick to speed it up:

import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count
from bed_lookup import BedFile

pool = Pool()
b    = BedFile('my_bed.bed')
df   = pd.read_csv('big_table.txt.gz', sep='\t', compression='gzip')
# Split the table into one chunk per CPU; np.array_split preserves
# each chunk's index, so the results can be reassembled in order.
dfs  = np.array_split(df, cpu_count())
run  = []
out  = []
# Our chromosome column is 'chrom' and position column is 'pos'.
for d in dfs:
    run.append(pool.apply_async(b.lookup_df, (d, 'chrom', 'pos')))
for r in run:
    out.append(r.get())
pool.close()
pool.join()
df['new_col'] = pd.concat(out)

Installation

Installation follows the standard Python procedure:

git clone https://github.com/MikeDacre/python_bed_lookup
cd python_bed_lookup
python setup.py build
sudo python setup.py install

If you do not have root permission on your machine, replace the last line with:

python setup.py install --user
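
The package is also published on PyPI, so it should be installable directly with pip:

pip install python_bed_lookup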

Running from the command line

There is a command line script called bed_location_lookup that will be installed in /usr/bin if you install globally, or in ~/.local/usr/bin if you install for your user only. The syntax for that script is:

bed_location_lookup <bed_file> chr1_1000134 chr2_1859323 ....

It will work for any number of gene coordinate arguments. Be aware that there is a file opening delay when the script is run (for small bed files this will be very small, but for large files it can be a few seconds). It is therefore much more efficient to call a single instance of bed_location_lookup with a long list of coordinates than to call it once per coordinate; for a large number of coordinates the difference can be substantial.
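
For example, to batch many lookups into a single invocation (my_bed.bed and coords.txt are hypothetical names; coords.txt would hold one chrN_position entry per line):

bed_location_lookup my_bed.bed $(cat coords.txt)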

bed_location_lookup has a few other options as well; to see them, run:

bed_location_lookup -h

Note: if you know the bed file is large and a database already exists, you can get a considerable speed-up by passing the database file instead of the raw bed file, e.g. bedfile.bed.db instead of bedfile.bed. This bypasses the file length check.
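
For example (assuming bedfile.bed.db was built on a previous run):

bed_location_lookup bedfile.bed.db chr1_1000134 chr2_1859323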

Backend information and customization

The module uses a Cython-optimized dictionary lookup for small bed files and sqlite for larger bed files. Which backend is in use is transparent to the user; simply call lookup() as demonstrated in the example above. The default file size cutoff is ~5 million lines in the bed file, which results in a memory use of about 1.2GB for a 5-million-line file. Memory use scales linearly, so setting the limit at 1 million lines will result in about 240MB of memory use. To change the file size cutoff, edit the _max_len variable in bed_lookup/__init__.py. Be aware that the file size limit is actually measured in bytes, for speed. A dictionary mapping file sizes to file lengths is provided in __init__.py; the default should work fine on most systems.
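
As a purely hypothetical illustration of the bytes-versus-lines relationship (the real values live in the size-to-length dictionary in __init__.py, and the bytes-per-line figure below is an assumption, not a measured constant):

# Hypothetical: estimate a byte cutoff for a desired line limit.
target_lines   = 1000000  # switch to the sqlite backend above ~1 million lines
bytes_per_line = 50       # assumed average width of one bed line, in bytes
max_len_bytes  = target_lines * bytes_per_line  # candidate value for _max_len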

Note that the sqlite backend is very slightly slower for lookups. It also requires that a database already exists; if one does not (the expected name is the bed file name followed by .db), it is created, and this step can be very slow. This should only need to happen once.

As noted above, a file length lookup is performed when creating a BedFile object. This lookup can be costly, particularly for gzipped files. To skip this step, simply pass the database file to BedFile() instead of the bed file itself.
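
A minimal sketch, assuming my_bed.bed.db was created on an earlier run:

from bed_lookup import BedFile

# Passing the pre-built database skips the costly file length lookup.
b = BedFile('my_bed.bed.db')
gene = b.lookup('chr3', 1000104)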

Note: this code will work with either plain text or gzipped files; gzipped files will be slightly slower to load due to the overhead of decompression. For large files where an sqlite database already exists, there will be only a very slight delay relative to the uncompressed bed file (due to file length counting).

As the BedFile object is only generated once, any lookup after the creation of this object will be very fast (less than a second) for a bed file of any length. Smaller files will obviously result in even quicker lookups.
