A package for working with Bioinformatics data with SQL and Arrow

biobear

biobear is a Python library for reading and searching bioinformatic file formats. It uses Rust as its backend and produces Arrow Batch Readers, which can feed downstream tools such as Polars or DuckDB.

The Python package has minimal dependencies and only requires Polars. biobear can read various bioinformatic file formats, including FASTA, FASTQ, VCF, BAM, and GFF, either locally or from an object store like S3. It can also query some indexed file formats, such as VCF and BAM, locally.

Please see the documentation for information on how to get started using biobear.

Quickstart

To install biobear, run:

pip install biobear
pip install polars # needed for `to_polars` method

Create a file with some GFF data:

printf 'chr1\t.\tgene\t1\t100\t.\t+\t.\tgene_id=1;gene_name=foo\n' > test.gff
printf 'chr1\t.\tgene\t200\t300\t.\t+\t.\tgene_id=2;gene_name=bar\n' >> test.gff
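
GFF is tab-delimited, so the file needs real tab characters between its nine columns, and not every shell expands `\t` the same way. As a shell-independent alternative, here is a minimal sketch that writes the same two records from Python using only the standard library. The `write_gff` helper is a hypothetical name for illustration and is not part of biobear.

```python
from pathlib import Path

# The two example records, one list of nine GFF columns per feature.
RECORDS = [
    ["chr1", ".", "gene", "1", "100", ".", "+", ".", "gene_id=1;gene_name=foo"],
    ["chr1", ".", "gene", "200", "300", ".", "+", ".", "gene_id=2;gene_name=bar"],
]


def write_gff(path, records):
    """Write records as tab-separated lines (hypothetical helper, not a biobear API)."""
    lines = ["\t".join(fields) for fields in records]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Creates (or overwrites) test.gff in the current working directory.
write_gff("test.gff", RECORDS)
```

Joining the fields with an explicit "\t" guarantees that each line has exactly nine tab-separated columns.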

Then you can use biobear to read a file:

import biobear as bb

session = bb.connect()
df = session.sql("""
    SELECT * FROM gff_scan('test.gff')
""").to_polars()

print(df)

This will print:

┌─────────┬────────┬──────┬───────┬───┬───────┬────────┬───────┬───────────────────────────────────┐
│ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes                        │
│ ---     ┆ ---    ┆ ---  ┆ ---   ┆   ┆ ---   ┆ ---    ┆ ---   ┆ ---                               │
│ str     ┆ str    ┆ str  ┆ i64   ┆   ┆ f32   ┆ str    ┆ str   ┆ list[struct[2]]                   │
╞═════════╪════════╪══════╪═══════╪═══╪═══════╪════════╪═══════╪═══════════════════════════════════╡
│ chr1    ┆ .      ┆ gene ┆ 1     ┆ … ┆ null  ┆ +      ┆ null  ┆ [{"gene_id","1"}, {"gene_name","… │
│ chr1    ┆ .      ┆ gene ┆ 200   ┆ … ┆ null  ┆ +      ┆ null  ┆ [{"gene_id","2"}, {"gene_name","… │
└─────────┴────────┴──────┴───────┴───┴───────┴────────┴───────┴───────────────────────────────────┘
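
The attributes column in this output holds the ninth GFF field broken into key/value pairs. As a rough illustration of that structure, the sketch below splits the raw attribute string from the example file in plain Python. It is not biobear's actual parser, and `parse_attributes` is a hypothetical name.

```python
def parse_attributes(raw):
    """Split a GFF attribute string such as 'gene_id=1;gene_name=foo' into (key, value) pairs."""
    pairs = []
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        pairs.append((key, value))
    return pairs


print(parse_attributes("gene_id=1;gene_name=foo"))
# [('gene_id', '1'), ('gene_name', 'foo')]
```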

Using a Session with Exon

BioBear exposes a session object that can be used with exon to work with files directly in SQL, then eventually convert them to a DataFrame if needed.

See the BioBear Docs for more information, but in short, you can use the session like this:

import biobear as bb

session = bb.connect()

session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3 STORED AS GFF LOCATION 's3://BUCKET/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("""
    SELECT * FROM gene_annotations_s3 WHERE score > 50
""").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

Ecosystem

BioBear aims to make it easy to move data to and from prominent data tools in Python. Generally, if a tool can read Arrow, it can read BioBear's output. A few examples:

Polars

The session results and Reader objects can be converted to a Polars DataFrame.

import biobear as bb

session = bb.connect()

df = session.sql("""
    SELECT * FROM gff_scan('test.gff')
""").to_polars()

Known Issues

For GenBank and mzML, a naive SELECT * will cause an error, because Polars doesn't support all Arrow types; Map is the specific offender here. In these cases, select the fields from the map individually. Alternatively, you can first convert the table to a pandas DataFrame.

DuckDB

BioBear can also be used to read files into a DuckDB database.

import biobear as bb
import duckdb

session = bb.connect()

session.sql("""
    CREATE EXTERNAL TABLE gene_annotations STORED AS GFF LOCATION 'python/tests/data/test.gff'
""")

result = session.sql("""
    SELECT * FROM gene_annotations
""")

gff_table_arrow_table = result.to_arrow()

duckdb_conn = duckdb.connect()

result = duckdb_conn.execute('SELECT * FROM gff_table_arrow_table').fetchall()
print(result)

Performance

Please see exon's performance metrics for thorough benchmarks, but in short, biobear is generally faster than other Python libraries for reading bioinformatic file formats.

For example, here are quick benchmarks for reading one FASTA file with 1 million records, and for reading 5 FASTA files with 1 million records each, from the local file system on an M1 MacBook Pro:

Library      1 file               5 files
BioBear      4.605 s ± 0.166 s    6.420 s ± 0.113 s
BioPython    6.654 s ± 0.003 s    34.254 s ± 0.053 s

The larger difference with multiple files is due to biobear's ability to read multiple files in parallel.
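
To make the comparison concrete, the sketch below works out two ratios from the mean timings in the benchmark table: how much faster BioBear is for each workload, and how each library's time grows when going from one file to five. The `speedup` and `scaling` helper names are chosen for illustration only, and the error margins from the table are ignored.

```python
# Mean wall-clock times in seconds, copied from the benchmark table.
TIMES = {
    "BioBear": {"1 file": 4.605, "5 files": 6.420},
    "BioPython": {"1 file": 6.654, "5 files": 34.254},
}


def speedup(workload):
    """How many times faster BioBear is than BioPython for a workload."""
    return TIMES["BioPython"][workload] / TIMES["BioBear"][workload]


def scaling(library):
    """How much longer 5 files take than 1 file for a library."""
    return TIMES[library]["5 files"] / TIMES[library]["1 file"]


for workload in ("1 file", "5 files"):
    print(f"{workload}: BioBear is {speedup(workload):.2f}x faster")
for library in TIMES:
    print(f"{library}: 5 files take {scaling(library):.2f}x as long as 1 file")
```

On these numbers, BioBear is roughly 1.4 times faster for one file and roughly 5.3 times faster for five. Reading five times as much data takes BioBear only about 1.4 times as long, versus about 5.1 times as long for BioPython, which is what you would expect if the files are read in parallel rather than one after another.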



