A database to quickly read and write DNA sequence data in numerical form.

SeqBank is a command-line application for managing and processing large DNA sequence datasets. It can import local sequence files, retrieve data from remote URLs, and integrate sequences from databases such as RefSeq and DFam.

SeqBank stores sequences in a structured, numerical format optimized for fast retrieval and analysis, so you can quickly add, organize, and manipulate them. It is aimed at bioinformatics users who regularly handle large amounts of genomic data.
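To make "numerical form" concrete, here is a minimal Python sketch of one way to store DNA as one small integer per base. The mapping (N=0, A=1, C=2, G=3, T=4) and the function names are assumptions chosen for illustration, not SeqBank's documented storage format.

```python
# Hypothetical mapping for illustration; not taken from SeqBank's documentation.
ENCODE = bytes.maketrans(b"NACGT", bytes([0, 1, 2, 3, 4]))
DECODE = bytes.maketrans(bytes([0, 1, 2, 3, 4]), b"NACGT")


def encode(sequence: str) -> bytes:
    """Convert a DNA string to one small integer per base."""
    raw = sequence.upper().encode("ascii")
    # Collapse anything outside ACGT (e.g. ambiguity codes) to N first.
    cleaned = bytes(b if b in b"ACGT" else ord("N") for b in raw)
    return cleaned.translate(ENCODE)


def decode(data: bytes) -> str:
    """Convert the numerical form back to a DNA string."""
    return data.translate(DECODE).decode("ascii")


encoded = encode("acgtnR")
print(list(encoded))    # [1, 2, 3, 4, 0, 0]
print(decode(encoded))  # ACGTNN
```

A compact byte representation like this avoids re-parsing text on every read, which is the general motivation for numerical storage.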

Installation

To install the latest version from the repository, you can use this command:

pip install git+https://github.com/rbturnbull/seqbank.git

Usage

SeqBank provides a command-line interface (CLI) for managing DNA sequence data efficiently. Below are the main tools, along with examples of how to use them in practical workflows.

Adding Sequences

SeqBank can import sequence data from local files or URLs into the database. Several sequence formats are supported; specify the format of the input with --format.

Example:

To add sequences from one or more local files:

seqbank add /path/to/seqbank /path/to/sequence1.fasta /path/to/sequence2.fasta --format fasta

To add sequences from a list of URLs:

seqbank url /path/to/seqbank https://example.com/sequence1.fasta https://example.com/sequence2.fasta --format fasta --workers 4

Use case: Suppose you have a new set of genome sequences in FASTA format stored locally or accessible via URLs. You can quickly import these sequences into your SeqBank database for centralized storage and further analysis.
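For readers unfamiliar with the input, the following Python sketch shows roughly what an importer has to do with a FASTA file: split it into accession and sequence pairs. The function read_fasta is hypothetical and is not part of SeqBank; it also assumes the accession is the first word of each header line.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (accession, sequence) pairs from FASTA-formatted text."""
    accession: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                yield accession, "".join(chunks)
            words = line[1:].split()
            # Assumption: the first word of the header is the accession.
            accession = words[0] if words else ""
            chunks = []
        elif accession is not None:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


example = io.StringIO(">seq1 sample record\nACGT\nACGT\n>seq2\nGGCC\n")
print(dict(read_fasta(example)))  # {'seq1': 'ACGTACGT', 'seq2': 'GGCC'}
```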

Managing Databases

SeqBank provides commands to manage and query the sequences in your database. You can list, count, and delete sequences, allowing efficient database management.

Example:

To list all sequences in the database:

seqbank ls /path/to/seqbank

To count the number of sequences stored:

seqbank count /path/to/seqbank

To delete a specific sequence by accession number:

seqbank delete /path/to/seqbank ABC123DEF456

Use case: If you’re managing a growing sequence database, the ls command can help you track the sequences, while delete can be used to remove outdated or incorrect entries.
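The behaviour of these three commands can be pictured as operations on a store keyed by accession. The class below is a toy in-memory stand-in written for illustration; the name ToyBank and its methods are hypothetical and do not reflect SeqBank's internals or Python API.

```python
from typing import Dict, List


class ToyBank:
    """Hypothetical in-memory stand-in for an accession-keyed sequence store."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def add(self, accession: str, sequence: str) -> None:
        self._records[accession] = sequence

    def ls(self) -> List[str]:
        """Return all accessions in sorted order."""
        return sorted(self._records)

    def count(self) -> int:
        """Return the number of stored sequences."""
        return len(self._records)

    def delete(self, accession: str) -> None:
        """Remove one sequence; raises KeyError if the accession is unknown."""
        del self._records[accession]


bank = ToyBank()
bank.add("ABC123DEF456", "ACGT")
bank.add("XYZ789", "GGCC")
print(bank.ls())     # ['ABC123DEF456', 'XYZ789']
bank.delete("ABC123DEF456")
print(bank.count())  # 1
```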

Exporting Sequences

You can export stored sequences to common formats such as FASTA, so they can be shared and used with other bioinformatics tools.

Example:

To export sequences in FASTA format to a specific output directory:

seqbank export /path/to/seqbank /output/directory --format fasta

Use case: After storing a collection of curated sequences, you may need to export them in FASTA format for downstream analysis using tools like BLAST or multiple sequence alignment software.
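As an illustration of what a FASTA export produces, this Python sketch writes accession and sequence pairs as FASTA records with wrapped sequence lines. The function write_fasta and the default line width of 60 are assumptions for the example, not SeqBank behaviour.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO, width: int = 60) -> None:
    """Write (accession, sequence) pairs as FASTA, wrapping sequence lines."""
    for accession, sequence in records:
        handle.write(f">{accession}\n")
        # Emit the sequence in fixed-width slices, one per line.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


buffer = io.StringIO()
write_fasta([("seq1", "ACGTACGTAC")], buffer, width=4)
print(buffer.getvalue())
# >seq1
# ACGT
# ACGT
# AC
```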

Integration with RefSeq and DFam

SeqBank integrates with popular genomic databases like RefSeq and DFam, allowing users to download and incorporate sequences from these sources.

Example:

To download and add at most 1000 RefSeq sequences using 4 workers:

seqbank refseq /path/to/seqbank --max 1000 --workers 4

To download and add curated DFam sequences from the current release:

seqbank dfam /path/to/seqbank --release current --curated

Use case: If you are studying repetitive elements in a genome, you can easily integrate sequences from DFam into your SeqBank database for comprehensive analysis.

Visualization of Sequence Data

SeqBank includes built-in functionality for generating histograms of sequence lengths, providing a visual summary of the data.

Example:

To generate and save a histogram of sequence lengths:

seqbank histogram /path/to/seqbank --output histogram.png --nbins 50

To generate and display the histogram interactively:

seqbank histogram /path/to/seqbank --show --nbins 50

Use case: When working with a dataset of varying sequence lengths, generating a histogram can help visualize the distribution and detect outliers or inconsistencies in the data.
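The binning behind such a histogram can be sketched in a few lines of Python. The function length_histogram is hypothetical and uses equal-width bins between the shortest and longest sequence; SeqBank's own plotting code may bin differently.

```python
from typing import List, Sequence


def length_histogram(lengths: Sequence[int], nbins: int) -> List[int]:
    """Count how many lengths fall into each of nbins equal-width bins."""
    if not lengths or nbins < 1:
        raise ValueError("need at least one length and one bin")
    low, high = min(lengths), max(lengths)
    # Fall back to a width of 1 when every sequence has the same length.
    width = (high - low) / nbins or 1
    counts = [0] * nbins
    for length in lengths:
        # The longest sequence lands on the upper edge; clamp it into the last bin.
        index = min(int((length - low) / width), nbins - 1)
        counts[index] += 1
    return counts


print(length_histogram([100, 150, 200, 250, 1000], nbins=3))  # [4, 0, 1]
```

In this example the single 1000 bp sequence sits alone in the last bin, which is the kind of outlier a length histogram makes easy to spot.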

Copying Databases

SeqBank allows you to copy sequences from one SeqBank database to another, facilitating data migration or backup processes.

Example:

To copy sequences from a source SeqBank to a destination SeqBank:

seqbank cp /path/to/source_seqbank /path/to/destination_seqbank

Use case: For maintaining backups of your sequence database or migrating data to a new location, the cp command provides a straightforward method to duplicate your SeqBank data.

Filtering Sequences and Custom Workflows

SeqBank supports filtering sequences based on criteria such as sequence length or file format before adding them to the database. Additionally, multi-threaded downloading allows you to download and process sequences more efficiently.

Example:

To filter sequences before adding them, for example to keep only those longer than 1000 bp, pass a filter file with --filter:

seqbank add /path/to/seqbank /path/to/sequences.fasta --format fasta --filter /path/to/filter_file

To enable multi-threaded downloading when adding sequences from URLs:

seqbank url /path/to/seqbank https://example.com/sequence1.fasta https://example.com/sequence2.fasta --format fasta --workers 4 --tmp-dir /path/to/tmp

Use case: In projects where only sequences longer than a specific threshold are required, the filtering feature ensures that only relevant sequences are stored. Multi-threaded downloading can be utilized when processing large datasets to save time.
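The two ideas in this section, length filtering and parallel downloads, can be sketched in Python as follows. The helpers longer_than and fetch_all are hypothetical, and the fetch function is a stand-in that performs no network access; none of this is SeqBank's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (accession, sequence)


def longer_than(records: Iterable[Record], min_length: int) -> Iterator[Record]:
    """Yield only records whose sequence is longer than min_length bases."""
    for accession, sequence in records:
        if len(sequence) > min_length:
            yield accession, sequence


def fetch_all(urls: List[str], fetch: Callable[[str], str], workers: int = 4) -> Dict[str, str]:
    """Apply fetch to every URL using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url: str) -> str:
    """Stand-in for a real download; returns a placeholder string."""
    return f"contents of {url}"


records = [("short", "ACGT"), ("long", "A" * 1001)]
print([accession for accession, _ in longer_than(records, 1000)])  # ['long']

urls = ["https://example.com/sequence1.fasta", "https://example.com/sequence2.fasta"]
print(fetch_all(urls, fake_fetch, workers=4))
```

Threads suit this task because downloads spend most of their time waiting on the network, so several can overlap even in a single Python process.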
