A bioinformatics data wrangling package for FASTA, FASTQ, VCF, and GFF files.

Bio-Wrangler

Bio-Wrangler is a bioinformatics data wrangling package for FASTA, FASTQ, VCF, and GFF files. It loads these formats into pandas DataFrames so that you can filter, merge, and summarize biological datasets with a small set of method calls.

Features

  • Load FASTA, FASTQ, VCF, and GFF files into pandas DataFrames.
  • Filter data by quality, chromosome, position, and specific attributes.
  • Merge and summarize datasets.
  • Save data to CSV or Excel formats.

Installation

You can install Bio-Wrangler directly from PyPI:

pip install bio-wrangler

Usage

Here’s how to use Bio-Wrangler to load, filter, and manipulate your bioinformatics datasets.

Loading Data

You can load data from FASTA, FASTQ, VCF, and GFF formats into pandas DataFrames for easy manipulation.

Example: Loading FASTA, FASTQ, VCF, and GFF Files

from bio_wrangler.bio_wrangler import BioWrangler

# Initialize the BioWrangler class
wrangler = BioWrangler()

# Load data from different formats
fasta_data = wrangler.load_fasta('path/to/sample.fasta')
fastq_data = wrangler.load_fastq('path/to/sample.fastq')
vcf_data = wrangler.load_vcf('path/to/sample.vcf')
gff_data = wrangler.load_gff('path/to/sample.gff')

# Display the first few rows of the datasets
print(fasta_data.head())
print(fastq_data.head())
print(vcf_data.head())
print(gff_data.head())
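To give a feel for what loading a FASTA file into tabular form involves, here is a minimal standard-library sketch that turns FASTA text into one row per record. It is not Bio-Wrangler's implementation: the function names `parse_fasta` and `_make_record` and the row keys `id`, `description`, `sequence`, and `length` are assumptions made for illustration, and the columns that `load_fasta` actually returns may differ.

```python
import io


def parse_fasta(handle):
    """Yield one dict per FASTA record read from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The first whitespace-delimited token is the ID; the rest is free text.
    seq_id, _, description = header.partition(" ")
    sequence = "".join(chunks)
    return {
        "id": seq_id,
        "description": description,
        "sequence": sequence,
        "length": len(sequence),
    }


# Hypothetical in-memory sample standing in for 'path/to/sample.fasta'.
sample = io.StringIO(">seq1 demo record\nACGT\nACG\n>seq2\nGGCC\n")
records = list(parse_fasta(sample))
for record in records:
    print(record)
```

A list of dicts like `records` has the shape that `pandas.DataFrame(records)` accepts, which is one plausible way to arrive at a DataFrame with one row per sequence.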

Filtering Data

You can filter the data by quality, chromosome, position, or specific attributes.

Example: Filtering FASTQ by Quality

filtered_fastq = wrangler.filter_fastq_by_quality(fastq_data, 30.0)
print(filtered_fastq.head())  # Display FASTQ sequences with avg quality >= 30
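The threshold in the example above is an average quality score. As a rough illustration of how such an average can be computed, the sketch below decodes a FASTQ quality string and keeps reads whose mean score is at least 30. It assumes the common Phred+33 encoding; the helper name `mean_phred` and the sample reads are made up for this example, and Bio-Wrangler may compute the average differently.

```python
def mean_phred(quality_string, offset=33):
    """Return the mean Phred score of a FASTQ quality string.

    Assumes Phred+33 encoding by default: score = ord(char) - 33.
    """
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


# Hypothetical reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
reads = [
    {"id": "read1", "sequence": "ACGT", "quality": "IIII"},
    {"id": "read2", "sequence": "ACGT", "quality": "5555"},
    {"id": "read3", "sequence": "ACGT", "quality": "II55"},
]

# Keep reads whose mean quality meets the threshold (inclusive).
threshold = 30.0
kept = [read for read in reads if mean_phred(read["quality"]) >= threshold]
kept_ids = [read["id"] for read in kept]
print(kept_ids)
```

Because the filter uses the mean, a read with a few very low-quality bases can still pass if the rest of the read scores highly.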

Example: Filtering VCF by Chromosome and Position Range

filtered_vcf_by_chr = wrangler.filter_by_chromosome(vcf_data, 'chr1')
filtered_vcf_by_pos = wrangler.filter_by_position_range(vcf_data, 100000, 500000)

print(filtered_vcf_by_chr.head())
print(filtered_vcf_by_pos.head())
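Note that the two calls above each start from `vcf_data`, so neither result is restricted by both criteria at once. The standard-library sketch below shows the two filters on plain rows and how chaining them narrows the result to one chromosome and one interval. The function bodies are illustrative stand-ins, not Bio-Wrangler's code; the `CHROM` and `POS` keys follow the VCF column names, and treating both range bounds as inclusive is an assumption.

```python
def filter_by_chromosome(rows, chromosome):
    """Keep rows whose CHROM field equals the given chromosome name."""
    return [row for row in rows if row["CHROM"] == chromosome]


def filter_by_position_range(rows, start, end):
    """Keep rows with start <= POS <= end (inclusive bounds assumed)."""
    return [row for row in rows if start <= row["POS"] <= end]


# Hypothetical variant rows standing in for a loaded VCF file.
variants = [
    {"CHROM": "chr1", "POS": 99999, "REF": "A", "ALT": "G"},
    {"CHROM": "chr1", "POS": 100000, "REF": "C", "ALT": "T"},
    {"CHROM": "chr1", "POS": 250000, "REF": "G", "ALT": "A"},
    {"CHROM": "chr2", "POS": 300000, "REF": "T", "ALT": "C"},
    {"CHROM": "chr1", "POS": 500001, "REF": "A", "ALT": "T"},
]

chr1_only = filter_by_chromosome(variants, "chr1")
in_range = filter_by_position_range(variants, 100000, 500000)

# Chaining applies both criteria: chr1 AND within the interval.
chr1_in_range = filter_by_position_range(chr1_only, 100000, 500000)

print(len(chr1_only), len(in_range), len(chr1_in_range))
```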

Example: Filtering GFF by Attribute

filtered_gff = wrangler.filter_by_attribute(gff_data, 'ID', 'gene1')
print(filtered_gff.head())  # Filter by gene ID
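In GFF3, attributes such as `ID` live in the ninth column as semicolon-separated `key=value` pairs, with reserved characters percent-encoded. The following sketch shows one way to parse that column and match on an attribute value. The helper `parse_gff3_attributes` and the sample rows are hypothetical and only illustrate the idea; how Bio-Wrangler stores and matches attributes internally is not documented here.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse a GFF3 attribute column into a dict of decoded key/value pairs."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate trailing or doubled semicolons
        key, sep, value = field.partition("=")
        if not sep:
            continue  # skip malformed fields without '='
        attributes[unquote(key.strip())] = unquote(value.strip())
    return attributes


# Hypothetical GFF3 rows reduced to the fields needed for this example.
features = [
    {"seqid": "chr1", "type": "gene", "attributes": "ID=gene1;Name=abc%3Bdef"},
    {"seqid": "chr1", "type": "gene", "attributes": "ID=gene2;Name=xyz;"},
]

# Keep features whose ID attribute equals 'gene1'.
matching = [
    feature
    for feature in features
    if parse_gff3_attributes(feature["attributes"]).get("ID") == "gene1"
]
print(matching)
```

GTF/GFF2 files use a different attribute syntax (`key "value";`), so this parser applies only to GFF3-style input.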

Summarizing Data

Generate a summary of the dataset, including total rows, average quality, and positional statistics.

Example: Summarizing FASTQ and VCF Data

fastq_summary = wrangler.summarize_fastq(fastq_data)
vcf_summary = wrangler.summarize_data(vcf_data)

print(fastq_summary)
print(vcf_summary)
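As a rough picture of what "total rows" and "positional statistics" can mean for variant data, the sketch below computes a row count plus the minimum, maximum, and mean position. The function name `summarize_positions` and the keys of the returned dict are assumptions for illustration; the actual structure returned by `summarize_data` may be different.

```python
from statistics import mean


def summarize_positions(rows):
    """Return row count and basic positional statistics for variant rows."""
    positions = [row["POS"] for row in rows]
    if not positions:
        return {"total_rows": 0}
    return {
        "total_rows": len(rows),
        "min_pos": min(positions),
        "max_pos": max(positions),
        "mean_pos": mean(positions),
    }


# Hypothetical variant rows standing in for a loaded VCF file.
variants = [
    {"CHROM": "chr1", "POS": 100},
    {"CHROM": "chr1", "POS": 200},
    {"CHROM": "chr2", "POS": 600},
]

summary = summarize_positions(variants)
print(summary)
```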

Merging Datasets

Merge multiple datasets (e.g., two VCF datasets) into one for combined analysis.

Example: Merging VCF Datasets

merged_vcf = wrangler.merge_datasets(vcf_data, filtered_vcf_by_chr)
print(merged_vcf.head())  # Combined dataset
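The example above merges a dataset with a filtered subset of itself. Whether `merge_datasets` concatenates rows or removes duplicates is not stated here, so it is worth checking the result for repeated variants. The sketch below is a hypothetical illustration of the difference using plain rows: `concat_unique` and the choice of `CHROM`, `POS`, `REF`, and `ALT` as the identity key are assumptions, not part of Bio-Wrangler.

```python
def concat_unique(first, second, key_fields=("CHROM", "POS", "REF", "ALT")):
    """Concatenate two row lists, keeping the first copy of each variant."""
    seen = set()
    merged = []
    for row in list(first) + list(second):
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


# Hypothetical variant rows and a chr1-only subset of the same rows.
all_variants = [
    {"CHROM": "chr1", "POS": 100000, "REF": "C", "ALT": "T"},
    {"CHROM": "chr1", "POS": 250000, "REF": "G", "ALT": "A"},
    {"CHROM": "chr2", "POS": 300000, "REF": "T", "ALT": "C"},
]
chr1_subset = [row for row in all_variants if row["CHROM"] == "chr1"]

naive = all_variants + chr1_subset            # plain concatenation repeats chr1 rows
deduplicated = concat_unique(all_variants, chr1_subset)

print(len(naive), len(deduplicated))
```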

Saving Data

You can save your processed data to a file in either CSV or Excel format.

Example: Saving Filtered VCF Data to a CSV File

wrangler.save_data(filtered_vcf_by_chr, 'filtered_vcf_output.csv', 'csv')

License

This project is licensed under the MIT License.

