A bioinformatics data wrangling package for FASTA, FASTQ, VCF, and GFF files.
Bio-Wrangler
Bio-Wrangler is a bioinformatics data wrangling package for handling FASTA, FASTQ, VCF, and GFF files. It loads each format into a pandas DataFrame so you can filter, merge, summarize, and export biological datasets with a few method calls.
Features
- Load FASTA, FASTQ, VCF, and GFF files into pandas DataFrames.
- Filter data by quality, chromosome, position, and specific attributes.
- Merge and summarize datasets.
- Save data to CSV or Excel formats.
Installation
You can install Bio-Wrangler directly from PyPI:
pip install bio-wrangler
Usage
Here’s how to use Bio-Wrangler to load, filter, and manipulate your bioinformatics datasets.
Loading Data
You can load data from FASTA, FASTQ, VCF, and GFF formats into pandas DataFrames for easy manipulation.
Example: Loading FASTA, FASTQ, VCF, and GFF Files
from bio_wrangler.bio_wrangler import BioWrangler

# Initialize the BioWrangler class
wrangler = BioWrangler()

# Load data from different formats
fasta_data = wrangler.load_fasta('path/to/sample.fasta')
fastq_data = wrangler.load_fastq('path/to/sample.fastq')
vcf_data = wrangler.load_vcf('path/to/sample.vcf')
gff_data = wrangler.load_gff('path/to/sample.gff')

# Display the first few rows of the datasets
print(fasta_data.head())
print(fastq_data.head())
print(vcf_data.head())
print(gff_data.head())
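To make the idea of "one row per record" concrete, here is a minimal sketch of turning FASTA text into tabular records. It is not Bio-Wrangler's implementation: the helper names (`parse_fasta`, `_make_record`) and the column names (`id`, `description`, `sequence`) are assumptions for illustration, and it uses plain Python lists of dicts instead of pandas so it runs without dependencies.

```python
import io


def _make_record(header, chunks):
    """Split a header into id/description and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return {"id": seq_id, "description": description, "sequence": "".join(chunks)}


def parse_fasta(handle):
    """Return one dict per FASTA record (hypothetical column names)."""
    records = []
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


sample = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
fasta_records = parse_fasta(sample)
print(fasta_records)
```

A list of dicts like this maps directly onto a table with one column per key, which is the shape the loaders above are described as returning.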
Filtering Data
You can filter the data by quality, chromosome, position, or specific attributes.
Example: Filtering FASTQ by Quality
filtered_fastq = wrangler.filter_fastq_by_quality(fastq_data, 30.0)
print(filtered_fastq.head())  # Display FASTQ sequences with avg quality >= 30
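For readers wondering what an "average quality" threshold means, the sketch below computes a mean Phred score from a FASTQ quality string and keeps reads at or above 30. It assumes the common Phred+33 encoding and a simple arithmetic mean; whether Bio-Wrangler computes the average the same way is an assumption, and `mean_phred` and `keep_high_quality` are hypothetical helpers, not part of the package.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not quality_string:
        return 0.0
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def keep_high_quality(reads, min_quality):
    """Keep reads whose mean Phred score is at least min_quality."""
    return [read for read in reads if mean_phred(read["quality"]) >= min_quality]


reads = [
    {"id": "read1", "sequence": "ACGT", "quality": "IIII"},  # 'I' encodes Q40
    {"id": "read2", "sequence": "ACGT", "quality": "!!!!"},  # '!' encodes Q0
    {"id": "read3", "sequence": "ACGT", "quality": "II55"},  # '5' encodes Q20
]

high_quality = keep_high_quality(reads, 30.0)
print([read["id"] for read in high_quality])
```

Note that a mean can hide a few very poor bases in an otherwise good read, so a threshold of 30 does not guarantee every base is Q30.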
Example: Filtering VCF by Chromosome and Position Range
filtered_vcf_by_chr = wrangler.filter_by_chromosome(vcf_data, 'chr1')
filtered_vcf_by_pos = wrangler.filter_by_position_range(vcf_data, 100000, 500000)

print(filtered_vcf_by_chr.head())
print(filtered_vcf_by_pos.head())
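The two filters above are applied independently to `vcf_data`, so the position filter on its own keeps matching positions on every chromosome. The sketch below illustrates that with plain Python rows. The helpers `keep_chromosome` and `keep_position_range` are hypothetical stand-ins, the `CHROM` and `POS` keys follow the VCF column names, and treating the range as inclusive on both ends is an assumption about the intended behavior.

```python
def keep_chromosome(rows, chromosome):
    """Keep rows located on the given chromosome."""
    return [row for row in rows if row["CHROM"] == chromosome]


def keep_position_range(rows, start, end):
    """Keep rows with start <= POS <= end (inclusive bounds assumed)."""
    return [row for row in rows if start <= row["POS"] <= end]


variants = [
    {"CHROM": "chr1", "POS": 99999, "REF": "A", "ALT": "G"},
    {"CHROM": "chr1", "POS": 100000, "REF": "C", "ALT": "T"},
    {"CHROM": "chr2", "POS": 250000, "REF": "G", "ALT": "A"},
    {"CHROM": "chr1", "POS": 500001, "REF": "T", "ALT": "C"},
]

# Position filter alone: rows from any chromosome can pass.
in_range = keep_position_range(variants, 100000, 500000)

# Chaining the filters restricts the range to a single chromosome.
chr1_in_range = keep_position_range(keep_chromosome(variants, "chr1"), 100000, 500000)

print(in_range)
print(chr1_in_range)
```

If you want a region on one chromosome, chaining the two filters, as in the last step, is the safer pattern.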
Example: Filtering GFF by Attribute
filtered_gff = wrangler.filter_by_attribute(gff_data, 'ID', 'gene1')
print(filtered_gff.head())  # Filter by gene ID
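In GFF3, attributes such as `ID` live in the ninth column as `key=value` pairs separated by semicolons. The sketch below shows one way to parse that column and match on an attribute. It illustrates the file format and is not Bio-Wrangler's code: `parse_gff3_attributes` and `rows_with_attribute` are hypothetical helpers, and the sample lines are made up.

```python
from urllib.parse import unquote


def parse_gff3_attributes(field):
    """Parse a GFF3 attributes column ('ID=gene1;Name=abc') into a dict."""
    attributes = {}
    for pair in field.strip().split(";"):
        key, separator, value = pair.partition("=")
        if separator:  # ignore empty or malformed fragments
            attributes[key.strip()] = unquote(value.strip())  # GFF3 percent-encoding
    return attributes


def rows_with_attribute(lines, key, value):
    """Return feature rows whose attribute `key` equals `value`."""
    matches = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks, comments, and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attributes = parse_gff3_attributes(columns[8])
        if attributes.get(key) == value:
            matches.append({"seqid": columns[0], "type": columns[2],
                            "start": int(columns[3]), "end": int(columns[4]),
                            "attributes": attributes})
    return matches


gff_lines = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=abc",
    "chr1\texample\tgene\t3000\t4000\t.\t-\t.\tID=gene2;Name=def",
]

gene1_rows = rows_with_attribute(gff_lines, "ID", "gene1")
print(gene1_rows)
```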
Summarizing Data
Generate a summary of the dataset, including total rows, average quality, and positional statistics.
Example: Summarizing FASTQ and VCF Data
fastq_summary = wrangler.summarize_fastq(fastq_data)
vcf_summary = wrangler.summarize_data(vcf_data)

print(fastq_summary)
print(vcf_summary)
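As a rough picture of what "total rows" and "positional statistics" could contain, here is a minimal sketch over plain Python rows. The exact fields Bio-Wrangler reports are not documented here, so the key names (`total_rows`, `min_position`, `max_position`, `mean_position`) and the helper `summarize_positions` are assumptions for illustration only.

```python
from statistics import mean


def summarize_positions(rows):
    """Summarize row count and POS statistics (hypothetical field names)."""
    positions = [row["POS"] for row in rows]
    if not positions:
        return {"total_rows": 0}  # nothing to summarize
    return {
        "total_rows": len(positions),
        "min_position": min(positions),
        "max_position": max(positions),
        "mean_position": mean(positions),
    }


variant_rows = [
    {"CHROM": "chr1", "POS": 100},
    {"CHROM": "chr1", "POS": 200},
    {"CHROM": "chr2", "POS": 600},
]

position_summary = summarize_positions(variant_rows)
print(position_summary)
```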
Merging Datasets
Merge multiple datasets (e.g., two VCF datasets) into one for combined analysis.
Example: Merging VCF Datasets
merged_vcf = wrangler.merge_datasets(vcf_data, filtered_vcf_by_chr)
print(merged_vcf.head())  # Combined dataset
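This README does not say how `merge_datasets` combines rows. One plausible reading is row-wise concatenation, and the sketch below shows that approach with exact duplicates dropped, which matters when one input is a subset of the other as in the example above. This is an assumption about the semantics, not a description of the package, and `concat_unique` is a hypothetical helper.

```python
def concat_unique(*datasets):
    """Concatenate datasets row-wise, keeping the first copy of identical rows."""
    seen = set()
    merged = []
    for dataset in datasets:
        for row in dataset:
            # Rows are dicts with hashable values; sort items for a stable key.
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


all_variants = [
    {"CHROM": "chr1", "POS": 100000},
    {"CHROM": "chr2", "POS": 250000},
]
chr1_variants = [{"CHROM": "chr1", "POS": 100000}]  # subset of all_variants
extra_variants = [{"CHROM": "chr3", "POS": 42}]

merged_rows = concat_unique(all_variants, chr1_variants, extra_variants)
print(merged_rows)
```

If the real method keeps duplicates, check the row count after merging a dataset with its own filtered subset.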
Saving Data
You can save your processed data to a file in either CSV or Excel format.
Example: Saving Filtered VCF Data to a CSV File
wrangler.save_data(filtered_vcf_by_chr, 'filtered_vcf_output.csv', 'csv')
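To show what ends up in a CSV export, the sketch below writes rows with Python's standard `csv` module and reads them back. It is independent of Bio-Wrangler: `save_rows_as_csv` is a hypothetical helper and the rows are made-up examples. The round trip also shows a common surprise, namely that numeric columns come back as strings unless the reader converts them.

```python
import csv
import os
import tempfile


def save_rows_as_csv(rows, path):
    """Write a list of dicts to CSV, using the first row's keys as the header."""
    if not rows:
        raise ValueError("no rows to save")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


filtered_rows = [
    {"CHROM": "chr1", "POS": 100000, "REF": "C", "ALT": "T"},
    {"CHROM": "chr1", "POS": 250000, "REF": "G", "ALT": "A"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "filtered_vcf_output.csv")
    save_rows_as_csv(filtered_rows, out_path)
    # Read the file back to inspect what was written.
    with open(out_path, newline="", encoding="utf-8") as handle:
        round_trip = list(csv.DictReader(handle))

print(round_trip)
```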
License
This project is licensed under the MIT License.