A Python package for common biological data I/O

biodata - A standard biological data processing package

The biodata package provides a standard API for accessing different kinds of biological data with similar syntax. Each data type can be accessed in either a stream-based or an index-based manner. For stream-based access, entries are read by a reader (XX-Reader) and written by a writer (XX-Writer). For example, FASTAReader processes a FASTA file, and each call to its read() method returns a FASTA object. For index-based access, XX-IReader supports random access to an indexed file. For example, FASTAIReader reads an indexed FASTA file (with its associated .fai file) and returns a FASTA object for the requested region.
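To make the stream-based idea concrete, here is a minimal standard-library sketch of reading FASTA entries one at a time. It is not biodata's implementation: `Record` and `iter_fasta` are hypothetical names used only for illustration, and the whole header line is taken as the entry name for simplicity.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Record(NamedTuple):
    """Hypothetical stand-in for a FASTA entry (not biodata's FASTA class)."""
    name: str
    seq: str


def iter_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield one record at a time, so only the current entry is held in memory."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield Record(name, "".join(chunks))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield Record(name, "".join(chunks))


records = list(iter_fasta(io.StringIO(">seq1\nACGT\n>seq2\nCCCGGGAAA\n")))
print(records)
```

A stream like this suits a single pass over a large file. Index-based access trades that simplicity for the ability to jump straight to a region.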

Installation

pip install biodata

Quick Start

For more advanced use, please see the Basic usage section.

# Reading file contents
from biodata.fasta import FASTAReader
seq_dict = FASTAReader.read_all(lambda fr: {f.name:f.seq for f in fr}, "input.fa") 

from biodata.bed import BEDReader
from genomictools import GenomicCollection
beds = BEDReader.read_all(GenomicCollection, "input.bed")

from biodata.gff import GFF3Reader
from genomictools import GenomicCollection
gff3s = GFF3Reader.read_all(GenomicCollection, "input.gff3")

from biodata.gff import GTFReader
from genomictools import GenomicCollection
gtfs = GTFReader.read_all(GenomicCollection, "input.gtf")

Basic usage

We demonstrate the use of the biodata package with the following FASTA file.

>seq1
ACGT
>seq2
CCCGGGAAA

Reading data

Read the first entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	f = fr.read()
	print(f.name, f.seq) # seq1 ACGT

Read entry by entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	for f in fr:
		print(f.name, f.seq)
# seq1 ACGT
# seq2 CCCGGGAAA

# The same pattern works for other formats, such as BED
from biodata.bed import BEDReader
with BEDReader(bedfile) as br:
	for b in br:
		print(b.name, str(b.genomic_pos))

Read all entries at once

from biodata.fasta import FASTAReader
fasta_entries = FASTAReader.read_all(list, filename) # list of FASTA

seq_dict = FASTAReader.read_all(lambda fr: {f.name:f.seq for f in fr}, filename) 
# A dictionary with fasta name as key and fasta sequence as value
# {"seq1": "ACGT", "seq2": "CCCGGGAAA"}

# For genomic range data, one could also use GenomicCollection to store them:
from biodata.bed import BEDReader
from genomictools import GenomicCollection
beds = BEDReader.read_all(GenomicCollection, filename)
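The first argument to read_all is a callable that receives the open reader and returns whatever container you want. The sketch below illustrates that pattern with the standard library only. `ToyReader` is a hypothetical class, and its internals are an assumption about how such a method could work, not biodata's actual code.

```python
import io


class ToyReader:
    """Hypothetical line reader illustrating the read_all(func, source) pattern."""

    def __init__(self, handle):
        self._handle = handle

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._handle.close()

    def __iter__(self):
        return (line.rstrip("\n") for line in self._handle)

    @classmethod
    def read_all(cls, func, handle):
        # Open the reader, hand the whole iterable to func, then close it.
        with cls(handle) as reader:
            return func(reader)


as_list = ToyReader.read_all(list, io.StringIO("a\nb\n"))
as_dict = ToyReader.read_all(lambda r: {x: len(x) for x in r}, io.StringIO("a\nbb\n"))
print(as_list, as_dict)
```

Because the callable consumes the reader before it is closed, any constructor that accepts an iterable (list, a dict comprehension wrapped in a lambda, or a collection class) can be passed in.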

Peek at an entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	f = fr.peek() # Peek at the entry without advancing to the next one
	print(f.name, f.seq) # seq1 ACGT
	f = fr.read() # Read the entry and proceed to the next entry
	print(f.name, f.seq) # seq1 ACGT
	f = fr.read()
	print(f.name, f.seq) # seq2 CCCGGGAAA
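One common way to support peek() is a one-entry lookahead buffer: peek fills the buffer without consuming it, and read empties it. The following is a minimal sketch under that assumption. `PeekableLines` is a hypothetical class that treats each line as an entry, and it does not claim to mirror biodata's internals.

```python
import io


class PeekableLines:
    """Hypothetical reader with a one-entry lookahead buffer."""

    _EMPTY = object()  # Sentinel meaning "nothing buffered yet"

    def __init__(self, handle):
        self._handle = handle
        self._buffer = self._EMPTY

    def _next_entry(self):
        line = self._handle.readline()
        return line.rstrip("\n") if line else None  # None marks end of input

    def peek(self):
        # Fill the buffer on first use, then keep returning the same entry.
        if self._buffer is self._EMPTY:
            self._buffer = self._next_entry()
        return self._buffer

    def read(self):
        # Return the buffered entry (loading it if needed) and clear the buffer.
        entry = self.peek()
        self._buffer = self._EMPTY
        return entry


reader = PeekableLines(io.StringIO("seq1\nseq2\n"))
first_peek = reader.peek()
first_read = reader.read()
second_read = reader.read()
third_read = reader.read()
print(first_peek, first_read, second_read, third_read)
```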

Read entries from StringIO

# A TextIOBase object can be used as input instead of a filename
import io
from biodata.fasta import FASTAReader
FASTAReader.read_all(list, io.StringIO(">seq1\nACGT\n>seq2\nCCCGGGAAA\n"))

Read an indexed file

from biodata.fasta import FASTAIReader
from genomictools import GenomicPos, StrandedGenomicPos
from biodata.bed import BED

fir = FASTAIReader(filename, faifilename) # fai file can be created using 'samtools faidx filename'
f = fir[GenomicPos("seq2:1-4")] # Read from a region without strand
print(f.name, f.seq) # seq2:1-4 CCCG
f = fir[StrandedGenomicPos("seq2:1-4:-")] # Read from a region with strand
print(f.name, f.seq) # seq2:1-4:- CGGG
f = fir[BED("seq2", 0, 4, strand="-")] # Equivalent to StrandedGenomicPos but a BED entry is used
print(f.name, f.seq) # seq2:1-4:- CGGG
fir.close()
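The outputs above rest on two conventions: a region string such as "seq2:1-4" is 1-based and inclusive, whereas BED coordinates (0, 4) are 0-based and half-open, and a minus-strand request returns the reverse complement. The sketch below reproduces the example with the standard library, using the OFFSET, LINEBASES and LINEWIDTH columns of a .fai record to locate bytes. The helper names are hypothetical and this is not how FASTAIReader is necessarily implemented.

```python
def region_to_half_open(start, end):
    """Convert a 1-based inclusive region (1-4) to 0-based half-open (0, 4)."""
    return start - 1, end


def fai_byte_offset(pos0, offset, linebases, linewidth):
    """Byte position of the 0-based base pos0, from the columns of a .fai record."""
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


data = b">seq1\nACGT\n>seq2\nCCCGGGAAA\n"
# .fai values for seq2 in this file: sequence starts at byte 17,
# with 9 bases and 10 bytes (bases plus newline) per line.
offset, linebases, linewidth = 17, 9, 10

start0, end0 = region_to_half_open(1, 4)
begin = fai_byte_offset(start0, offset, linebases, linewidth)
stop = fai_byte_offset(end0, offset, linebases, linewidth)
# Strip newlines in case the slice spans several sequence lines.
plus = data[begin:stop].replace(b"\n", b"").decode()
minus = reverse_complement(plus)
print(plus, minus)  # CCCG CGGG
```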

Writing

Write entry by entry

from biodata.fasta import FASTA, FASTAWriter
with FASTAWriter(output_file) as fw:
	fw.write(FASTA("seq1", "ACGT"))
	fw.write(FASTA("seq2", "CCCGGGAAA"))

Write all entries at once

from biodata.fasta import FASTA, FASTAWriter
fasta_entries = [FASTA("seq1", "ACGT"), FASTA("seq2", "CCCGGGAAA")]
FASTAWriter.write_all(fasta_entries, output_file)

List of supported formats

  1. Delimited - tsv, csv (biodata.delimited)
  2. FASTA, FASTQ (biodata.fasta)
  3. BED3, BED, BEDX, BEDGraph, BEDPE, BigBed, ENCODENarrowPeak (biodata.bed)
  4. GFF3, GTF (GFF2) (biodata.gff)
  5. BigWig (biodata.bigwig, requires the pyBigWig package)
  6. bwa FastMap (biodata.bwa.fastmap)
  7. MEME Motif Format (biodata.meme)

Formats planned for future support

  1. VCF (biodata.vcf)

Extension of BaseReader

Users can easily extend the BaseReader and BaseWriter classes to accommodate other formats not currently supported by biodata.

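# Assumed import path (not shown in the original); adjust if BaseReader lives elsewhere in your installed version
from biodata.baseio import BaseReader

class ExampleNode(object):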
	def __init__(self, value1, value2):
		self.value1 = value1
		self.value2 = value2

class ExampleNodeReader(BaseReader):
	def __init__(self, filename):
		super(ExampleNodeReader, self).__init__(filename)
	def _read(self):
		if self.line is None:
			return None
		words_array = self.line.split('\t')
		value1 = words_array[0]
		value2 = words_array[1]
		self.proceed_next_line()
		return ExampleNode(value1, value2)

filename = "SomeDocument.txt"
with ExampleNodeReader(filename) as er:
	for node in er:
		print(node.value1, node.value2)
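The ExampleNodeReader above relies on three hooks: self.line holds the current line (None at end of input), proceed_next_line() advances it, and _read() returns one entry or None. To show how those hooks fit together, here is a toy stand-in written with the standard library. `ToyBaseReader` and `PairReader` are hypothetical, and the stand-in's internals are an assumption for illustration, not biodata's BaseReader.

```python
import io


class ToyBaseReader:
    """Hypothetical stand-in exposing the hooks a subclass's _read() uses."""

    def __init__(self, handle):
        self._handle = handle
        self.line = None
        self.proceed_next_line()  # Preload the first line

    def proceed_next_line(self):
        raw = self._handle.readline()
        self.line = raw.rstrip("\n") if raw else None  # None marks end of input

    def read(self):
        return self._read()

    def __iter__(self):
        # Keep reading until the subclass signals the end with None.
        while True:
            entry = self.read()
            if entry is None:
                return
            yield entry


class PairReader(ToyBaseReader):
    def _read(self):
        if self.line is None:
            return None
        value1, value2 = self.line.split("\t")[:2]
        self.proceed_next_line()
        return (value1, value2)


pairs = list(PairReader(io.StringIO("a\t1\nb\t2\n")))
print(pairs)
```

With this contract, a subclass only has to describe how one entry is parsed from the current line. Iteration and end-of-input handling stay in the base class.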

Download files

Source Distribution: biodata-0.1.3.tar.gz (18.5 kB)

Built Distribution: biodata-0.1.3-py3-none-any.whl (20.2 kB, Python 3)

