A Python package for common biological data I/O

Project description

biodata - A standard biological data processing package

The biodata package provides a standard API for accessing many kinds of biological data with a consistent syntax. For each data type, data is processed as a stream of entries by a corresponding reader (XX-Reader) and writer (XX-Writer). For example, FASTAReader processes FASTA files, and each call to its read() method returns the next FASTA object. For indexed files, random access is supported through the XX-IReader. For example, an indexed FASTA file can be accessed with FASTAIReader.
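To make the "stream of entries" idea concrete, here is a minimal standard-library sketch of a reader that supports read(), peek() and iteration. It is an illustrative stand-in and not biodata's implementation: the class name MiniFASTAReader, the tuple entries, and the choice to return None at the end of the stream are all assumptions made for this example.

```python
import io

class MiniFASTAReader:
    """Illustrative stand-in only; not biodata's FASTAReader."""

    def __init__(self, stream):
        self._lines = iter(stream)
        self._pending_header = None  # header of the entry that has not been parsed yet
        self._buffer = None          # entry cached by peek()
        # Skip ahead to the first header line, if any.
        for line in self._lines:
            if line.startswith(">"):
                self._pending_header = line.strip()[1:]
                break

    def _parse(self):
        """Parse one entry, or return None when the stream is exhausted."""
        if self._pending_header is None:
            return None
        name, seq_parts = self._pending_header, []
        self._pending_header = None
        for line in self._lines:
            line = line.strip()
            if line.startswith(">"):
                # The next entry starts here; remember its header and stop.
                self._pending_header = line[1:]
                break
            seq_parts.append(line)
        return (name, "".join(seq_parts))

    def peek(self):
        """Return the next entry without consuming it."""
        if self._buffer is None:
            self._buffer = self._parse()
        return self._buffer

    def read(self):
        """Return the next entry and advance the stream."""
        entry = self.peek()
        self._buffer = None
        return entry

    def __iter__(self):
        return self

    def __next__(self):
        entry = self.read()
        if entry is None:
            raise StopIteration
        return entry

reader = MiniFASTAReader(io.StringIO(">seq1\nACGT\n>seq2\nCCCGGGAAA\n"))
first_peek = reader.peek()   # does not advance
first_read = reader.read()   # same entry as the peek, then advances
remaining = list(reader)     # iteration consumes the rest
```

Caching a single parsed entry is enough to implement peek(), because a forward-only stream never needs to look more than one entry ahead.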

Installation

pip install biodata

Basic usage

The examples below demonstrate the biodata package using the following FASTA file:

>seq1
ACGT
>seq2
CCCGGGAAA
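The usage snippets refer to a `filename` variable that the text never defines. One way to set it up, assuming the two-record example above is the intended input, is to write it to a temporary file. The file name example.fa is an arbitrary choice for this sketch.

```python
import os
import tempfile

# The two-record FASTA example shown above.
fasta_text = ">seq1\nACGT\n>seq2\nCCCGGGAAA\n"

# Write it to a fresh temporary directory; `filename` is the name the usage snippets expect.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "example.fa")
with open(filename, "w") as handle:
    handle.write(fasta_text)
```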

Read the first entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	f = fr.read()
	print(f.name, f.seq) # seq1 ACGT

Read entry by entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	for f in fr:
		print(f.name, f.seq)
# seq1 ACGT
# seq2 CCCGGGAAA

Read all entries at once

from biodata.fasta import FASTAReader
fasta_entries = FASTAReader.read_all(list, filename) # list of FASTA

seq_dict = FASTAReader.read_all(lambda fr: {f.name:f.seq for f in fr}, filename) 
# A dictionary with fasta name as key and fasta sequence as value
# {"seq1": "ACGT", "seq2": "CCCGGGAAA"}

# For genomic range data, one could also use GenomicCollection to store them:
from biodata.bed import BEDReader
from genomictools import GenomicCollection
beds = BEDReader.read_all(GenomicCollection, filename)
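The first argument of read_all is a callable that receives the stream of entries and decides how to collect them, which is why list, a dictionary-building lambda, and GenomicCollection can all be passed. The following standard-library sketch shows that pattern in isolation. The helpers iter_fasta and read_all here are hypothetical stand-ins written for this example, not biodata's code, and entries are plain (name, seq) tuples rather than FASTA objects.

```python
import io

def iter_fasta(stream):
    """Yield (name, seq) tuples from a FASTA text stream (illustrative helper)."""
    name, parts = None, []
    for line in stream:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)

def read_all(func, stream):
    """Hypothetical stand-in: hand the entry iterator to `func` and return its result."""
    return func(iter_fasta(stream))

text = ">seq1\nACGT\n>seq2\nCCCGGGAAA\n"
entries = read_all(list, io.StringIO(text))
seq_dict = read_all(lambda it: {name: seq for name, seq in it}, io.StringIO(text))
```

Because the collector only needs to accept an iterable, any container constructor with that signature can be used without changing the reader.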

Peek at an entry

from biodata.fasta import FASTAReader
with FASTAReader(filename) as fr:
	f = fr.peek() # Only peek the entry without proceeding to the next entry
	print(f.name, f.seq) # seq1 ACGT
	f = fr.read() # Read the entry and proceed to the next entry
	print(f.name, f.seq) # seq1 ACGT
	f = fr.read()
	print(f.name, f.seq) # seq2 CCCGGGAAA

Read entries from StringIO

# TextIOBase can be used as input
import io
from biodata.fasta import FASTAReader
FASTAReader.read_all(list, io.StringIO(">seq1\nACGT\n>seq2\nCCCGGGAAA\n"))

Read an indexed file

from biodata.fasta import FASTAIReader
from genomictools import GenomicPos, StrandedGenomicPos
from biodata.bed import BED

fir = FASTAIReader(filename, faifilename) # fai file can be created using 'samtools faidx filename'
f = fir[GenomicPos("seq2:1-4")] # Read from a region without strand
print(f.name, f.seq) # seq2:1-4 CCCG
f = fir[StrandedGenomicPos("seq2:1-4:-")] # Read from a region with strand
print(f.name, f.seq) # seq2:1-4:- CGGG
f = fir[BED("seq2", 0, 4, strand="-")] # Equivalent to StrandedGenomicPos but a BED entry is used
print(f.name, f.seq) # seq2:1-4:- CGGG
fir.close()
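The outputs in the comments above follow from two conventions: region strings such as seq2:1-4 are 1-based and inclusive, while BED intervals are 0-based and half-open, and a minus-strand request returns the reverse complement. A small standard-library sketch reproduces the same results for seq2. The helper names fetch and reverse_complement are made up for this illustration and are not part of biodata or genomictools.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def fetch(seq, start, end, strand="+"):
    """Extract a 1-based, inclusive region; reverse-complement it on the minus strand."""
    sub = seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub

seq2 = "CCCGGGAAA"
plus = fetch(seq2, 1, 4)        # region seq2:1-4
minus = fetch(seq2, 1, 4, "-")  # region seq2:1-4:-

# A BED interval with start 0 and end 4 covers the same bases as 1-4.
bed_start, bed_end = 0, 4
from_bed = fetch(seq2, bed_start + 1, bed_end, "-")
```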

Write entry by entry

from biodata.fasta import FASTA, FASTAWriter
with FASTAWriter(output_file) as fw:
	fw.write(FASTA("seq1", "ACGT"))
	fw.write(FASTA("seq2", "CCCGGGAAA"))

Write all entries at once

from biodata.fasta import FASTA, FASTAWriter
fasta_entries = [FASTA("seq1", "ACGT"), FASTA("seq2", "CCCGGGAAA")]
FASTAWriter.write_all(fasta_entries, output_file)

List of supported formats

Delimited - tsv, csv (biodata.delimited)
FASTA, FASTQ (biodata.fasta)
BED3, BED, BEDX, BEDGraph, BEDPE (biodata.bed)
bwa FastMap

Formats planned for future releases

GFF (biodata.gff)
VCF (biodata.vcf)
BigBed (biodata.bed)
BigWig (biodata.bigwig)

Extension of BaseReader

Users can extend the BaseReader and BaseWriter classes to support additional formats.

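# Assumption: BaseReader is importable from biodata.baseio; adjust the path if your installed version differs.
from biodata.baseio import BaseReader

class ExampleNode(object):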
	def __init__(self, value1, value2):
		self.value1 = value1
		self.value2 = value2

class ExampleNodeReader(BaseReader):
	def __init__(self, filename):
		super(ExampleNodeReader, self).__init__(filename)
	def _read(self):
		if self.line is None:
			return None
		words_array = self.line.split('\t')
		value1 = words_array[0]
		value2 = words_array[1]
		self.proceed_next_line()
		return ExampleNode(value1, value2)

filename = "SomeDocument.txt"
with ExampleNodeReader(filename) as er:
	for node in er:
		print(node.value1, node.value2)

Release history

0.1.0 - Jun 6, 2024
0.0.9 - May 26, 2024
0.0.8 - May 23, 2024
0.0.7 - May 7, 2024
0.0.6 - Jan 23, 2023
0.0.5 - Jan 21, 2023
0.0.4 - Jan 14, 2023 (this version)
0.0.3 - Jan 14, 2023
0.0.2 - Jan 3, 2023
0.0.1 - Nov 3, 2021

Download files

Source Distribution

biodata-0.0.4.tar.gz (12.8 kB)

Uploaded Jan 14, 2023 Source

Built Distribution

biodata-0.0.4-py3-none-any.whl (13.2 kB)

Uploaded Jan 14, 2023 Python 3

Hashes for biodata-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`9f5059797637e87f26f77aef1ac94cbdda27b6cc1e2b6380392a15a337745741`
MD5	`08aef4bd81c07dee9238144dc7c3504a`
BLAKE2b-256	`71d7b6273dbb2bcc6b9d23ffff937b01b85412111ab047f37f30749f4ec756c2`

Hashes for biodata-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`524817497787faa4052571ee4c7ed815cb38ddb466a9bc75a6c74000b2c61111`
MD5	`3f1a83d7a359e3790a1845f68316240e`
BLAKE2b-256	`e035a8f72c0764ae066a8d9f89392706f2a19f5992d7eed0d2159d963362cc84`