Lightweight IO and conversion for bioinformatics file formats.
Project description
💻 bioino
Command-line tools and Python API for interconverting FASTA, GFF, and CSV.
bioino currently converts tables to FASTA, and GFF to tables. Also provides a Python API for handling GFF and FASTA files, and converting to table files.
Warning: bioino is under active development, and not fully tested, so things may change, break, or simply not work.
Installation
The easy way
Install the pre-compiled version from PyPI:
pip install bioino
From source
Clone the repository, then cd
into it. Then run:
pip install -e .
Usage
Command line
Convert CSV or XLSX of sequences to a FASTA file. Info goes to stderr
, so you can pipe the output you
want to other tools or to a file.
$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n' | bioino table2fasta -n name -s seq -d data
🚀 Generating FASTA from tables with the following parameters:
subcommand: table2fasta
input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
format: TSV
sequence: seq
name: ['name']
description: ['data']
worksheet: Sheet 1
output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
func: <function _table2fasta at 0x7f4b48a43d30>
>Seq1 data=Some-info
AAAAA
⏰ Completed process in 0:00:00.025771
Convert GFF tables to TSV (or CSV).
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table 2> /dev/null
seqid source feature start end score strand phase ID attr1
test_seq test_source gene 1 10 . + . test01 +
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table -f CSV 2> /dev/nul
l
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+
Detailed usage
$ bioino --help
usage: bioino [-h] {gff2table,table2fasta} ...
Interconvert some bioinformatics file formats.
optional arguments:
-h, --help show this help message and exit
Sub-commands:
{gff2table,table2fasta}
Use these commands to specify the tool you want to use.
gff2table Convert a GFF to a TSV file.
table2fasta Convert a CSV or TSV of sequences to a FASTA file.
$ bioino gff2table --help
usage: bioino gff2table [-h] [--format {TSV,CSV}] [--metadata] [--output OUTPUT] [input]
positional arguments:
input Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".
optional arguments:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
File format. Default: "TSV".
--metadata, -m Write GFF header as commented lines.
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
$ bioino table2fasta --help
usage: bioino table2fasta [-h] [--format {TSV,CSV}] [--sequence SEQUENCE] --name [NAME [NAME ...]]
[--description [DESCRIPTION [DESCRIPTION ...]]] [--worksheet WORKSHEET] [--output OUTPUT]
[input]
positional arguments:
input Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".
optional arguments:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
File format. Default: "TSV".
--sequence SEQUENCE, -s SEQUENCE
Column to take sequence from. Default: "sequence".
--name [NAME [NAME ...]], -n [NAME [NAME ...]]
Column(s) to take sequence name from. Concatenates values with "_", replaces spaces with "-". Required.
--description [DESCRIPTION [DESCRIPTION ...]], -d [DESCRIPTION [DESCRIPTION ...]]
Column(s) to take sequence description from. Concatenates values with ";", replaces spaces with "_".
Default: don't use.
--worksheet WORKSHEET, -w WORKSHEET
For XLSX files, the worksheet to take the table from. Default: "Sheet 1".
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
Python API
FASTA
Read FASTA files (or strings) into iterators of named tuples.
>>> from bioino import FastaSequence, FastaCollection
>>> seq1 = FastaSequence("example", "This is a description", "ATCG")
>>> seq1
FastaSequence(name='example', description='This is a description', sequence='ATCG')
>>> seq2 = FastaSequence("example2", "This is another sequence", "GGGAAAA")
>>> fasta_stream = FastaCollection([seq1, seq2])
>>> fasta_stream
FastaCollection(sequences=[FastaSequence(name='example', description='This is a description', sequence='ATCG'), FastaSequence(name='example2', description='This is another sequence', sequence='GGGAAAA')])
These objects show as FASTA format when written, toptionally to a file.
>>> fasta_stream.write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
GFF
Makes an attempt to conform to GFF3 but makes no guarantees.
Similar to the FSAT utiities, GFF is read into an object.
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = ["##meta1 item1",
... "#meta2 item2 comment",
... '\t'.join("test_seq test_source gene 1 10 . + . ID=test01;attr1=+".split()),
... '\t'.join("test_seq test_source gene 9 100 . + . Parent=test01;attr2=+".split())]
>>> file = StringIO()
>>> for line in lines:
... print(line, file=file)
>>> gff = GffFile.from_file(file)
These render as GFF lines when printed.
>>> gff.write()
##meta1 item1
#meta2 item2 comment
test_seq test_source gene 1 10 . + . ID=test01;attr1=+
test_seq test_source gene 9 100 . + . Parent=test01;attr2=+
GFF lookup table
An iterable of GffLine
s can be converted into a lookup table mapping
chromosome location to feature annotations. Regions without annotation
are automatically filled with references to upstream or
downstream features.
Just create a GffFile
with lookup=True
, or use the _lookup_table()
method of an instantiated GffFile
.
There are currently some limitations:
- Currently only works for single-chromosome files.
- Only references parent features. Child features not yet indexed.
- Will not work for GFFs with a single parent feature.
- Ignores the following feature types: "region", :repeat_region"
Interconversion
GFFLine
s can be converted to dictionaries and vice versa.
>>> from bioino import GffLine
>>> d = dict(seqid='TEST', source='test', feature='gene', start=1, end=100, score='.', strand='+', phase='+')
>>> print(GffLine.from_dict(d))
TEST test gene 1 100 . + +
>>> d.update(dict(ID='test001', comment='This is a test'))
>>> GffLine.from_dict(d).write()
TEST test gene 1 100 . + + ID=test001;comment=This is a test
>>> from io import StringIO
>>> from bioino import GffFile
>>> file = StringIO()
>>> lines = ["TEST test gene 1 100 . + + ID=test001;comment=Test".split(),
... "TEST2 test2 gene 101 200 . + + ID=test002;comment=Test2".split()]
>>> for line in lines:
... print('\t'.join(line), file=file)
>>> list(GffFile.from_file(file).as_dict())
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'}, {'seqid': 'TEST2', 'source': 'test2', 'feature': 'gene', 'start': 101, 'end': 200, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test002', 'comment': 'Test2'}]
And Pandas DataFrames can be converted to FASTA.
>>> import pandas as pd
>>> df = pd.DataFrame(dict(seq=['atcg', 'aaaa'],
... title=['seq1', 'seq2'],
... info=['SeqA', 'SeqB'],
... score=[1, 2]))
>>> df
seq title info score
0 atcg seq1 SeqA 1
1 aaaa seq2 SeqB 2
>>> FastaCollection.from_pandas(df, sequence='seq',
... names=['title'],
... descriptions=['info', 'score']).write()
>seq1 info=SeqA;score=1
atcg
>seq2 info=SeqB;score=2
aaaa
Suggestions, issues, fixes
File an issue here.
Documentation
Check the API here.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file bioino-0.0.2.post1.tar.gz
.
File metadata
- Download URL: bioino-0.0.2.post1.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.18
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 63bdbed2f2cb1c8defd93c25387e86e9acba0dead51a994fef8dc4e2da060eb5 |
|
MD5 | 338e618311aa8abe1c2986326801df97 |
|
BLAKE2b-256 | e4c894bea1143b0d5bf4fe0f309f43eb744fb87ac1058c36b824a15fcedff846 |
File details
Details for the file bioino-0.0.2.post1-py3-none-any.whl
.
File metadata
- Download URL: bioino-0.0.2.post1-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.18
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 50e737582822f27be90006518930a35f0109ca6d2f9a4ec03148d97dd6c07797 |
|
MD5 | f90a26f356df72b10d07a949b2f87162 |
|
BLAKE2b-256 | de91b71747acd7cf215dd2f0240515f07d6f429896e8ca5213cb76b1d130beef |