Lightweight IO and conversion for bioinformatics file formats.
💻 bioino
Command-line tools and Python API for interconverting FASTA, GFF, and CSV.
bioino converts tables to FASTA, and GFF to tables. It also provides a Python API for reading, writing, and querying GFF and FASTA files.
Installation
The easy way
pip install bioino
From source
Clone the repository, then cd into it and run:
pip install -e .
Usage
Command line
Log messages go to stderr, so stdout can be piped freely.
gff2table
Convert a GFF file to TSV (default) or CSV.
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' \
| bioino gff2table 2>/dev/null
seqid source feature start end score strand phase ID attr1
test_seq test_source gene 1 10 . + . test01 +
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' \
| bioino gff2table -f CSV 2>/dev/null
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+
Pass --metadata / -m to include the GFF header as commented lines in the output.
table2fasta
Convert a CSV or TSV table of sequences to FASTA.
$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n' \
| bioino table2fasta -n name -s seq -d data 2>/dev/null
>Seq1 data=Some-info
AAAAA
Multiple --name columns are concatenated with _; multiple --description columns
are formatted as key=value pairs separated by ;.
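The naming rules above can be sketched in plain Python. This is an illustration only, not bioino's implementation: `fasta_header` and `table_to_fasta` are hypothetical helpers, and the space replacement follows the wording of the `--name` and `--description` help text shown under Detailed usage.

```python
import csv
import sys
from io import StringIO


def fasta_header(row, names, descriptions=()):
    """Build a FASTA header line from one table row (a dict). Hypothetical helper."""
    # Name columns: joined with "_", spaces replaced with "-".
    name = "_".join(str(row[col]) for col in names).replace(" ", "-")
    # Description columns: "key=value" pairs joined with ";", spaces replaced with "_".
    desc = ";".join(f"{col}={row[col]}" for col in descriptions).replace(" ", "_")
    return f">{name} {desc}" if desc else f">{name}"


def table_to_fasta(handle, sequence, names, descriptions=(), delimiter="\t", out=None):
    """Write one FASTA record per table row. Hypothetical helper."""
    if out is None:
        out = sys.stdout
    for row in csv.DictReader(handle, delimiter=delimiter):
        out.write(fasta_header(row, names, descriptions) + "\n")
        out.write(row[sequence] + "\n")


table = StringIO("name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n")
table_to_fasta(table, sequence="seq", names=["name"], descriptions=["data"])
```

Run on the same input as the command above, the sketch writes the same two lines, `>Seq1 data=Some-info` and `AAAAA`.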
Detailed usage
usage: bioino [-h] [--version] {gff2table,table2fasta} ...
Interconvert some bioinformatics file formats.
options:
-h, --help show this help message and exit
--version, -v show program's version number and exit
Sub-commands:
{gff2table,table2fasta}
gff2table Convert a GFF to a TSV file.
table2fasta Convert a CSV or TSV of sequences to a FASTA file.
usage: bioino gff2table [-h] [--format {TSV,CSV}] [--metadata]
[--output OUTPUT]
[input]
positional arguments:
input Input file in GFF format. Default: stdin.
options:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
Output format. Default: "TSV".
--metadata, -m Write GFF header as commented lines.
--output OUTPUT, -o OUTPUT
Output file. Default: stdout.
usage: bioino table2fasta [-h] [--format {TSV,CSV}] [--sequence SEQUENCE]
--name [NAME ...] [--description [DESCRIPTION ...]]
[--worksheet WORKSHEET] [--output OUTPUT]
[input]
positional arguments:
input Input table file (TSV, CSV, or XLSX). Default: stdin.
options:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
Input format. Default: "TSV".
--sequence SEQUENCE, -s SEQUENCE
Column to take sequence from. Default: "sequence".
--name [NAME ...], -n [NAME ...]
Column(s) for sequence name. Concatenated with "_",
spaces replaced with "-". Required.
--description [DESCRIPTION ...], -d [DESCRIPTION ...]
Column(s) for sequence description. Formatted as
"key=value" pairs separated by ";", spaces replaced
with "_". Default: omitted.
--worksheet WORKSHEET, -w WORKSHEET
For XLSX files, the worksheet to read. Default: "Sheet 1".
--output OUTPUT, -o OUTPUT
Output file. Default: stdout.
Python API
FASTA
FastaSequence is a dataclass holding a sequence name, description, and sequence string.
FastaCollection wraps an iterable of FastaSequence objects.
>>> from bioino import FastaSequence, FastaCollection
>>> seq1 = FastaSequence("example", "This is a description", "ATCG")
>>> seq2 = FastaSequence("example2", "This is another sequence", "GGGAAAA")
>>> FastaCollection([seq1, seq2]).write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
Read from a file handle or filename with FastaCollection.from_file():
>>> from io import StringIO
>>> buf = StringIO()
>>> FastaCollection([seq1, seq2]).write(buf)
>>> buf.seek(0)
0
>>> FastaCollection.from_file(buf).write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
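For readers curious what reading FASTA involves, here is a minimal sketch of a streaming parser. It is not bioino's code: `Record` and `parse_fasta` are hypothetical names, and the sketch assumes the first space in a header separates the name from the description.

```python
from dataclasses import dataclass
from io import StringIO
from typing import Iterator, TextIO


@dataclass
class Record:
    # Hypothetical stand-in for a FASTA record; not bioino's FastaSequence.
    name: str
    description: str
    sequence: str


def parse_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield one Record per '>' header, joining wrapped sequence lines."""
    name, description, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield Record(name, description, "".join(chunks))
            # Text before the first space is the name; the rest is the description.
            name, _, description = line[1:].partition(" ")
            chunks = []
        elif line and name is not None:
            chunks.append(line.strip())
    # Emit the final record, which has no following header to trigger it.
    if name is not None:
        yield Record(name, description, "".join(chunks))


text = ">example This is a description\nAT\nCG\n>example2\nGGGAAAA\n"
records = list(parse_fasta(StringIO(text)))
```

Because the function is a generator, records are produced one at a time and a large file never has to be held in memory.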
Build a FastaCollection from a pandas DataFrame with FastaCollection.from_pandas().
The columns listed in names are concatenated with name_sep (default _); the columns
listed in descriptions are formatted as key=value pairs separated by desc_sep (default ;).
>>> import pandas as pd
>>> from bioino import FastaCollection
>>> df = pd.DataFrame(dict(
... seq=['atcg', 'aaaa'],
... title=['seq1', 'seq2'],
... info=['SeqA', 'SeqB'],
... score=[1, 2],
... ))
>>> FastaCollection.from_pandas(df, sequence='seq',
... names=['title'],
... descriptions=['info', 'score']).write()
>seq1 info=SeqA;score=1
atcg
>seq2 info=SeqB;score=2
aaaa
>>> FastaCollection.from_pandas(df, sequence='seq',
... names=['title', 'info'],
... descriptions=['score']).write()
>seq1_SeqA score=1
atcg
>seq2_SeqB score=2
aaaa
GFF
The GFF reader and writer attempt to conform to GFF3, but conformance is not guaranteed.
Reading and writing
GffFile.from_file() accepts a file handle or filename and returns a GffFile that
streams records lazily.
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = [
... "##meta1 item1",
... "#meta2 item2 comment",
... "\t".join("test_seq test_source gene 1 10 . + . ID=test01;attr1=+".split()),
... "\t".join("test_seq test_source gene 9 100 . + . Parent=test01;attr2=+".split()),
... ]
>>> gff = GffFile.from_file(StringIO("\n".join(lines)))
>>> gff.write()
##meta1 item1
#meta2 item2 comment
test_seq test_source gene 1 10 . + . ID=test01;attr1=+
test_seq test_source gene 9 100 . + . Parent=test01;attr2=+
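Lazy streaming of this kind can be sketched with a generator. The following is an assumption-laden illustration, not how GffFile is implemented: `read_gff` is a hypothetical function that reads only the leading `#` lines eagerly and leaves the records to be pulled on demand.

```python
from io import StringIO
from typing import Iterator, List, TextIO, Tuple


def read_gff(handle: TextIO) -> Tuple[List[str], Iterator[List[str]]]:
    """Return (header_lines, lazy iterator over tab-split records). Hypothetical helper."""
    header = []
    first_record = None
    # Consume only the leading '#' lines up front.
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            header.append(line)
        elif line:
            first_record = line
            break

    def records():
        if first_record is not None:
            yield first_record.split("\t")
        # The handle resumes where the header loop stopped.
        for line in handle:
            line = line.rstrip("\n")
            if line and not line.startswith("#"):
                yield line.split("\t")

    return header, records()


text = "\n".join([
    "##meta1 item1",
    "#meta2 item2 comment",
    "test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+",
    "test_seq\ttest_source\tgene\t9\t100\t.\t+\t.\tParent=test01;attr2=+",
])
header, recs = read_gff(StringIO(text))
```

Nothing past the header is parsed until `recs` is iterated, which is the property that keeps memory use flat on large annotation files.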
Converting to table
GffFile.to_csv() writes a flat table with one row per GFF line, columns for the eight
standard GFF fields plus all unique attribute keys. Use sep='\t' for TSV output.
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = [
... "\t".join("TEST test gene 1 100 . + + ID=test001;comment=Test".split()),
... "\t".join("TEST test gene 121 120 . + - ID=test001;tag=test_tag".split()),
... ]
>>> GffFile.from_file(StringIO("\n".join(lines))).to_csv()
seqid,source,feature,start,end,score,strand,phase,ID,comment,tag
TEST,test,gene,1,100,.,+,+,test001,Test,
TEST,test,gene,121,120,.,+,-,test001,,test_tag
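The flattening step can be approximated with the standard library alone. This sketch is not bioino's implementation; `parse_gff_line` and `gff_to_table` are hypothetical helpers, and the sketch assumes attribute columns appear in order of first occurrence, as in the output above.

```python
import csv
import sys
from io import StringIO

GFF_COLUMNS = ["seqid", "source", "feature", "start", "end", "score", "strand", "phase"]


def parse_gff_line(line):
    """Split one GFF record into a flat dict of columns plus attributes."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(GFF_COLUMNS, fields[:8]))
    if len(fields) > 8 and fields[8] not in ("", "."):
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                row[key] = value
    return row


def gff_to_table(lines, out=None, sep=","):
    """Write one row per record; attribute keys become extra columns."""
    if out is None:
        out = sys.stdout
    # All rows are collected first because the full set of attribute keys
    # must be known before the header can be written.
    rows = [parse_gff_line(l) for l in lines if l.strip() and not l.startswith("#")]
    extra = []
    for row in rows:
        for key in row:
            if key not in GFF_COLUMNS and key not in extra:
                extra.append(key)
    writer = csv.DictWriter(out, fieldnames=GFF_COLUMNS + extra,
                            delimiter=sep, lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)


lines = [
    "TEST\ttest\tgene\t1\t100\t.\t+\t+\tID=test001;comment=Test",
    "TEST\ttest\tgene\t121\t120\t.\t+\t-\tID=test001;tag=test_tag",
]
gff_to_table(lines)
```

Records that lack an attribute get an empty cell for that column, which is why the `comment` and `tag` columns are each blank in one row.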
Interconversion
GffLine.from_dict() constructs a GffLine from a dictionary. Keys matching the standard
GFF column names (seqid, source, feature, start, end, score, strand, phase)
populate the columns; all other keys become attributes.
>>> from bioino import GffLine
>>> d = dict(seqid='TEST', source='test', feature='gene',
... start=1, end=100, score='.', strand='+', phase='+')
>>> print(GffLine.from_dict(d))
TEST test gene 1 100 . + +
>>> d.update(dict(ID='test001', comment='This is a test'))
>>> GffLine.from_dict(d).write()
TEST test gene 1 100 . + + ID=test001;comment=This is a test
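The split between columns and attributes can be illustrated without bioino. The helper below, `dict_to_gff_line`, is hypothetical and only mirrors the rule described above; it does not percent-encode reserved characters, which GFF3 requires for values containing characters such as `;` or `=`.

```python
GFF_COLUMNS = ("seqid", "source", "feature", "start", "end", "score", "strand", "phase")


def dict_to_gff_line(record):
    """Format a flat dict as one tab-separated GFF record. Hypothetical helper."""
    columns = [str(record[name]) for name in GFF_COLUMNS]
    # Every key that is not a standard column is treated as an attribute.
    attributes = ";".join(
        f"{key}={value}" for key, value in record.items() if key not in GFF_COLUMNS
    )
    if attributes:
        columns.append(attributes)
    return "\t".join(columns)


d = dict(seqid="TEST", source="test", feature="gene",
         start=1, end=100, score=".", strand="+", phase="+",
         ID="test001", comment="This is a test")
line = dict_to_gff_line(d)
```

Attribute order follows the insertion order of the dictionary, so `ID` precedes `comment` here.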
GffFile.as_dict() yields each line as a flat dictionary:
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = [
... "TEST\ttest\tgene\t1\t100\t.\t+\t+\tID=test001;comment=Test",
... "TEST2\ttest2\tgene\t101\t200\t.\t+\t+\tID=test002;comment=Test2",
... ]
>>> list(GffFile.from_file(StringIO("\n".join(lines))).as_dict())
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100,
'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'},
{'seqid': 'TEST2', 'source': 'test2', 'feature': 'gene', 'start': 101, 'end': 200,
'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test002', 'comment': 'Test2'}]
Positional lookup
GffFile can build a per-chromosome interval index for fast positional annotation queries.
Pass lookup=True to GffFile.from_file().
>>> from io import StringIO
>>> from bioino import GffFile
>>> lines = [
... "\t".join(["chr1", "src", "gene", "10", "50", ".", "+", ".", "ID=g1;Name=geneA"]),
... "\t".join(["chr1", "src", "gene", "100", "150", ".", "+", ".", "ID=g2;Name=geneB"]),
... "\t".join(["chr2", "src", "gene", "20", "80", ".", "-", ".", "ID=g3;Name=geneC"]),
... ]
>>> gff = GffFile.from_file(StringIO("\n".join(lines)), lookup=True)
Query with lookup_at(seqid, pos), which returns a tuple of GffLine objects covering
that position. Each returned line has locus_tag and offset attributes computed for
that exact position.
# Gene body — offset from annotated start (+ strand) or end (- strand)
>>> r = gff.lookup_at('chr1', 30)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('geneA', 20)
# Intergenic — first half of gap attributed to upstream gene
>>> r = gff.lookup_at('chr1', 75)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_down-geneA', 65)
# Intergenic — second half of gap attributed to downstream gene
>>> r = gff.lookup_at('chr1', 76)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_up-geneB', 24)
# Up to 1000 bp past the last annotated feature is covered
>>> r = gff.lookup_at('chr1', 200)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_down-geneB', 100)
# Each chromosome is indexed independently
>>> r = gff.lookup_at('chr2', 50)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('geneC', 30)
# Returns an empty tuple for unknown seqids or positions outside all intervals
>>> gff.lookup_at('chrX', 50)
()
The lookup index:
- handles multi-chromosome GFFs
- only indexes parent features (Name attribute present, no Parent attribute)
- ignores feature types region and repeat_region
- stores references to the original GffLine objects; offsets are computed on demand
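The midpoint rule seen in the lookup_at examples can be reproduced with a sorted list and a binary search. This is a sketch under stated assumptions, not bioino's index: `Feature`, `annotate`, `lookup`, and `FLANK` are hypothetical names, features on one chromosome are assumed sorted and non-overlapping, and intergenic labels follow the plus-strand examples only. Behaviour before the first feature and for minus-strand neighbours is not documented here, so the sketch does not attempt it.

```python
from bisect import bisect_right
from typing import NamedTuple


class Feature(NamedTuple):
    start: int
    end: int
    strand: str
    name: str


FLANK = 1000  # assumed from the "up to 1000 bp past the last feature" example


def annotate(features, pos):
    """Return (locus_tag, offset) for pos, or None if nothing covers it."""
    starts = [f.start for f in features]
    i = bisect_right(starts, pos) - 1  # last feature starting at or before pos
    if i < 0:
        return None  # upstream of the first feature: not covered by this sketch
    left = features[i]
    if pos <= left.end:
        # Gene body: offset from start (+ strand) or from end (- strand).
        offset = pos - left.start if left.strand == "+" else left.end - pos
        return left.name, offset
    if i + 1 < len(features):
        right = features[i + 1]
        midpoint = (left.end + right.start) // 2
        if pos <= midpoint:
            return f"_down-{left.name}", pos - left.start
        return f"_up-{right.name}", right.start - pos
    if pos - left.end <= FLANK:
        return f"_down-{left.name}", pos - left.start
    return None


def lookup(index, seqid, pos):
    """Each chromosome has its own sorted feature list."""
    features = index.get(seqid)
    return annotate(features, pos) if features else None


index = {
    "chr1": [Feature(10, 50, "+", "geneA"), Feature(100, 150, "+", "geneB")],
    "chr2": [Feature(20, 80, "-", "geneC")],
}
result = lookup(index, "chr1", 75)  # ('_down-geneA', 65)
```

With the gap between geneA (ending at 50) and geneB (starting at 100), the midpoint is 75, so position 75 is attributed to geneA and position 76 to geneB, matching the examples above.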
Suggestions, issues, fixes
File an issue here.
Documentation
API reference at bioino.readthedocs.org.