Lightweight IO and conversion for bioinformatics file formats.

💻 bioino

Command-line tools and Python API for interconverting FASTA, GFF, and CSV.

bioino converts tables to FASTA, and GFF to tables. It also provides a Python API for reading, writing, and querying GFF and FASTA files.

Installation

The easy way

pip install bioino

From source

Clone the repository, then cd into it and run:

pip install -e .

Usage

Command line

Informational messages are written to stderr, so stdout can be piped freely into other tools.

gff2table

Convert a GFF file to TSV (default) or CSV.

$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' \
    | bioino gff2table 2>/dev/null
seqid   source  feature start   end     score   strand  phase   ID      attr1
test_seq        test_source     gene    1       10      .       +       .       test01  +
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' \
    | bioino gff2table -f CSV 2>/dev/null
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+

Pass --metadata / -m to include the GFF header as commented lines in the output.

table2fasta

Convert a CSV or TSV table of sequences to FASTA.

$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n' \
    | bioino table2fasta -n name -s seq -d data 2>/dev/null
>Seq1 data=Some-info
AAAAA

Multiple --name columns are concatenated with _; multiple --description columns are formatted as key=value pairs separated by ;.
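
The header rules above can be sketched in plain Python. This is only an illustration of the documented formatting (names joined with "_" and spaces replaced with "-"; descriptions as key=value pairs joined with ";" and spaces replaced with "_"), not bioino's actual implementation. The helper name fasta_header and the example row are hypothetical.

```python
def fasta_header(row, names, descriptions=()):
    """Build a FASTA header line from a table row (hypothetical helper)."""
    # Name: selected columns joined with "_", spaces replaced with "-"
    name = "_".join(str(row[col]) for col in names).replace(" ", "-")
    # Description: "key=value" pairs joined with ";", spaces replaced with "_"
    desc = ";".join(f"{col}={row[col]}" for col in descriptions).replace(" ", "_")
    return f">{name} {desc}" if desc else f">{name}"


# Made-up row, for illustration only
row = {"name": "Seq 1", "strain": "K12", "data": "Some info"}
print(fasta_header(row, names=["name", "strain"], descriptions=["data"]))
# prints: >Seq-1_K12 data=Some_info
```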

Detailed usage

usage: bioino [-h] [--version] {gff2table,table2fasta} ...

Interconvert some bioinformatics file formats.

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Sub-commands:
  {gff2table,table2fasta}
    gff2table           Convert a GFF to a TSV file.
    table2fasta         Convert a CSV or TSV of sequences to a FASTA file.

usage: bioino gff2table [-h] [--format {TSV,CSV}] [--metadata]
                        [--output OUTPUT]
                        [input]

positional arguments:
  input                 Input file in GFF format. Default: stdin.

options:
  -h, --help            show this help message and exit
  --format {TSV,CSV}, -f {TSV,CSV}
                        Output format. Default: "TSV".
  --metadata, -m        Write GFF header as commented lines.
  --output OUTPUT, -o OUTPUT
                        Output file. Default: stdout.

usage: bioino table2fasta [-h] [--format {TSV,CSV}] [--sequence SEQUENCE]
                          --name [NAME ...] [--description [DESCRIPTION ...]]
                          [--worksheet WORKSHEET] [--output OUTPUT]
                          [input]

positional arguments:
  input                 Input table file (TSV, CSV, or XLSX). Default: stdin.

options:
  -h, --help            show this help message and exit
  --format {TSV,CSV}, -f {TSV,CSV}
                        Input format. Default: "TSV".
  --sequence SEQUENCE, -s SEQUENCE
                        Column to take sequence from. Default: "sequence".
  --name [NAME ...], -n [NAME ...]
                        Column(s) for sequence name. Concatenated with "_",
                        spaces replaced with "-". Required.
  --description [DESCRIPTION ...], -d [DESCRIPTION ...]
                        Column(s) for sequence description. Formatted as
                        "key=value" pairs separated by ";", spaces replaced
                        with "_". Default: omitted.
  --worksheet WORKSHEET, -w WORKSHEET
                        For XLSX files, the worksheet to read. Default: "Sheet 1".
  --output OUTPUT, -o OUTPUT
                        Output file. Default: stdout.

Python API

FASTA

FastaSequence is a dataclass holding a sequence name, description, and sequence string. FastaCollection wraps an iterable of FastaSequence objects.

>>> from bioino import FastaSequence, FastaCollection

>>> seq1 = FastaSequence("example", "This is a description", "ATCG")
>>> seq2 = FastaSequence("example2", "This is another sequence", "GGGAAAA")
>>> FastaCollection([seq1, seq2]).write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA

Read from a file handle or filename with FastaCollection.from_file():

>>> from io import StringIO
>>> buf = StringIO()
>>> FastaCollection([seq1, seq2]).write(buf)
>>> buf.seek(0)
0
>>> FastaCollection.from_file(buf).write()
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
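
For readers new to the format, the round trip above relies on the usual FASTA layout: a header line starting with ">" that holds the name and an optional description, followed by one or more sequence lines. A minimal parser sketch, independent of bioino, is shown below. parse_fasta is a hypothetical name, and ending the name at the first space in the header is an assumption consistent with the output shown above.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (name, description, sequence) tuples from a FASTA file handle."""
    name, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield name, description, "".join(chunks)
            # Assumption: the name ends at the first space in the header
            name, _, description = line[1:].partition(" ")
            chunks = []
        else:
            # Sequence may be wrapped over several lines
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)


text = (
    ">example This is a description\nATCG\n"
    ">example2 This is another sequence\nGGG\nAAAA\n"
)
records = list(parse_fasta(StringIO(text)))
for record in records:
    print(record)
```

Because the function is a generator, records are produced one at a time and large files need not be held in memory.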

Build a FastaCollection from a Pandas DataFrame with FastaCollection.from_pandas(). Columns listed in names are concatenated with name_sep (default _); columns listed in descriptions are formatted as key=value pairs separated by desc_sep (default ;).

>>> import pandas as pd
>>> from bioino import FastaCollection

>>> df = pd.DataFrame(dict(
...     seq=['atcg', 'aaaa'],
...     title=['seq1', 'seq2'],
...     info=['SeqA', 'SeqB'],
...     score=[1, 2],
... ))
>>> FastaCollection.from_pandas(df, sequence='seq',
...                             names=['title'],
...                             descriptions=['info', 'score']).write()
>seq1 info=SeqA;score=1
atcg
>seq2 info=SeqB;score=2
aaaa
>>> FastaCollection.from_pandas(df, sequence='seq',
...                             names=['title', 'info'],
...                             descriptions=['score']).write()
>seq1_SeqA score=1
atcg
>seq2_SeqB score=2
aaaa

GFF

The GFF tools attempt to conform to GFF3, but conformance is not guaranteed.

Reading and writing

GffFile.from_file() accepts a file handle or filename and returns a GffFile that streams records lazily.

>>> from io import StringIO
>>> from bioino import GffFile

>>> lines = [
...     "##meta1 item1",
...     "#meta2  item2  comment",
...     "\t".join("test_seq test_source gene 1 10 . + . ID=test01;attr1=+".split()),
...     "\t".join("test_seq test_source gene 9 100 . + . Parent=test01;attr2=+".split()),
... ]
>>> gff = GffFile.from_file(StringIO("\n".join(lines)))
>>> gff.write()
##meta1 item1
#meta2  item2  comment
test_seq    test_source     gene    1       10      .       +       .       ID=test01;attr1=+
test_seq    test_source     gene    9       100     .       +       .       Parent=test01;attr2=+

Converting to table

GffFile.to_csv() writes a flat table with one row per GFF line. The columns are the eight standard GFF fields followed by every unique attribute key; a row is left empty where it lacks that attribute. Use sep='\t' for TSV output.

>>> from io import StringIO
>>> from bioino import GffFile

>>> lines = [
...     "\t".join("TEST test gene 1 100 . + + ID=test001;comment=Test".split()),
...     "\t".join("TEST test gene 121 120 . + - ID=test001;tag=test_tag".split()),
... ]
>>> GffFile.from_file(StringIO("\n".join(lines))).to_csv()
seqid,source,feature,start,end,score,strand,phase,ID,comment,tag
TEST,test,gene,1,100,.,+,+,test001,Test,
TEST,test,gene,121,120,.,+,-,test001,,test_tag
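
The flattening described above can be sketched with the standard library alone. This is an illustration of the idea, not bioino's implementation. The names gff_line_to_row and gff_to_table are hypothetical, and listing attribute columns in first-seen order is an assumption.

```python
import csv
from io import StringIO

GFF_COLUMNS = ["seqid", "source", "feature", "start", "end",
               "score", "strand", "phase"]


def gff_line_to_row(line):
    """Split one GFF line into a flat dict of columns and attributes."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(GFF_COLUMNS, fields[:8]))
    if len(fields) > 8:
        # Ninth column: "key=value" pairs separated by ";"
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                row[key] = value
    return row


def gff_to_table(lines, out, sep=","):
    """Write GFF lines as a delimited table with one column per attribute key."""
    rows = [gff_line_to_row(line) for line in lines
            if line.strip() and not line.startswith("#")]
    # Header: the eight standard columns, then attribute keys as first seen
    header = list(GFF_COLUMNS)
    for row in rows:
        for key in row:
            if key not in header:
                header.append(key)
    writer = csv.DictWriter(out, fieldnames=header, delimiter=sep,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up records, for illustration only
lines = [
    "TEST\ttest\tgene\t1\t100\t.\t+\t.\tID=test001;comment=Test",
    "TEST\ttest\tgene\t121\t220\t.\t+\t.\tID=test002;tag=test_tag",
]
buf = StringIO()
gff_to_table(lines, out=buf)
print(buf.getvalue(), end="")
```

The sketch collects all rows before writing, because the full set of attribute keys is only known once every line has been seen.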

Interconversion

GffLine.from_dict() constructs a GffLine from a dictionary. Keys matching the standard GFF column names (seqid, source, feature, start, end, score, strand, phase) populate the columns; all other keys become attributes.

>>> from bioino import GffLine

>>> d = dict(seqid='TEST', source='test', feature='gene',
...          start=1, end=100, score='.', strand='+', phase='+')
>>> print(GffLine.from_dict(d))
TEST    test    gene    1       100     .       +       +

>>> d.update(dict(ID='test001', comment='This is a test'))
>>> GffLine.from_dict(d).write()
TEST    test    gene    1       100     .       +       +       ID=test001;comment=This is a test
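
The split between standard columns and attributes can be pictured with a small standalone sketch. It does not use bioino. The names split_gff_dict and format_gff_line are hypothetical, and filling a missing column with "." is an assumption rather than documented behaviour.

```python
GFF_COLUMNS = ("seqid", "source", "feature", "start", "end",
               "score", "strand", "phase")


def split_gff_dict(d):
    """Separate standard GFF columns from free-form attributes."""
    columns = {key: d[key] for key in GFF_COLUMNS if key in d}
    attributes = {key: value for key, value in d.items()
                  if key not in GFF_COLUMNS}
    return columns, attributes


def format_gff_line(d):
    """Render a flat dict as one tab-separated GFF line."""
    columns, attributes = split_gff_dict(d)
    # Assumption: a missing standard column is written as "."
    fields = [str(columns.get(key, ".")) for key in GFF_COLUMNS]
    if attributes:
        fields.append(";".join(f"{k}={v}" for k, v in attributes.items()))
    return "\t".join(fields)


# Made-up record, for illustration only
d = dict(seqid="TEST", source="test", feature="gene", start=1, end=100,
         score=".", strand="+", phase=".", ID="test001", comment="Test")
columns, attributes = split_gff_dict(d)
print(attributes)
print(format_gff_line(d))
```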

GffFile.as_dict() yields each line as a flat dictionary:

>>> from io import StringIO
>>> from bioino import GffFile

>>> lines = [
...     "TEST\ttest\tgene\t1\t100\t.\t+\t+\tID=test001;comment=Test",
...     "TEST2\ttest2\tgene\t101\t200\t.\t+\t+\tID=test002;comment=Test2",
... ]
>>> list(GffFile.from_file(StringIO("\n".join(lines))).as_dict())
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100,
  'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'},
 {'seqid': 'TEST2', 'source': 'test2', 'feature': 'gene', 'start': 101, 'end': 200,
  'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test002', 'comment': 'Test2'}]

Positional lookup

GffFile can build a per-chromosome interval index for fast positional annotation queries. Pass lookup=True to GffFile.from_file().

>>> from io import StringIO
>>> from bioino import GffFile

>>> lines = [
...     "\t".join(["chr1", "src", "gene", "10",  "50",  ".", "+", ".", "ID=g1;Name=geneA"]),
...     "\t".join(["chr1", "src", "gene", "100", "150", ".", "+", ".", "ID=g2;Name=geneB"]),
...     "\t".join(["chr2", "src", "gene", "20",  "80",  ".", "-", ".", "ID=g3;Name=geneC"]),
... ]
>>> gff = GffFile.from_file(StringIO("\n".join(lines)), lookup=True)

Query with lookup_at(seqid, pos), which returns a tuple of GffLine objects covering that position. Each returned line has locus_tag and offset attributes computed for that exact position.

# Gene body — offset from annotated start (+ strand) or end (- strand)
>>> r = gff.lookup_at('chr1', 30)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('geneA', 20)

# Intergenic — first half of gap attributed to upstream gene
>>> r = gff.lookup_at('chr1', 75)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_down-geneA', 65)

# Intergenic — second half of gap attributed to downstream gene
>>> r = gff.lookup_at('chr1', 76)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_up-geneB', 24)

# Up to 1000 bp past the last annotated feature is covered
>>> r = gff.lookup_at('chr1', 200)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('_down-geneB', 100)

# Each chromosome is indexed independently
>>> r = gff.lookup_at('chr2', 50)
>>> r[0].attributes['locus_tag'], r[0].attributes['offset']
('geneC', 30)

# Returns an empty tuple for unknown seqids or positions outside all intervals
>>> gff.lookup_at('chrX', 50)
()

The lookup index:

  • handles multi-chromosome GFFs
  • only indexes parent features (Name attribute present, no Parent attribute)
  • ignores feature types region and repeat_region
  • stores references to the original GffLine objects; offsets are computed on demand
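
The midpoint rule shown in the queries above can be sketched without bioino. The code below is one way to reproduce the documented results for sorted, non-overlapping genes; it is not the library's implementation. The names build_index and lookup_at are hypothetical, and the sketch returns a single (locus_tag, offset) pair or None rather than a tuple of GffLine objects. Two points are assumptions: intergenic offsets are measured from the annotated start regardless of strand, and positions before the first gene are treated as uncovered.

```python
from bisect import bisect_right
from collections import defaultdict

FLANK = 1000  # documented coverage past the last annotated feature


def build_index(genes):
    """Group (seqid, start, end, strand, name) tuples by seqid, sorted by start."""
    index = defaultdict(list)
    for seqid, start, end, strand, name in genes:
        index[seqid].append((start, end, strand, name))
    for features in index.values():
        features.sort()
    return index


def lookup_at(index, seqid, pos):
    """Return (locus_tag, offset) for a position, or None if uncovered."""
    features = index.get(seqid)
    if not features:
        return None
    starts = [feature[0] for feature in features]
    i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
    if i < 0:
        return None  # assumption: nothing covers positions before the first gene
    start, end, strand, name = features[i]
    if pos <= end:
        # Gene body: offset from start (+ strand) or end (- strand)
        return name, (pos - start if strand == "+" else end - pos)
    if i + 1 < len(features):
        next_start, _, _, next_name = features[i + 1]
        # First half of the gap belongs to the gene on the left
        if pos <= (end + next_start) // 2:
            return f"_down-{name}", pos - start
        return f"_up-{next_name}", next_start - pos
    if pos <= end + FLANK:
        return f"_down-{name}", pos - start
    return None


index = build_index([
    ("chr1", 10, 50, "+", "geneA"),
    ("chr1", 100, 150, "+", "geneB"),
    ("chr2", 20, 80, "-", "geneC"),
])
for query in [("chr1", 30), ("chr1", 75), ("chr1", 76),
              ("chr1", 200), ("chr2", 50), ("chrX", 50)]:
    print(query, lookup_at(index, *query))
```

Keeping each chromosome's genes sorted by start lets a binary search find the nearest gene on the left, so a query costs roughly logarithmic time in the number of genes on that chromosome (this sketch rebuilds the list of starts on every call for brevity).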

Suggestions, issues, fixes

File an issue here.

Documentation

API reference at bioino.readthedocs.org.
