
Lightweight IO and conversion for bioinformatics file formats.


💻 bioino


Command-line tools and Python API for interconverting FASTA, GFF, and CSV.

bioino currently converts tables to FASTA and GFF to tables. It also provides a Python API for handling GFF and FASTA files and for converting them to tables.

Warning: bioino is under active development, and not fully tested, so things may change, break, or simply not work.

Installation

The easy way

Install the pre-compiled version from PyPI:

pip install bioino

From source

Clone the repository, cd into it, then run:

pip install -e .

Usage

Command line

Convert a table of sequences (CSV, TSV, or XLSX) to a FASTA file. Log messages go to stderr, so you can pipe the FASTA output to other tools or to a file.

$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n' | bioino table2fasta -n name -s seq -d data
🚀 Generating FASTA from tables with the following parameters:
        subcommand: table2fasta
        input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
        format: TSV
        sequence: seq
        name: ['name']
        description: ['data']
        worksheet: Sheet 1
        output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
        func: <function _table2fasta at 0x7f4b48a43d30>
>Seq1 data=Some-info
AAAAA
⏰ Completed process in 0:00:00.025771

Convert GFF tables to TSV (or CSV).

$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table 2> /dev/null
seqid   source  feature start   end     score   strand  phase   ID      attr1
test_seq        test_source     gene    1       10      .       +       .       test01  +

$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | bioino gff2table -f CSV 2> /dev/null
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+
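
To make the GFF-to-table mapping above concrete, here is a minimal standard-library sketch of how one GFF line can be flattened into a row, with each attribute from the ninth column becoming its own column. It is an illustration only, not bioino's implementation; the function name `gff_line_to_row` is hypothetical.

```python
# Illustrative sketch only; gff_line_to_row is a hypothetical helper, not part of bioino.
GFF_COLUMNS = ["seqid", "source", "feature", "start", "end", "score", "strand", "phase"]

def gff_line_to_row(line):
    """Split one tab-delimited GFF feature line into a flat dict of strings."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(GFF_COLUMNS, fields[:8]))
    # The ninth column holds semicolon-separated key=value attributes.
    if len(fields) > 8:
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                row[key] = value
    return row

row = gff_line_to_row("test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n")
print("\t".join(row))           # header line
print("\t".join(row.values()))  # data line
```

Unlike the typed values in the Python API examples, this sketch keeps every value as a string.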

Detailed usage

$ bioino --help
usage: bioino [-h] {gff2table,table2fasta} ...

Interconvert some bioinformatics file formats.

optional arguments:
  -h, --help            show this help message and exit

Sub-commands:
  {gff2table,table2fasta}
                        Use these commands to specify the tool you want to use.
    gff2table           Convert a GFF to a TSV file.
    table2fasta         Convert a CSV or TSV of sequences to a FASTA file.
$ bioino gff2table --help
usage: bioino gff2table [-h] [--format {TSV,CSV}] [--metadata] [--output OUTPUT] [input]

positional arguments:
  input                 Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".

optional arguments:
  -h, --help            show this help message and exit
  --format {TSV,CSV}, -f {TSV,CSV}
                        File format. Default: "TSV".
  --metadata, -m        Write GFF header as commented lines.
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
$ bioino table2fasta --help
usage: bioino table2fasta [-h] [--format {TSV,CSV}] [--sequence SEQUENCE] --name [NAME [NAME ...]]
                          [--description [DESCRIPTION [DESCRIPTION ...]]] [--worksheet WORKSHEET] [--output OUTPUT]
                          [input]

positional arguments:
  input                 Input file in GFF format. Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".

optional arguments:
  -h, --help            show this help message and exit
  --format {TSV,CSV}, -f {TSV,CSV}
                        File format. Default: "TSV".
  --sequence SEQUENCE, -s SEQUENCE
                        Column to take sequence from. Default: "sequence".
  --name [NAME [NAME ...]], -n [NAME [NAME ...]]
                        Column(s) to take sequence name from. Concatenates values with "_", replaces spaces with "-". Required.
  --description [DESCRIPTION [DESCRIPTION ...]], -d [DESCRIPTION [DESCRIPTION ...]]
                        Column(s) to take sequence description from. Concatenates values with ";", replaces spaces with "_".
                        Default: don't use.
  --worksheet WORKSHEET, -w WORKSHEET
                        For XLSX files, the worksheet to take the table from. Default: "Sheet 1".
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
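
The naming rules described in the `--name` and `--description` help text can be sketched with the standard library alone. This is an assumption-laden illustration, not bioino's code: the helper `row_to_fasta` is hypothetical, and the `column=value` form of the description is inferred from the example output earlier in this section.

```python
# Illustrative sketch only; row_to_fasta is a hypothetical helper, not part of bioino.
import csv
import io

def row_to_fasta(row, sequence, names, descriptions=()):
    """Format one table row as a two-line FASTA record."""
    # Name: join the column values with "_" and replace spaces with "-".
    name = "_".join(str(row[col]) for col in names).replace(" ", "-")
    # Description: "column=value" pairs joined with ";", spaces replaced with "_".
    description = ";".join(f"{col}={row[col]}" for col in descriptions).replace(" ", "_")
    header = f">{name} {description}".rstrip()
    return f"{header}\n{row[sequence]}"

table = io.StringIO("name\tseq\tdata\nSeq1\tAAAAA\tSome-info\n")
records = [row_to_fasta(row, "seq", ["name"], ["data"])
           for row in csv.DictReader(table, delimiter="\t")]
print("\n".join(records))
```

For the sample table this prints the same two FASTA lines as the `table2fasta` example above.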

Python API

FASTA

Read FASTA files (or strings) into iterators of named tuples.

>>> from bioino import FastaSequence, FastaCollection

>>> seq1 = FastaSequence("example", "This is a description", "ATCG")
>>> seq1
FastaSequence(name='example', description='This is a description', sequence='ATCG')
>>> seq2 = FastaSequence("example2", "This is another sequence", "GGGAAAA")
>>> fasta_stream = FastaCollection([seq1, seq2])
>>> fasta_stream
FastaCollection(sequences=[FastaSequence(name='example', description='This is a description', sequence='ATCG'), FastaSequence(name='example2', description='This is another sequence', sequence='GGGAAAA')])

These objects render in FASTA format when written, optionally to a file.

>>> fasta_stream.write()  
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
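
For readers unfamiliar with the format, the reading step mentioned at the start of this section can be pictured with a small standard-library parser that yields named tuples. This is a sketch under stated assumptions, not bioino's parser: `Record`, `parse_fasta`, and `make_record` are hypothetical names, and the split of the header into name and description at the first space is an assumption based on the output shown above.

```python
# Illustrative sketch only; these names are hypothetical and not part of bioino.
from collections import namedtuple

Record = namedtuple("Record", ["name", "description", "sequence"])

def make_record(header, chunks):
    """Build a Record from a header (without '>') and its sequence lines."""
    name, _, description = header.partition(" ")
    return Record(name, description, "".join(chunks))

def parse_fasta(lines):
    """Yield one Record per FASTA entry from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield make_record(header, chunks)

text = ">example This is a description\nATCG\n>example2 This is another sequence\nGGG\nAAAA\n"
for record in parse_fasta(text.splitlines()):
    print(record)
```

Sequence lines that wrap across several lines are joined into a single string per entry.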

GFF

The GFF tools attempt to conform to GFF3, but make no guarantees.

Similar to the FASTA utilities, GFF is read into an object.

>>> from io import StringIO
>>> from bioino import GffFile

>>> lines = ["##meta1 item1", 
...          "#meta2  item2  comment", 
...          '\t'.join("test_seq    test_source gene    1   10  .   +   .   ID=test01;attr1=+".split()),
...          '\t'.join("test_seq    test_source gene    9   100  .   +   .   Parent=test01;attr2=+".split())]
>>> file = StringIO()
>>> for line in lines:
...     print(line, file=file)
>>> gff = GffFile.from_file(file)

These render as GFF lines when printed.

>>> gff.write()  
##meta1 item1
#meta2  item2  comment
test_seq   test_source     gene    1       10      .       +       .       ID=test01;attr1=+
test_seq   test_source     gene    9       100     .       +       .       Parent=test01;attr2=+

GFF lookup table

An iterable of GffLines can be converted into a lookup table mapping chromosome location to feature annotations. Regions without annotation are automatically filled with references to upstream or downstream features.

Just create a GffFile with lookup=True, or use the _lookup_table() method of an instantiated GffFile.

There are currently some limitations:

  • Currently only works for single-chromosome files.
  • Only references parent features. Child features not yet indexed.
  • Will not work for GFFs with a single parent feature.
  • Ignores the following feature types: "region", "repeat_region".
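
The idea of a position lookup with gap filling can be sketched as follows. This is not bioino's data structure or API. It is a hypothetical, standard-library illustration that assumes non-overlapping parent features on a single chromosome, and it treats "upstream" and "downstream" purely by coordinate, ignoring strand.

```python
# Illustrative sketch only; build_lookup is hypothetical and not part of bioino.
from bisect import bisect_right

def build_lookup(features):
    """Return a function mapping a position to its feature or its neighbours.

    features: iterable of (start, end, feature_id), 1-based inclusive,
    assumed non-overlapping and on one chromosome.
    """
    features = sorted(features)
    starts = [start for start, _, _ in features]

    def lookup(position):
        # Index of the last feature starting at or before this position.
        i = bisect_right(starts, position) - 1
        if i >= 0 and position <= features[i][1]:
            return {"feature": features[i][2]}
        # Unannotated region: refer to the neighbouring features by coordinate.
        upstream = features[i][2] if i >= 0 else None
        downstream = features[i + 1][2] if i + 1 < len(features) else None
        return {"upstream": upstream, "downstream": downstream}

    return lookup

lookup = build_lookup([(1, 10, "gene01"), (50, 100, "gene02")])
print(lookup(5))    # inside gene01
print(lookup(30))   # between gene01 and gene02
print(lookup(200))  # past the last feature
```

Binary search keeps each query logarithmic in the number of features, at the cost of requiring the non-overlap assumption stated above.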

Interconversion

GffLine objects can be converted to dictionaries and vice versa.

>>> from bioino import GffLine

>>> d = dict(seqid='TEST', source='test', feature='gene', start=1, end=100, score='.', strand='+', phase='+')
>>> print(GffLine.from_dict(d))
TEST        test    gene    1       100     .       +       +
>>> d.update(dict(ID='test001', comment='This is a test'))
>>> GffLine.from_dict(d).write() 
TEST    test    gene    1       100     .       +       +       ID=test001;comment=This is a test
>>> from io import StringIO
>>> from bioino import GffFile

>>> file = StringIO()
>>> lines = ["TEST    test    gene    1       100     .       +       +  ID=test001;comment=Test".split(),
...          "TEST2    test2    gene    101       200     .       +       +  ID=test002;comment=Test2".split()]
>>> for line in lines:
...     print('\t'.join(line), file=file)
>>> list(GffFile.from_file(file).as_dict())  
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'}, {'seqid': 'TEST2', 'source': 'test2', 'feature': 'gene', 'start': 101, 'end': 200, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test002', 'comment': 'Test2'}]
         

pandas DataFrames can also be converted to FASTA.

>>> import pandas as pd

>>> df = pd.DataFrame(dict(seq=['atcg', 'aaaa'], 
...                  title=['seq1', 'seq2'], 
...                  info=['SeqA', 'SeqB'], 
...                  score=[1, 2]))
>>> df 
        seq title  info  score
0  atcg  seq1  SeqA      1
1  aaaa  seq2  SeqB      2
>>> FastaCollection.from_pandas(df, sequence='seq', 
...                             names=['title'], 
...                             descriptions=['info', 'score']).write() 
>seq1 info=SeqA;score=1
atcg
>seq2 info=SeqB;score=2
aaaa

Suggestions, issues, fixes

File an issue here.

Documentation

Check the API here.
