Lightweight IO and conversion for bioinformatics file formats.
Project description
💻 bioino
Interconverting FASTA, GFF, and CSV.
Currently converts tables to FASTA, and GFF to tables. Also provides a Python API for handling GFF and FASTA files, and converting to table files.
Warning: Bioin is under active development, and not fully tested, so things may change, break, or simply not work.
Installation
The easy way
Install the pre-compiled version from PyPI:
pip install bioin
From source
Clone the repository, then cd
into it. Then run:
pip install -e .
Usage
Command line
Convert CSV or XLSX of sequences to a FASTA file (info header goes to stdout
).
$ printf 'name\tseq\tdata\nSeq1\tAAAAA\tSome info\n' | table2fasta -n name -s seq -d data
Generating FASTA from tables with the following parameters:
input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sequence: seq
name: ['name']
description: ['data']
worksheet: Sheet 1
output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
WARNING: file extension "" is not recognised. Defaulting to .tsv
>Seq1 data=Some_info
AAAAA
Convert GFF tables to TSV (or CSV).
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | gff2table
Converting GFF with the following parameters:
input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
format: TSV
metadata: False
output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
seqid source feature start end score strand phase ID attr1
test_seq test_source gene 1 10 . + . test01 +
$ printf 'test_seq\ttest_source\tgene\t1\t10\t.\t+\t.\tID=test01;attr1=+\n' | gff2table -f CSV
Converting GFF with the following parameters:
input: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
format: CSV
metadata: False
output: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
seqid,source,feature,start,end,score,strand,phase,ID,attr1
test_seq,test_source,gene,1,10,.,+,.,test01,+
Detailed usage
$ table2fasta --help
usage: table2fasta [-h] [--sequence SEQUENCE] --name [NAME ...] [--description [DESCRIPTION ...]] [--worksheet WORKSHEET] [--output OUTPUT] [input]
Convert a CSV or XLSX of sequences to a FASTA file.
positional arguments:
input Input file in TSV, CSV, or XLSX format. Default STDIN.
options:
-h, --help show this help message and exit
--sequence SEQUENCE, -s SEQUENCE
Column to take sequence from. Default "sequence"
--name [NAME ...], -n [NAME ...]
Column(s) to take sequence name from. Concatenates values with "_", replaces spaces with "-". Required.
--description [DESCRIPTION ...], -d [DESCRIPTION ...]
Column(s) to take sequence description from. Concatenates values with ";", replaces spaces with "_". Default: don't use.
--worksheet WORKSHEET, -w WORKSHEET
For XLSX files, the worksheet to take the table from. Default "Sheet 1"
--output OUTPUT, -o OUTPUT
Output file. Default STDOUT
$ gff2table --help
usage: gff2table [-h] [--format {TSV,CSV}] [--metadata] [--output OUTPUT] [input]
Convert a GFF to a TSV file.
positional arguments:
input Input file in GFF format. Default STDIN.
options:
-h, --help show this help message and exit
--format {TSV,CSV}, -f {TSV,CSV}
File format. Default TSV
--metadata, -m Write GFF header as commented lines.
--output OUTPUT, -o OUTPUT
Output file. Default STDOUT
Python API
FASTA
Read FASTA files (or strings) into iterators of named tuples.
>>> list(read_fasta('''
... >example This is a description
... ATCG
... '''))
[FastaSequence(name='example', description='This is a description', sequence='ATCG')]
These named tuples show as FASTA format when printed.
>>> s = FastaSequence("example", "This is a description", "ATCG")
>>> print(s)
>example This is a description
ATCG
And can be written to a file (default stdout
):
>>> fasta_stream = [FastaSequence("example", "This is a description",
... "ATCG"),
... FastaSequence("example2", "This is another sequence",
... "GGGAAAA")]
>>> write_fasta(fasta_stream)
>example This is a description
ATCG
>example2 This is another sequence
GGGAAAA
GFF
Makes an attempt to conform to GFF3 but makes no guarantees.
Similar to the FSAT utiities, GFF is read into named tuples.
>>> list(read_gff('''
... ##meta1 item1
... #meta2 item2 comment
... test_seq test_source gene 1 10 . + . ID=test01;attr1=+
... test_seq test_source gene 9 100 . + . Parent=test01;attr2=+
... '''))
[GffLine(metadata=[GffMetadatum(name='meta1 item1', flag='constrained', values=()), GffMetadatum(name='meta2 item2 comment', flag='free', values=())], columns=GffColumns(seqid='test_seq', source='test_source', feature='gene', start=1, end=10, score='.', strand='+', phase='.'), attributes={'ID': 'test01', 'attr1': '+'}), GffLine(metadata=[GffMetadatum(name='meta1 item1', flag='constrained', values=()), GffMetadatum(name='meta2 item2 comment', flag='free', values=())], columns=GffColumns(seqid='test_seq', source='test_source', feature='gene', start=9, end=100, score='.', strand='+', phase='.'), attributes={'Parent': 'test01', 'attr2': '+'})]
These render as GFF lines when printed.
>>> metadata = [("meta1", "constrained", {"item1": []}),
... ("meta2", "free", {"item2": ["comment"]})]
>>> seqid, source_id = "test_seq", "test_source"
>>> print(GffLine(metadata,
... GffColumns(seqid, source_id, "gene", 1, 10, ".", "+", "."),
... {"ID": "test01", "attr1": "+"}))
test_seq test_source gene 1 10 . + . ID=test01;attr1=+
GFF lookup table
An iterable of GFFLines can be converted into a lookup table mapping chromosome location to feature annotations. Regions without annotation are automatically filled with references to upstream or downstream features.
Just use the lookup_table
function on an iterable of GffLine
objects,
such as that produced by read_gff
.
But there are currently some limitations:
- Currently only works for single-chromosome files.
- Only references parent features. Child features not yet indexed.
- Will not work for GFFs with a single parent feature.
- Ignores the following feature types: "region", :repeat_region"
Interconversion
GFFLines can be converted to dictionaries and vice versa.
>>> d = dict(seqid='TEST', source='test', feature='gene', start=1, end=100, score='.', strand='+', phase='+')
>>> print(dict2gff(d))
TEST test gene 1 100 . + +
>>> print(dict2gff(d | dict(ID='test001', comment='This is a test')))
TEST test gene 1 100 . + + ID=test001;comment=This is a test
>>> line = "TEST test gene 1 100 . + + ID=test001;comment=Test"
>>> list(gff2dict(read_gff(line)))
[{'seqid': 'TEST', 'source': 'test', 'feature': 'gene', 'start': 1, 'end': 100, 'score': '.', 'strand': '+', 'phase': '+', 'ID': 'test001', 'comment': 'Test'}]
And Pandas daatFrames can be converted to FASTA.
>>> import pandas as pd
>>> df = pd.DataFrame(dict(seq=['atcg', 'aaaa'], title=['seq1', 'seq2'], info=['Seq1', 'Seq1'], score=[1, 2]))
>>> df
seq title info score
0 atcg seq1 Seq1 1
1 aaaa seq2 Seq1 2
>>> list(df2fasta(df, sequence='seq', names=['title'], descriptions=['info', 'score']))
[FastaSequence(name='seq1', description='info=Seq1;score=1', sequence='atcg'), FastaSequence(name='seq2', description='info=Seq1;score=2', sequence='aaaa')]
Documentation
Check the API here.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for bioino-0.0.1.post1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8eb44d17f057d6f047b6bc9207fb34238242fe5e91d7abbba40631dd1b2ecef0 |
|
MD5 | 7dda16472b510cd86d81bbbc9064e45f |
|
BLAKE2b-256 | d25807047b00d9b9057a6255a6ca57324579b01b6001b89a1f644c44ecc84955 |