Note: this is a work in progress; we are still figuring out the best interface and other elements.

A collection of simple Python scripts relevant to bioinformatics and other data analysis. Many are alternatives to high-performance Unix tools, but they are much simpler to use and produce simpler, more useful output.

pip install biobox

This will install the following scripts:

  • tabulate: an alternative to the sort | uniq -c | sort -rn pattern.
  • common: an alternative to the comm tool.

These alternatives typically run much faster than the original Unix tools, at the cost of using more memory. In many common use cases the additional memory use will not be noticeable.

tabulate

A replacement for the sort | uniq -c | sort -rn pattern. This pattern is very handy but typically requires a lot of busy work to cut and prepare the file just the right way.

tabulate is meant to solve this. It offers functionality that is otherwise fairly difficult to specify at the command line. Individually each feature may look like a small convenience, but together they massively reduce command-line bloat.

Features:

  1. Can operate on individual columns of the input file.
  2. Tabulate by more than one column at the same time.
  3. Access columns by index or by header name.
  4. Operate seamlessly on gzipped or csv input (detects extension).
  5. Handles all types of line endings (Windows, old Mac, Unix).
  6. Strips whitespace from the values.
  7. Ignores commented lines at the start of the file.
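The input handling described in the list above can be illustrated with a short Python sketch. This is not the code used by tabulate; the names open_text and read_values are hypothetical, and the sketch assumes that comment lines start with "#", that columns are numbered from 1, and that empty values are skipped by default.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzipped file as text, choosing by file extension.

    newline=None turns on universal newlines, so Windows (\r\n),
    old Mac (\r) and Unix (\n) line endings all arrive as \n.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline=None)
    return open(path, "r", newline=None)


def read_values(lines, column=1, delimiter="\t"):
    """Yield the stripped value of one 1-based column.

    Comment lines are ignored only while they appear at the start
    of the input (an assumption based on the feature list).
    """
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            continue
        in_header = False
        fields = line.rstrip("\n").split(delimiter)
        if column <= len(fields):
            value = fields[column - 1].strip()
            if value:
                yield value


# Hypothetical in-memory input with the same content as the A.txt example.
sample = io.StringIO("#\n# This is file A\n#\n1\n2\n1\n")
print(list(read_values(sample)))  # ['1', '2', '1']
```

Relying on the text layer for line-ending translation keeps the parsing loop free of special cases.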

It is also fast: it can process about 3 million entries per second.

For example, if the file tests/A.txt contains:

#
# This is file A
#
1
2
1

Then a simple sort | uniq -c | sort -rn replacement:

tabulate tests/A.txt

Produces:

Data    Count   Percent
1       2       66.6667
2       1       33.3333
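The counting behind this output can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's implementation; the function name tabulate_values is hypothetical, and the four-decimal percentage format is inferred from the output shown above.

```python
from collections import Counter


def tabulate_values(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, 100.0 * n / total) for v, n in counts.most_common()]


# The three data lines of A.txt, with the comment lines already removed.
rows = tabulate_values(["1", "2", "1"])

print("Data\tCount\tPercent")
for value, count, percent in rows:
    print(f"{value}\t{count}\t{percent:.4f}")
# Data    Count   Percent
# 1       2       66.6667
# 2       1       33.3333
```

Holding all counts in a dictionary is what trades memory for speed: no sorting of the full input is needed, only of the distinct values.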

More advanced examples.

How many features per chromosome:

tabulate tests/saccharomyces_cerevisiae.gff.gz -c 1

Produces:

Col1    Count   Percent
chrIV   2839    12.3124
chrVII  2053    8.9036
chrXII  2049    8.8863
chrXV   2007    8.7041
...

How many types of features per chromosome:

tabulate tests/saccharomyces_cerevisiae.gff.gz -c 1,3

Produces:

Col1    Col3    Count   Percent
chrIV   CDS     896     3.8859
chrIV   mRNA    836     3.6256
chrIV   gene    836     3.6256
chrXV   CDS     624     2.7062
chrVII  CDS     620     2.6889
chrXII  CDS     615     2.6672
chrXV   mRNA    597     2.5891
...
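Tabulating by more than one column amounts to counting tuples of values instead of single values. The following Python sketch shows the idea under stated assumptions: the function name tabulate_columns and the sample rows are hypothetical, and columns are numbered from 1 as in the -c option. As in the output above, each percentage is taken relative to the total number of rows.

```python
from collections import Counter


def tabulate_columns(rows, columns):
    """Count each combination of the given 1-based columns."""
    counts = Counter(
        tuple(row[c - 1].strip() for c in columns) for row in rows
    )
    total = sum(counts.values())
    return [
        key + (n, round(100.0 * n / total, 4))
        for key, n in counts.most_common()
    ]


# Made-up rows for illustration only (not taken from the GFF file).
rows = [
    ["chrI", "src", "gene"],
    ["chrI", "src", "CDS"],
    ["chrI", "src", "CDS"],
    ["chrII", "src", "gene"],
]

for record in tabulate_columns(rows, columns=(1, 3)):
    print("\t".join(str(field) for field in record))
# chrI    CDS     2       50.0
# chrI    gene    1       25.0
# chrII   gene    1       25.0
```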

Type tabulate -h for instructions:

usage: tabulate [-h] [-c NUM] [-d CHAR] [-s NUM] [-H] [-C] [-e] [--csv]
                [--gzip]
                [file]

Replacement for the 'sort | uniq -c | sort -rn' code pattern. Reads from stdin
and writes to stdout. Much faster than the original. Can produce tab or csv
delimited output. Produces a sane output.

positional arguments:
  file        input file [-]

optional arguments:
  -h, --help  show this help message and exit
  -c NUM      column(s) to process (index or name)
  -d CHAR     input file column delimiter [TAB]
  -s NUM      how many input lines to skip [0]
  -H          do not print header
  -C          do not drop comments lines
  -e          also tabulate lines where the column value is empty
  --csv       produce the output in CSV format
  --gzip      input is in gzip format (needed for stdin only)

common

An alternative to the comm tool. Important differences:

  • Does not require the inputs to be sorted.
  • Strips whitespace from elements.
  • Can work directly on a column of a file.
  • Identical items within a file are collapsed into a single entry.

For example, if file A is:

1
2
3

and file B is:

2
4
2

Shared (intersect) between A and B:

common A B
2

Union of A and B:

common A B -u
1
2
3
4

Unique to A:

common A B -a
1
3

Unique to B:

common A B -b
4
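The four operations above correspond to ordinary set operations, which is also why the inputs need not be sorted and duplicates collapse. The Python sketch below illustrates this; it is not the tool's code, the name read_items is hypothetical, and it drops empty lines and sorts the results for display as simplifying assumptions.

```python
def read_items(lines):
    """Collect stripped, non-empty lines into a set (duplicates collapse)."""
    return {line.strip() for line in lines if line.strip()}


a = read_items(["1\n", "2\n", "3\n"])  # file A
b = read_items(["2\n", "4\n", "2\n"])  # file B, note the repeated 2

print(sorted(a & b))  # shared:      ['2']
print(sorted(a | b))  # union:       ['1', '2', '3', '4']
print(sorted(a - b))  # unique to A: ['1', '3']
print(sorted(b - a))  # unique to B: ['4']
```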

Type common -h for instructions:

usage: common [-h] [-s NUM] [-i] [-u] [-a] [-b] [-x] [-C] files files

An alternative for the 'comm' tool. Produces elements that are common or
unique to each file.

positional arguments:
  files            input file A and file B, use - for stdin

optional arguments:
  -h, --help       show this help message and exit
  -s NUM           how many input lines to skip [0]
  -i, --intersect  elements that are common in both files (default)
  -u, --union      elements that appear in both files
  -a, --fileA      elements unique to file A
  -b, --fileB      elements unique to file B
  -x               discard empty lines
  -C               keep comment lines

Release History

  • 2016.106 (this version)
  • 2016.105
  • 2016.104
  • 2016.103
  • 2016.102
  • 2016.101
  • 2016.100
Download Files

  • biobox-2016.106.tar.gz (7.5 kB), file type: Source, uploaded Apr 26, 2016
