Counts file library and conversion scripts.
Counts file library
The Python library cflib provides scripts to convert between fasta, VCF and
counts files. Counts files are used by PoMo, an implementation of a
polymorphism-aware phylogenetic model. We advise you to use the PoMo
implementation in IQ-TREE.
For a reference, please see and cite:
Schrempf, D., Minh, B. Q., De Maio, N., von Haeseler, A., & Kosiol, C. (2016). Reversible Polymorphism-Aware Phylogenetic Models and their Application to Tree Inference. Journal of Theoretical Biology, in press.
Before installation, please check that you have Python (version 3.x)
installed. cflib also uses additional Python libraries that need to be
installed separately.
To install cflib, clone the repository:
git clone git://github.com/pomo-dev/cflib
This will create a folder cflib, which includes the library and the
conversion scripts. In this folder, run:
easy_install --user .
If the standard Python version of your operating system is still 2.x (e.g.,
OSX), make sure that you use the installer that belongs to your Python 3
installation. The --user flag is optional and tells Python to install the
scripts only for the current user, not system-wide.
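If you are unsure which major version a given interpreter provides, a small check along the following lines can help. This is only a sketch; `is_python3` is a hypothetical helper name and not part of cflib.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple belongs to Python 3.x."""
    return version_info[0] == 3


if __name__ == "__main__":
    # Run this file with the interpreter you plan to use for the installation.
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_python3():
        print("Python " + found + " is suitable for cflib.")
    else:
        print("Python " + found + " found, but cflib needs Python 3.x.")
```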
If you want to uninstall cflib, run:
pip uninstall cflib
Sample data can be found in examples. Assuming that you have installed cflib,
we will now convert example.fasta to a counts file named
example_from_fasta.cf. The script that we will use is called
FastaToCounts.py. First, we have a look at the help message:
usage: FastaToCounts.py [-h] [-v] [--iupac] fastaFile output

Convert fasta to counts format. The (aligned) sequences in the fasta file are
read in and the data is written to a counts format file. Sequence names are
stripped at the first dash. If the stripped sequence name coincide, individuals
are put into the same population. E.g., homo_sapiens-XXX and homo_sapiens-YYY
will be in the same population homo_sapiens. Take care with large files, this
uses a lot of memory. The input as well as the output files can additionally be
gzipped (indicated by a .gz file ending). If heterozygotes are encoded with
IUPAC codes (e.g., 'r' for A or G), homozygotes need to be counted twice so
that the level of polymorphism stays correct. This can be done with the
`--iupac` flag.

positional arguments:
  fastaFile      path to (gzipped) fasta file
  output         name of (gzipped) outputfile in counts format

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  turn on verbosity (-v or -vv)
  --iupac        heteorzygotes are encoded with IUPAC codes
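The counting rule behind the `--iupac` flag can be illustrated with a minimal Python sketch. It is not the code used by cflib; the function `count_site` and the table `IUPAC_HETEROZYGOTES` are hypothetical names, and the sketch assumes diploid individuals and the standard two-base IUPAC codes.

```python
# Standard IUPAC codes for two-base ambiguities, used here for heterozygotes.
IUPAC_HETEROZYGOTES = {
    "r": "ag", "y": "ct", "s": "cg",
    "w": "at", "k": "gt", "m": "ac",
}


def count_site(bases, iupac=False):
    """Count a, c, g and t at one alignment column of one population.

    With iupac=True every individual is treated as diploid: a homozygote
    adds two copies of its base, a heterozygote adds one copy of each base.
    Characters that are not recognized (gaps, 'n', ...) are skipped.
    """
    counts = {"a": 0, "c": 0, "g": 0, "t": 0}
    for base in bases:
        base = base.lower()
        if base in counts:
            counts[base] += 2 if iupac else 1
        elif iupac and base in IUPAC_HETEROZYGOTES:
            for allele in IUPAC_HETEROZYGOTES[base]:
                counts[allele] += 1
    return counts


# One homozygous A individual and one A/G heterozygote.
print(count_site(["a", "r"], iupac=True))
# prints {'a': 3, 'c': 0, 'g': 1, 't': 0}
```

Counting the homozygote only once would give two copies of A against one of G from a single heterozygote's perspective being diluted differently, which is why the help message asks for homozygotes to be counted twice.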
As required, the sequence names in example.fasta are of the form Sheep-2,
and so on. The following command converts the file example.fasta into the
counts file example_from_fasta.cf:
FastaToCounts.py example.fasta example_from_fasta.cf
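The way sequence names are assigned to populations (names are stripped at the first dash, as described in the help message) can be sketched in a few lines of Python. This is an illustration under that stated rule, not the code used by cflib; `population_of` and `group_by_population` are hypothetical names.

```python
from collections import defaultdict


def population_of(sequence_name):
    """Return the part of the sequence name before the first dash."""
    return sequence_name.split("-", 1)[0]


def group_by_population(sequence_names):
    """Map each population name to the sequence names that belong to it."""
    populations = defaultdict(list)
    for name in sequence_names:
        populations[population_of(name)].append(name)
    return dict(populations)


names = ["homo_sapiens-XXX", "homo_sapiens-YYY", "Sheep-2"]
print(group_by_population(names))
# prints {'homo_sapiens': ['homo_sapiens-XXX', 'homo_sapiens-YYY'], 'Sheep': ['Sheep-2']}
```

Only the first dash matters in this sketch, so a name without a dash forms its own population.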
All conversion scripts can be found in the scripts folder:
- CountsToFasta.py: Convert a counts file to a fasta file.
- FastaToCounts.py: Convert a fasta file to counts format.
- FastaToVCF.py: Convert a fasta file to variant call format.
- FastaVCFToCounts.py: Convert a fasta reference with VCF files to counts format.
- FilterMSA.py: Filter a multiple sequence alignment file (apply standard filters).
- GPToCounts.py: Experimental. Convert gene prediction files with reference to counts format.
- MSAToCounts.py: Convert multiple sequence alignments with VCF files to counts format.
Each script comes with its own documentation. To read it, run a script with the --help flag, e.g., FastaToCounts.py --help.
If you are interested in
cflib itself, please refer to the
cflib reference manual.