CLI toolkit for phylogenetic trees and distance matrices from VCF and FASTA

fastreeR: Fast Tree Reconstruction Tools for Genomics

fastreeR is a hybrid toolkit that combines a high-performance Java backend (BioInfoJava-Utils, a modular Java library for bioinformatics pipelines) with user-friendly interfaces for several platforms and environments, so that it can be integrated into a variety of genomic workflows. It computes distance matrices and phylogenetic trees from genetic variant data in VCF format or genomic sequences in FASTA format.

Integration and Accessibility

fastreeR is accessible in the following ways:

✅ Bioconda: install with conda install -c bioconda fastreer
✅ Docker: available on DockerHub and GHCR for containerized execution
✅ PyPI: install with pip install fastreer
✅ Python CLI: through a lightweight Python wrapper that calls the Java backend via subprocess
✅ R / Bioconductor: via rJava
✅ Pure Java API: developers can integrate this library directly in Java-based pipelines or software.

Key Features
Requirements
Installation and Usage
Distances from VCF
CLI Interface
Integration with Java Backend
Integration with R
Sample data
Citation
Author
License

Key Features

⚡ Ultra-fast computation of sample-wise cosine distances from large VCF and D2S k-mer based distances from FASTA files.
🌳 Generate agglomerative neighbor-joining phylogenetic trees directly from VCF or distance matrices.
🧵 Multithreaded execution for speed and scalability.
Cluster distance matrices hierarchically with dynamic tree pruning.
🧰 Clean Python CLI for scripting and pipeline integration
Streamlined integration with R via rJava
🧬 Compatible with standard bioinformatics formats (PHYLIP, Newick)

Requirements

Java 8+
Python 3.6+
Maven (if you want to build from the source)
GNU/Linux, Windows or macOS

Memory requirements for VCF input

At minimum, allocate 48 bytes of JVM heap per variant per sample: with n samples and m variants, allocate at least 48 x n x m bytes of RAM. For example, to process a VCF file with 1 million variants and 1 thousand samples, allocate at least 48 x 10^6 x 10^3 = 48 x 10^9 bytes = 48GB of RAM. For best performance, allocate more than the minimum; this triggers fewer garbage collections and hence fewer pauses.
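The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 48 x n x m rule; the function names are hypothetical and are not part of fastreeR.

```python
import math

# Figure quoted in the text: 48 bytes per variant per sample.
BYTES_PER_GENOTYPE = 48


def min_heap_bytes(n_samples: int, n_variants: int) -> int:
    """Minimum JVM heap in bytes according to the 48 x n x m rule."""
    return BYTES_PER_GENOTYPE * n_samples * n_variants


def min_mem_gb(n_samples: int, n_variants: int) -> int:
    """Minimum heap rounded up to whole GB (10^9 bytes), never below 1.

    The JVM reads the G suffix as 1024^3 bytes, so passing this value
    to -Xmx<value>G gives slightly more than the decimal figure.
    """
    return max(1, math.ceil(min_heap_bytes(n_samples, n_variants) / 10**9))


if __name__ == "__main__":
    # The worked example from the text: 1 thousand samples, 1 million variants.
    print(min_mem_gb(1_000, 1_000_000))
```

The result is a lower bound; as noted above, allocating more than the minimum reduces garbage-collection pauses.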

To allocate RAM, a parameter must be passed when the JVM initializes. JVM parameters can be passed by setting the java.parameters option. The -Xmx parameter, followed (without a space) by an integer and a unit letter, sets the maximum amount of heap RAM that the JVM may use. The letter (uppercase or lowercase) indicates the unit. For example, -Xmx1024m, -Xmx1024M, -Xmx1g and -Xmx1G all allow the JVM a maximum of 1 Gigabyte (1024 Megabytes) of RAM.
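To make the unit convention concrete, here is a small Python sketch that converts an -Xmx value to bytes. The helper name is hypothetical. The text only mentions the m and g suffixes; the k suffix and the suffix-free form (plain bytes) are included on the assumption that the standard JVM launcher behaviour applies.

```python
import re

# Binary multipliers used by the JVM for heap-size suffixes.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def xmx_to_bytes(flag: str) -> int:
    """Convert a flag such as '-Xmx1024m' or '-Xmx1G' to a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError(f"not a valid -Xmx flag: {flag!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit.lower()]


if __name__ == "__main__":
    for flag in ("-Xmx1024m", "-Xmx1024M", "-Xmx1g", "-Xmx1G"):
        print(flag, xmx_to_bytes(flag))
```

All four flags from the example above map to the same byte count.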

In order to allocate 3GB of RAM for the JVM, through R code, use:

options(java.parameters = "-Xmx3G")

When using fastreeR as a CLI, allocate RAM with the --mem MEM argument.

If the numbers of samples and variants are not known, a rough estimate of the required RAM is half the size of the uncompressed VCF file. For example, to process a VCF file that occupies 2GB of disk space when uncompressed, allocate 1GB of RAM.
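The rule of thumb above can be applied without decompressing the file to disk. The following Python sketch streams a gzip-compressed VCF to count its uncompressed bytes and halves the result; both function names are hypothetical and only illustrate the heuristic.

```python
import gzip
import math
import os


def uncompressed_size(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the uncompressed size in bytes.

    Plain files are measured with os.path.getsize; files ending in .gz
    are streamed chunk by chunk so nothing is written to disk.
    """
    if not path.endswith(".gz"):
        return os.path.getsize(path)
    total = 0
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def rough_mem_gb(path: str) -> int:
    """Half the uncompressed size, rounded up to whole GB (minimum 1)."""
    return max(1, math.ceil(uncompressed_size(path) / 2 / 10**9))
```

The returned integer is in the form expected by the --mem argument, which takes a value in GB.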

Installation and Usage

Via Conda

conda create -y -n fastreer-env -c bioconda fastreer && conda activate fastreer-env
fastreeR --help

Via Docker

fastreeR is available as a lightweight, multithreaded, platform-independent Docker image hosted on both DockerHub and GHCR.

From DockerHub:

docker pull gkanogiannis/fastreer:latest

Or from GitHub Container Registry (GHCR):

docker pull ghcr.io/gkanogiannis/fastreer:latest

To compute a tree directly from a VCF file:

docker run --rm -v $(pwd):/data gkanogiannis/fastreer:latest \
    VCF2TREE -i /data/input.vcf -o /data/output.nwk --threads 4

This:

Mounts your working directory $(pwd) inside the container
Reads input.vcf and writes output.nwk relative to your host
Uses 4 threads for faster computation

The Docker image includes:

Java 17
Python3
All required .jar libraries
The fastreeR.py CLI entry point

Example: FASTA to distance

docker run --rm -v $(pwd):/data gkanogiannis/fastreer \
    FASTA2DIST -i /data/sequences.fasta -o /data/sequences.dist -k 4 -t 2

Memory tuning

Use the --mem option to control how much memory is allocated to the Java backend:

docker run --rm -v $(pwd):/data gkanogiannis/fastreer \
    VCF2TREE -i /data/input.vcf -o /data/output.nwk --mem 128

Internally, this sets the Java heap to -Xmx128G.

As a PyPI Module

You can install the Python CLI directly from PyPI using:

pip install fastreer

This will install the fastreeR command-line tool (fastreer) and include the Java backend jars required for running all commands.

To check it installed correctly:

fastreeR --version

Via a Python CLI wrapper

Another easy way to use fastreeR is to run its Python CLI directly from a clone of the repository:

git clone https://github.com/gkanogiannis/fastreeR.git
python fastreeR/fastreeR.py

Note: If you want to use a custom backend location, set the environment variable FASTREER_JAR_DIR.

As an R package

To install fastreeR as an R package:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("fastreeR")

You can install the development version of fastreeR R package like so:

devtools::install_github("gkanogiannis/fastreeR")

From Java backend source

To build the Java backend from source code:

git clone https://github.com/gkanogiannis/fastreeR.git
git clone https://github.com/gkanogiannis/BioInfoJava-Utils.git
pushd BioInfoJava-Utils
mvn clean initialize package && popd

Then copy the resulting .jar file(s) to the fastreeR/inst/java/ directory:

cp BioInfoJava-Utils/bin/*.jar fastreeR/inst/java/

Finally run the tool from its Python CLI:

python fastreeR/fastreeR.py

Distances from VCF

Calculates a cosine type dissimilarity measurement between the n samples of a VCF file.

Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:

heterozygous variants: 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
variants homozygous for the reference allele: 0/0 or 0|0
variants homozygous for the first alternate allele: 1/1 or 1|1
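The encodings listed above can be told apart with a few lines of Python. This is a sketch of how a GT field might be classified, not fastreeR's own parsing code; the function name and the returned labels are hypothetical, and treating "." as a missing allele follows the VCF convention.

```python
import re


def classify_genotype(gt: str) -> str:
    """Classify a VCF GT string such as '0/1' or '1|1'.

    Both '/' (unphased) and '|' (phased) separators are accepted.
    """
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return "missing"
    if len(set(alleles)) == 1:
        # All alleles identical: homozygous for reference (0) or an alternate.
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"


if __name__ == "__main__":
    for gt in ("1/0", "0|2", "0/0", "1|1", "./."):
        print(gt, classify_genotype(gt))
```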

If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated. The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1] where value 0 means completely identical samples (cosine is 1), value 0.5 means perpendicular samples (cosine is 0) and value 1 means completely opposite samples (cosine is -1).
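The mapping from cosine similarity to the [0,1] range can be checked with a short Python sketch. How fastreeR encodes genotypes as numeric vectors is not described here, so the example works on generic numeric vectors; the function name is hypothetical.

```python
import math


def cosine_distance(x, y):
    """Return (1 - cosine_similarity) / 2 for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return (1.0 - dot / (norm_x * norm_y)) / 2.0


if __name__ == "__main__":
    print(cosine_distance([1, 0], [1, 0]))    # identical direction
    print(cosine_distance([1, 0], [0, 1]))    # perpendicular
    print(cosine_distance([1, 2], [-1, -2]))  # opposite direction
```

The three calls correspond to the three reference points named above: identical, perpendicular and opposite samples.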

The calculation is performed by a Java back-end implementation, that supports multi-core CPU utilization and can be demanding in terms of memory resources.

The output is a PHYLIP-compatible file containing n+1 lines. The first line contains the number n of samples and the number m of variants, separated by space. Each of the subsequent n lines contains n+1 values, separated by space. The first value of each line is a sample name and the remaining n values are the calculated distances from this sample to all the samples. Example output file with the distances of 3 samples calculated from 1000 variants:

3 1000
Sample1	0.0	0.5	0.2
Sample2	0.5	0.0	0.9
Sample3	0.2	0.9	0.0
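A file in this layout can be read back with a few lines of Python. The sketch below assumes whitespace-separated fields (spaces or tabs) and sample names without embedded whitespace; the function name is hypothetical.

```python
def parse_distance_matrix(text):
    """Parse the layout described above.

    Returns (names, matrix, n_variants), where matrix is a list of
    rows of floats and n_variants is None if the header omits it.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    n_samples = int(header[0])
    n_variants = int(header[1]) if len(header) > 1 else None

    names, matrix = [], []
    for line in lines[1:1 + n_samples]:
        fields = line.split()
        names.append(fields[0])
        matrix.append([float(value) for value in fields[1:]])

    if len(names) != n_samples or any(len(row) != n_samples for row in matrix):
        raise ValueError("matrix shape does not match the header")
    return names, matrix, n_variants


if __name__ == "__main__":
    example = (
        "3 1000\n"
        "Sample1\t0.0\t0.5\t0.2\n"
        "Sample2\t0.5\t0.0\t0.9\n"
        "Sample3\t0.2\t0.9\t0.0\n"
    )
    names, matrix, n_variants = parse_distance_matrix(example)
    print(names, n_variants, matrix[0])
```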

CLI Interface

The Python CLI (fastreeR.py) interfaces with the Java backend via subprocess, providing a unified command-line interface for all supported tools.

Commands

General Syntax

python3 fastreeR.py <COMMAND> [OPTIONS]

COMMAND	Description
`VCF2DIST`	Compute a cosine distance matrix from a VCF file
`VCF2TREE`	Compute a Newick NJ tree directly from a VCF
`DIST2TREE`	Compute a Newick NJ tree from a distance matrix
`FASTA2DIST`	Compute a D2S distance matrix from a FASTA file

Examples

Compute Distance Matrix from VCF

python fastreeR.py VCF2DIST -i input.vcf -o output.dist --threads 16 --verbose

Compute Newick NJ tree directly from a VCF file.

python fastreeR.py VCF2TREE -i input.vcf -o output.nwk --threads 16 --verbose

Compute Tree from Distance Matrix

python fastreeR.py DIST2TREE -i output.dist -o output.nwk

Input format: tab-separated PHYLIP-compatible matrix.
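For readers unfamiliar with neighbor joining, the following Python sketch shows the textbook algorithm turning a distance matrix into a Newick string. It is an illustration only: it is not the Java backend's implementation, it makes no attempt at speed, and the function name is hypothetical.

```python
def neighbor_joining(names, dist):
    """Textbook neighbor joining; returns an unrooted Newick string.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    """
    if len(names) < 3:
        raise ValueError("need at least 3 taxa")
    labels = list(names)
    d = [[float(v) for v in row] for row in dist]

    while len(labels) > 3:
        n = len(labels)
        totals = [sum(row) for row in d]
        # Pick the pair (i, j) that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from i and j to the new internal node.
        len_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        merged = f"({labels[i]}:{len_i:g},{labels[j]}:{len_j:g})"
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[idx]]
             for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        labels = [labels[k] for k in keep] + [merged]

    # Three nodes left: join them at a single trifurcation.
    a = 0.5 * (d[0][1] + d[0][2] - d[1][2])
    b = 0.5 * (d[0][1] + d[1][2] - d[0][2])
    c = 0.5 * (d[0][2] + d[1][2] - d[0][1])
    return f"({labels[0]}:{a:g},{labels[1]}:{b:g},{labels[2]}:{c:g});"


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    matrix = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(neighbor_joining(taxa, matrix))
```

The five-taxon matrix is a small additive example, so the branch lengths in the output reproduce the input distances exactly. With real, non-additive data, neighbor joining can produce negative branch lengths, which this sketch does not correct.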

Compute D2S k-mer distance matrix from a FASTA file.

python3 fastreeR.py FASTA2DIST -i seqs.fasta -o output.dist -k 4 -t 2 --normalize

Pipe input from gzip-compressed file

zcat input.vcf.gz | python fastreeR.py VCF2TREE -i - -o output.nwk
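Where zcat is not available (for example on Windows), the same piping can be done from Python. This is a sketch under the assumption that the target command reads its input from stdin, as the -i - option above indicates; the helper name is hypothetical.

```python
import gzip
import shutil
import subprocess


def stream_gzip_to_stdin(gz_path, command):
    """Decompress gz_path on the fly into the stdin of command.

    command is an argument list for subprocess; the child's exit code
    is returned. Nothing is decompressed to disk.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with gzip.open(gz_path, "rb") as source:
        try:
            shutil.copyfileobj(source, proc.stdin)
        finally:
            # Closing stdin signals end of input to the child process.
            proc.stdin.close()
    return proc.wait()


# Example call mirroring the shell pipeline above (not executed here):
# stream_gzip_to_stdin(
#     "input.vcf.gz",
#     ["python", "fastreeR.py", "VCF2TREE", "-i", "-", "-o", "output.nwk"],
# )
```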

Print version and citation

python fastreeR.py --version

Output Examples

Distance matrices: PHYLIP-compatible text
Trees: Newick format
Output is streamed line-by-line (suitable for large datasets)

Options (common to all commands)

-i, --input : Input file (VCF, FASTA or distance matrix). Use - for stdin.
-o, --output : Output file. If omitted, prints to stdout.
-t, --threads : Number of threads (default: 1).
--mem MEM : Max RAM for JVM in GB (default: 1).
--lib LIB : Path to the folder containing JAR libraries (default: inst/java)
--verbose : Print progress information to stderr.
--pipe-stderr : Pipe stderr and forward from Python (default: direct passthrough to terminal).
--version : Print version and citation information.

Integration with Java Backend

The CLI wraps tools from the BioInfoJava-Utils project and dynamically builds the Java classpath from all .jar files located in inst/java/.
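A minimal sketch of how such a classpath and command line might be assembled in Python is shown below. It is not the code in fastreeR.py: the function name is hypothetical, and the main class is left as a parameter because the actual entry-point class of BioInfoJava-Utils is not stated here.

```python
import glob
import os


def build_java_command(lib_dir, mem_gb, main_class, tool_args):
    """Build an argument list for launching a Java tool.

    lib_dir:    folder containing the .jar files (e.g. inst/java)
    mem_gb:     maximum heap in GB, turned into -Xmx<mem_gb>G
    main_class: fully qualified entry point (placeholder, see above)
    tool_args:  arguments forwarded to the Java tool
    """
    jars = sorted(glob.glob(os.path.join(lib_dir, "*.jar")))
    if not jars:
        raise FileNotFoundError(f"no .jar files found in {lib_dir}")
    # os.pathsep is ':' on Linux/macOS and ';' on Windows.
    classpath = os.pathsep.join(jars)
    return ["java", f"-Xmx{mem_gb}G", "-cp", classpath, main_class, *tool_args]
```

The returned list can be handed to subprocess.run; keeping it as a list rather than a single string avoids shell quoting problems with file paths.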

Integration with R

All core functionality is available via the fastreeR R package (Bioconductor/devel):

library(fastreeR)
tree <- vcf2tree("input.vcf")
plot(tree)

See fastreeR R manual and fastreeR R vignette for usage in R.

Sample data

Toy VCF, FASTA and distance sample data files are provided in inst/extdata.

samples.vcf.gz

Sample VCF file of 100 individuals and 1000 variants on chromosome 22, from the 1000 Genomes Project. Original file available at http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

vcfFile <- system.file("extdata", "samples.vcf.gz", package = "fastreeR")

samples.vcf.dist.gz

Distances from the previous sample VCF

vcfDist <- system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")

samples.vcf.istats

Individual statistics from the previous sample VCF

vcfIstats <- system.file("extdata", "samples.vcf.istats", package = "fastreeR")

samples.fasta.gz

Sample FASTA file of 48 random bacteria RefSeq from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ .

fastaFile <- system.file("extdata", "samples.fasta.gz", package = "fastreeR")

samples.fasta.dist.gz

Distances from the previous sample FASTA

fastaDist <- system.file("extdata", "samples.fasta.dist.gz", package = "fastreeR")

Citation

If you use fastreeR in your research, please cite:

Anestis Gkanogiannis (2016)
A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes
BMC Bioinformatics 17, 311.
https://doi.org/10.1186/s12859-016-1186-3
https://github.com/gkanogiannis/fastreeR

Author

Anestis Gkanogiannis
Website: https://www.gkanogiannis.com
ORCID: 0000-0002-6441-0688

License

fastreeR is licensed under the GNU General Public License v3.0.
See the LICENSE file for details.
