CheckU: UNI56 marker completeness profiling for microbial genomes.

CheckU

CheckU evaluates bacterial and archaeal genomes with the UNI56 universal single-copy marker set. The program reads amino acid FASTA files or nucleotide assemblies, calls genes with Pyrodigal when needed, and scores markers with PyHMMER. Results include completeness, contamination, and per-marker hit tables.

Requirements

  • FASTA inputs, plain or gzip-compressed (for example .faa, .fa, or .fna, optionally with a .gz suffix)

Installation (Recommended)

Make sure you have Pixi installed:

curl -fsSL https://pixi.sh/install.sh | sh

Install CheckU with Pixi:

pixi global install \
  -c conda-forge \
  -c bioconda \
  -c https://repo.prefix.dev/astrogenomics \
  checku

Quick test

CheckU ships with small test data sets. After installation, confirm that the pipeline works by running:

checku test

See the Expected Results section below for the expected output tables.

Alternative: pip (PyPI)

pip install checku

Developer install (Pixi)

If you want to download the code and develop locally:

git clone https://github.com/juanvillada/checku
cd checku
pixi install

Quick check

checku --help

If you are running from the repository with Pixi:

pixi run python -m checku --help

The command-line help should print without errors.

Input rules

  • Provide either a single FASTA file or a directory of FASTA files.
  • Protein files are used as-is. Nucleotide files trigger Pyrodigal gene calls.
  • Compressed files (.gz) are supported; they are unpacked into the run workspace.
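
The input rules above imply two checks per file: whether it is gzip-compressed, and whether it holds protein or nucleotide sequences. This README does not say how CheckU makes the second decision, so the sketch below uses a common heuristic (the fraction of A/C/G/T/U/N characters) purely as an assumption. The names `open_fasta` and `guess_sequence_type` and the 0.9 threshold are hypothetical and not part of CheckU.

```python
import gzip
from pathlib import Path

# Characters treated as nucleotide codes (assumption: unambiguous bases plus N).
NUCLEOTIDE_CHARS = set("ACGTUN")


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def guess_sequence_type(path, max_residues=10_000, threshold=0.9):
    """Return 'nucleotide' or 'protein' based on the first residues of a file."""
    total = 0
    nucleotide = 0
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA header lines
            residues = line.strip().upper()
            total += len(residues)
            nucleotide += sum(1 for ch in residues if ch in NUCLEOTIDE_CHARS)
            if total >= max_residues:
                break  # a sample of the file is enough for this heuristic
    if total == 0:
        raise ValueError(f"no sequence data found in {path}")
    return "nucleotide" if nucleotide / total >= threshold else "protein"
```

A pre-flight check like this can help catch a mislabeled file before a long batch run, but CheckU's own detection may differ.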

Running the pipeline

If you are running from the repository with Pixi, replace checku below with pixi run python -m checku.

The examples below use the bundled test data from a source checkout. Replace the paths with your own FASTA inputs, or run checku test after installation.

Pipeline overview

The diagram below shows the main stages executed by CheckU.

graph TD
    A([Start run]) --> B[Collect FASTA inputs from file or directory]
    B --> C[Materialize gzipped files under `work/` when needed]
    C --> D{Detect sequence type}
    D -->|Protein| E[Use supplied protein FASTA]
    D -->|Nucleotide| F[Predict proteins with Pyrodigal]
    F --> E
    E --> G[Search UNI56 HMMs with pyhmmer]
    G --> H[Aggregate marker hits and completeness statistics]
    H --> I[Write `checku_summary.tsv`]
    H --> J[Write `details/checku_presence.tsv`]
    H --> K[Write raw hit tables in `details/hits/`]
    H --> L[Update checkpoint data and logs]
    H -.-> M[Optional: delete predicted proteins when `--clean-intermediate`]
    I --> N([Pipeline complete])
    J --> N
    K --> N
    L --> N
    M --> N

Single proteome

checku run \
  checku/data/test_genomes/faa/IMGI2140918011.faa \
  --output-dir tmp/proteome_example \
  --cpus 4

Directory of proteomes

checku run \
  checku/data/test_genomes/faa \
  --output-dir tmp/proteome_batch \
  --cpus 8

Single assembly

checku run \
  checku/data/test_genomes/fna/IMG2140918011.fna \
  --output-dir tmp/assembly_example \
  --cpus 4 \
  --clean-intermediate

Use --clean-intermediate if you do not need the predicted protein FASTA after the run.
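
For scripted runs, the commands above can be assembled from Python. The sketch below is only an illustration: `build_checku_command` is a hypothetical helper, and it uses only the flags shown in the examples above (`--output-dir`, `--cpus`, `--clean-intermediate`).

```python
import shlex
import subprocess


def build_checku_command(input_path, output_dir, cpus=4, clean_intermediate=False):
    """Build the argument list for a `checku run` call."""
    cmd = [
        "checku", "run", str(input_path),
        "--output-dir", str(output_dir),
        "--cpus", str(cpus),
    ]
    if clean_intermediate:
        cmd.append("--clean-intermediate")
    return cmd


def run_checku(input_path, output_dir, **options):
    """Run CheckU and raise CalledProcessError if it exits with an error."""
    cmd = build_checku_command(input_path, output_dir, **options)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    example = build_checku_command(
        "checku/data/test_genomes/fna/IMG2140918011.fna",
        "tmp/assembly_example",
        cpus=4,
        clean_intermediate=True,
    )
    print(shlex.join(example))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.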

Custom marker sets

  • The default marker file ships with CheckU (UNI56).
  • Point --hmm to a different GA-calibrated .hmm file or to a directory that holds .hmm or .hmm.gz profiles.
  • Every profile must define GA cutoffs. The run stops early if a profile is missing them or if names are duplicated.

Example:

checku run \
  /path/to/genomes \
  --hmm /path/to/custom_markers.hmm \
  --output-dir tmp/custom_markers \
  --cpus 8
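
Before starting a long run with a custom marker file, it can be useful to check the two conditions listed above (GA cutoffs present, no duplicated names). The sketch below scans the header lines of HMMER3 ASCII profiles, where each profile has a `NAME` line, an optional `GA` line, and ends with `//`. It is an independent pre-check written under that assumption, not CheckU's own validation code, and `check_profiles` is a hypothetical name.

```python
from collections import Counter


def check_profiles(hmm_text):
    """Return (names_missing_ga, duplicated_names) for HMMER3 ASCII text."""
    names = []
    missing_ga = []
    name, has_ga = None, False
    for line in hmm_text.splitlines():
        if not line or line[0].isspace():
            continue  # indented lines belong to the model body, not the header
        fields = line.split(maxsplit=1)
        tag = fields[0]
        if tag == "NAME" and len(fields) > 1:
            name = fields[1].strip()
        elif tag == "GA":
            has_ga = True
        elif tag == "//":
            # End of one profile: record it and reset the per-profile state.
            if name is not None:
                names.append(name)
                if not has_ga:
                    missing_ga.append(name)
            name, has_ga = None, False
    duplicated = sorted(n for n, count in Counter(names).items() if count > 1)
    return missing_ga, duplicated
```

For a `.hmm.gz` file, the text can be read with `gzip.open(path, "rt")` before calling the function.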

Outputs

All outputs live in the chosen --output-dir.

  • checku_summary.tsv — per-genome summary with completeness, contamination, duplicate counts, and Pyrodigal gene statistics.
  • details/checku_presence.tsv — marker presence/absence matrix.
  • details/hits/*.tsv — raw pyhmmer hits with domain scores.
  • checkpoint/checku_checkpoint.json — resume data for interrupted runs.
  • logs/checku.log — timestamps, command line, and status messages.
  • Output tables and logs record input/output locations using absolute paths for reproducibility.
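
The summary table can be post-processed with the Python standard library. The sketch below assumes that checku_summary.tsv is tab-separated with a header row containing the columns `genome_id`, `completeness`, and `contamination`, as in the expected-results tables in this README. The function names and the thresholds of 90 and 5 are hypothetical, chosen only for illustration.

```python
import csv


def load_summary(path):
    """Read checku_summary.tsv into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_genomes(rows, min_completeness=90.0, max_contamination=5.0):
    """Return genome IDs that pass both (illustrative) quality thresholds."""
    return [
        row["genome_id"]
        for row in rows
        if float(row["completeness"]) >= min_completeness
        and float(row["contamination"]) <= max_contamination
    ]
```

Typical use would be `filter_genomes(load_summary("tmp/proteome_batch/checku_summary.tsv"))`, with the path adjusted to the chosen --output-dir.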

Resume and logging

  • Runs resume automatically when --resume is left on (default).
  • Use --no-resume to start fresh; the older checkpoint is copied aside.
  • Increase --log-level to DEBUG when you need extra detail.

Verification step

To verify an installation end to end, run the bundled test command:

checku test

The command should finish without errors and produce the summary and presence tables described above.

If you are running from the repository with Pixi:

pixi run python -m checku test

Expected results (Bundled test data)

The tables below summarize the expected checku_summary.tsv values for the bundled FAA and FNA test sets. Absolute paths (input/protein columns in the real table) are omitted for privacy.

FAA (protein inputs):

genome_id                                markers_detected  completeness  duplicated_markers  contamination
IMGI2140918011                           55                98.21         0                   0.0
IMGI2645727657                           56                100.0         0                   0.0
IMGI651324087                            56                100.0         0                   0.0
IMGM3300027739_BIN74                     36                64.29         0                   0.0
SCISO2808607008                          55                98.21         1                   1.79
SDISOGCA_003484685.1                     47                83.93         1                   1.79
SHISO2654587767                          55                98.21         1                   1.79
SLISOGCF_900639865.1                     56                100.0         1                   1.79
SRISO640427127                           52                92.86         0                   0.0
SXGCA_000019745.1                        55                98.21         0                   0.0
SXGCA_902860225.1_Azoamicus_ciliaticola  51                91.07         0                   0.0
SXISO642555114                           54                96.43         1                   1.79

FNA (nucleotide inputs with Pyrodigal):

genome_id           markers_detected  completeness  duplicated_markers  contamination  pyrodigal_genes  pyrodigal_contigs
IMG2140918011       56                100.0         0                   0.0            2974             78
IMG2645727657       56                100.0         0                   0.0            1516             1
IMG2645727657_HALF  46                82.14         0                   0.0            821              1
IMG651324087        56                100.0         0                   0.0            2572             73