GenoChar: summarize-first genome characterization workflow with one-time managed setup for Prokka and CheckM2
GenoChar
GenoChar generates publication-ready genome characterization tables for bacterial and archaeal genomes.
Version 0.6.3.2 keeps the single-command, summarize-first interface, but adds a practical solution for external-tool dependency conflicts:
- genochar builds the final genome characterization table
- genochar setup prepares managed Prokka and CheckM2 conda environments under ~/.genochar
- later, a normal genochar ... --check --annotate prokka run automatically calls those managed environments with conda run -p ...
That means users do not need to manually solve a shared Prokka/CheckM2 environment.
The main output is a wide table with one row per strain. Optional outputs include a feature-style table and an Excel workbook.
Naming note: the project/display name is GenoChar, while the Python package name and command-line executable remain lowercase as genochar.
What the tool computes
FASTA-derived fields
These work even if you provide only FASTA files:
- Strain
- Strain name
- Genus
- Species
- Accession
- Genome size (bp)
- GC content (%)
- No. of contigs
- N50 (bp)
- N90 (bp)
- L50
- L90
- Longest contig (bp)
- Gaps (N per 100 kb)
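As a rough illustration of how the sequence-derived fields above can be obtained, here is a minimal Python sketch. It is not GenoChar's actual code: the function name assembly_stats is hypothetical, and it assumes GC content and gap density are computed over the full assembly length, including N bases.

```python
def assembly_stats(contig_seqs):
    """Compute basic assembly statistics from a list of contig sequences."""
    lengths = sorted((len(s) for s in contig_seqs), reverse=True)
    total = sum(lengths)
    joined = "".join(contig_seqs).upper()
    gc = joined.count("G") + joined.count("C")
    n_count = joined.count("N")

    def nx_lx(fraction):
        # Walk contigs from longest to shortest until the running sum
        # reaches the requested fraction of the assembly length.
        running = 0
        for rank, length in enumerate(lengths, start=1):
            running += length
            if running >= total * fraction:
                return length, rank
        raise ValueError("empty assembly")

    n50, l50 = nx_lx(0.5)
    n90, l90 = nx_lx(0.9)
    return {
        "Genome size (bp)": total,
        # Assumption: percentages use the full length, N bases included.
        "GC content (%)": round(100 * gc / total, 2),
        "No. of contigs": len(lengths),
        "N50 (bp)": n50,
        "L50": l50,
        "N90 (bp)": n90,
        "L90": l90,
        "Longest contig (bp)": lengths[0],
        "Gaps (N per 100 kb)": round(n_count / total * 100_000, 2),
    }
```

N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the assembly; L50 is the number of contigs needed to get there. N90 and L90 follow the same rule at 90%.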
GFF-derived fields
These are added when you provide GFF files or ask GenoChar to create annotation files:
- CDSs
- tRNAs
- rRNAs
- tmRNA
- misc RNA
- Repeat regions
- 16S rRNA count
- 16S rRNA length (bp)
- 16S rRNA contig
- 16S rRNA sequence
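The feature counts above can be pictured as a tally over the type column (column 3) of a GFF3 file. The sketch below is illustrative only: the function name and the mapping from GFF feature types to table columns are assumptions, not taken from the GenoChar source.

```python
from collections import Counter

# Assumed mapping from GFF3 feature types to output column names.
GFF_TYPE_TO_COLUMN = {
    "CDS": "CDSs",
    "tRNA": "tRNAs",
    "rRNA": "rRNAs",
    "tmRNA": "tmRNA",
    "misc_RNA": "misc RNA",
    "repeat_region": "Repeat regions",
}

def count_gff_features(gff_text):
    """Count GFF3 features per output column, skipping comments and FASTA."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # Prokka-style GFF files embed the genome after this marker
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9:  # a well-formed GFF3 feature line has 9 columns
            counts[cols[2]] += 1
    return {col: counts[ftype] for ftype, col in GFF_TYPE_TO_COLUMN.items()}
```

Feature types that do not occur in the file are reported as 0 rather than omitted, so every row of the final table has the same columns.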
CheckM2-derived fields
These are added when you provide or generate a CheckM2 report:
- Completeness (%)
- Contamination (%)
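For orientation, a CheckM2 quality_report.tsv is a tab-separated file whose first columns are Name, Completeness, and Contamination. The sketch below shows one way such a report could be read. The function name is hypothetical, the sample values are made up, and how GenoChar matches report names to strains is not covered.

```python
import csv
import io

def read_checkm2_report(handle):
    """Map genome name -> (completeness, contamination) from a CheckM2 report."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Name"]: (float(row["Completeness"]), float(row["Contamination"]))
        for row in reader
    }

# Made-up report content used only to demonstrate the call.
demo = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "IOH03\t99.5\t0.4\n"
)
print(read_checkm2_report(demo))
```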
User-supplied metadata
Optional input tables can add:
- Sequencing coverage (×)
- Sequencing platforms
- Assembly method
- Genus
- Species
- Accession
- Repeat regions
Default output columns
The main output table contains:
- Strain
- Strain name
- Genus
- Species
- Accession
- Genome size (bp)
- GC content (%)
- No. of contigs
- N50 (bp)
- N90 (bp)
- L50
- L90
- Longest contig (bp)
- Gaps (N per 100 kb)
- Sequencing coverage (×)
- Sequencing platforms
- Assembly method
- CDSs
- tRNAs
- rRNAs
- tmRNA
- misc RNA
- Repeat regions
- 16S rRNA count
- 16S rRNA length (bp)
- 16S rRNA contig
- 16S rRNA sequence
- Completeness (%)
- Contamination (%)
Installation
Clone the repository and install GenoChar into the active Python environment:
git clone https://github.com/ljunwon1114/GenoChar.git
cd GenoChar
pip install -e .
If you just want the latest GitHub version without cloning for development, you can also use:
pip install git+https://github.com/ljunwon1114/GenoChar.git
Core Python requirement
GenoChar itself is lightweight. The core package now declares:
requires-python >=3.10
The external tools are the difficult part, so v0.6.3.2 no longer assumes that Prokka and CheckM2 should share the same environment.
One-time managed setup (recommended)
Run this once:
genochar setup
This creates managed environments under ~/.genochar, typically:
~/.genochar/
  config.json
  envs/
    prokka/
    checkm2/
  databases/
    CheckM2_database/
After that, normal workflow commands automatically use those managed environments when --annotate prokka and/or --check are requested.
When --check is used, GenoChar now passes the resolved input FASTA files directly to checkm2 predict --input ..., matching the official CheckM2 interface that accepts either a folder of bins or a list of FASTA files.
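To make the mechanism concrete, the sketch below assembles a command line of the kind described above: conda run -p pointing at an environment prefix, followed by checkm2 predict with a list of FASTA files. The function name, the output directory, and the choice to pass --threads are assumptions for illustration; the exact arguments GenoChar builds internally may differ.

```python
def checkm2_command(env_prefix, fastas, out_dir, threads=1):
    """Build an argv list that runs CheckM2 inside a conda environment prefix."""
    return [
        "conda", "run", "-p", str(env_prefix),
        "checkm2", "predict",
        # CheckM2 accepts a folder of bins or a list of FASTA files here.
        "--input", *[str(f) for f in fastas],
        "--output-directory", str(out_dir),
        "--threads", str(threads),
    ]

# Hypothetical paths; the list could be handed to subprocess.run(cmd, check=True).
cmd = checkm2_command("/path/to/envs/checkm2", ["IOH03.fasta", "IOH05.fasta"], "checkm2_out", threads=8)
print(" ".join(cmd))
```

Building the command as a list rather than a shell string avoids quoting problems when FASTA paths contain spaces.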
Reuse an existing CheckM2 database
If you already downloaded the CheckM2 database, point setup at it directly:
genochar setup --checkm2-db /home/jwlee/databases/CheckM2_database/uniref100.KO.1.dmnd
You can also pass a directory that contains the .dmnd file.
Optional setup flags
genochar setup --skip-prokka
genochar setup --skip-checkm2
genochar setup --force
- --skip-prokka: only prepare CheckM2
- --skip-checkm2: only prepare Prokka
- --force: recreate managed environments even if they already exist
Command overview
A. FASTA only
genochar -i "assemblies/*.fasta" -o genome_characterization.tsv
B. FASTA + existing GFF + existing CheckM2 report
genochar -i "assemblies/*.fasta" --gff "annotations/*.gff*" --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
C. FASTA + managed CheckM2 first + managed Prokka annotation
genochar -i "assemblies/*.fasta" --check --annotate prokka -k Archaea -t 8 -w genochar_work -o genome_characterization.tsv
D. Reuse existing GFF files automatically
genochar -i "assemblies/*.fasta" --annotate existing --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
E. Reuse explicitly supplied GFF files in existing-annotation mode
genochar -i "assemblies/*.fasta" --annotate existing --gff "annotations/*.gff*" --check-report checkm2_out/quality_report.tsv -o genome_characterization.tsv
Optional extra outputs
Feature-style table
genochar -i "assemblies/*.fasta" -f genome_characterization_feature.tsv -o genome_characterization.tsv
Excel workbook
genochar -i "assemblies/*.fasta" -x genome_characterization.xlsx -o genome_characterization.tsv
Coverage input
Coverage cannot be derived from FASTA alone. If you want to fill Sequencing coverage (×), provide a coverage table.
Example:
Strain Coverage
IOH03 55.7
IOH05 50.3
or
Strain Total bases
IOH03 110.8 Mbp
IOH05 107.6 Mbp
If Total bases is provided, GenoChar computes:
Sequencing coverage (×) = Total bases / Genome size
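The formula above can be worked through in a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the accepted unit suffixes (bp, kbp, Mbp, Gbp) are a guess at what a Total bases column might contain, and the genome size in the example is a made-up round number, not a value from the tables above.

```python
import re

UNIT_FACTORS = {"bp": 1, "kbp": 1e3, "mbp": 1e6, "gbp": 1e9}

def parse_total_bases(text):
    """Convert strings such as '110.8 Mbp' into a number of bases."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?bp)?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"cannot parse total bases: {text!r}")
    unit = (match.group(2) or "bp").lower()  # a bare number is taken as bp
    return float(match.group(1)) * UNIT_FACTORS[unit]

def sequencing_coverage(total_bases, genome_size_bp):
    """Sequencing coverage (x) = Total bases / Genome size."""
    return parse_total_bases(total_bases) / genome_size_bp

# Hypothetical 2.0 Mbp genome sequenced to 110.8 Mbp of reads.
print(round(sequencing_coverage("110.8 Mbp", 2_000_000), 1))
```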
Metadata input
Optional metadata columns include:
- Strain
- Sequencing platforms
- Assembly method
- Genus
- Species
- Accession
- Repeat regions
- Sequencing coverage (×)
Example:
Strain  Genus         Species        Accession      Sequencing platforms  Assembly method
IOH03   Thermococcus  waiotapuensis  GCF_032304395  Illumina iSeq 100     Unicycler (short-read assembly)
IOH05   Thermococcus  sp.            GCA_000000000  Illumina iSeq 100     Unicycler (short-read assembly)
Notes
- GenoChar is summarize-first by default. If you only pass FASTA, GFF, CheckM2, coverage, and metadata inputs, it behaves like a direct summarization tool.
- genochar setup is the recommended way to prepare Prokka and CheckM2 without forcing them into one shared environment.
- --annotate prokka tells GenoChar to create annotation files before building the final table.
- --annotate existing tells GenoChar to reuse nearby GFF files or explicitly supplied --gff inputs.
- --check runs CheckM2 internally before annotation and automatically integrates the resulting quality_report.tsv into the final table.
- --check-report reuses an existing CheckM2 quality_report.tsv file.
- --check and --check-report are mutually exclusive.
- --gff is intended for existing annotation files and should not be combined with --annotate prokka.
- If more than one 16S rRNA feature is found, GenoChar stores the longest 16S sequence in the main table.
- Legacy genochar summarize ... and genochar pipeline ... calls are still accepted in v0.6.3, but the preferred interface is the single-command form shown above.
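The "longest 16S" rule from the notes can be sketched as follows. This is an illustration, not GenoChar's implementation: the function name is hypothetical, and it assumes 16S features are rRNA lines whose attributes contain the product text "16S ribosomal RNA", as Prokka-style annotations typically do. When two features tie on length, the first one encountered is kept.

```python
def longest_16s(gff_text, contigs):
    """Return (contig, length, sequence) of the longest 16S rRNA, or None."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    best = None
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded genome sequence follows; no more features
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        # Assumption: 16S features are identified by their product text.
        if cols[2] != "rRNA" or "16S ribosomal RNA" not in cols[8]:
            continue
        start, end = int(cols[3]), int(cols[4])  # GFF is 1-based, inclusive
        seq = contigs[cols[0]][start - 1:end]
        if cols[6] == "-":
            seq = seq.translate(complement)[::-1]  # reverse complement
        if best is None or len(seq) > best[1]:
            best = (cols[0], len(seq), seq)
    return best
```

Here contigs is a dict mapping contig names to their sequences, for example as parsed from the input FASTA file.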
License
MIT