PyO3 bindings and Python interface to FragGeneScanRs, a gene prediction model for short and error-prone reads.

Maintainers

tomdstanton

🔗🐍⏭️ `pyfgs`

PyO3 bindings and Python interface to FragGeneScanRs, a gene prediction model for short and error-prone reads.

🗺️ Why `pyfgs`?

Built for noisy data

Standard ab initio predictors (like Prodigal or Pyrodigal) are fantastic for pristine, fully assembled contigs. However, they struggle with raw metagenomic reads or error-prone assemblies because they immediately break the open reading frame at the first sign of an indel. pyfgs uses an error-tolerant Hidden Markov Model trained on specific sequencing profiles (Illumina, 454, Sanger) to power through these sequencing errors, dynamically correct the reading frame, and salvage the translated protein.

Native frameshift tracking

Instead of silently stitching broken genes together, pyfgs exposes the exact coordinates of every base the model treats as inserted or skipped directly to Python. This allows you to rigorously track structural variants, correctly annotate INSDC-compliant pseudogenes, or export exact frameshift coordinates for downstream quality control.
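As a rough illustration of the kind of downstream quality control this enables, the sketch below tallies frameshift records for one gene. The `Frameshift` tuple is a hypothetical stand-in, not a pyfgs class: it only mirrors the `mut_type` and `pos` attributes that appear in the usage example further down, and the `"ins"`/`"del"` values and the coordinate convention are assumptions.

```python
from collections import Counter
from typing import NamedTuple


class Frameshift(NamedTuple):
    """Hypothetical stand-in for a mutation record (not the pyfgs type)."""
    mut_type: str  # assumed to be "ins" or "del"
    pos: int       # position on the contig; coordinate convention assumed


def summarise_frameshifts(mutations):
    """Count insertions and deletions and list their positions in order."""
    counts = Counter(m.mut_type for m in mutations)
    return {
        "insertions": counts.get("ins", 0),
        "deletions": counts.get("del", 0),
        "positions": sorted(m.pos for m in mutations),
    }


records = [Frameshift("ins", 120), Frameshift("del", 45), Frameshift("ins", 300)]
print(summarise_frameshifts(records))
```

With real data, the records would come from the predictor rather than being built by hand; the summary logic itself does not depend on pyfgs.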

No subprocess I/O tax

Running standard CLI bioinformatics tools from Python usually requires a heavy I/O penalty: dumping sequences to a temporary FASTA file, firing a subprocess, and parsing the text outputs back into memory. pyfgs binds directly to the underlying Rust engine. The HMM runs entirely in memory and yields native Python objects ready for immediate analysis.

True multithreading and zero-copy memory

pyfgs is designed to process massive datasets efficiently:

- GIL-Free Inference: The Rust backend completely releases the Python Global Interpreter Lock (GIL) during the heavy HMM math. You can drop the predictor into a standard ThreadPoolExecutor and achieve true parallel processing across all your CPU cores.
- Zero-Copy Bytes: The engine borrows raw byte slices (&[u8]) directly from Python's memory, bypassing the overhead of copying strings between languages.
- Lazy Translation: Translating DNA to amino acids is computationally expensive. pyfgs evaluates sequence strings lazily, meaning you only pay the CPU and memory cost of string allocation if you explicitly request the sequence data.
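The lazy-translation idea can be sketched in pure Python. This is not how pyfgs implements it (that happens in Rust); `LazyGene` is a hypothetical class used only to show the pattern of deferring translation until the protein is first requested, then caching it. The codon table is the standard genetic code (NCBI translation table 1).

```python
from functools import cached_property
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


class LazyGene:
    """Hypothetical illustration of lazy translation (not the pyfgs Gene class)."""

    def __init__(self, dna: bytes):
        self._dna = dna
        self.translate_calls = 0  # counts how often translation really runs

    @cached_property
    def protein(self) -> str:
        # Runs only on first access; the result is cached on the instance.
        self.translate_calls += 1
        seq = self._dna.decode("ascii").upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        return "".join(
            CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
        )


gene = LazyGene(b"ATGGCCTAA")
print(gene.translate_calls)  # nothing has been translated yet
print(gene.protein)
```

Creating many `LazyGene` objects costs almost nothing until `protein` is read, which is the trade-off the bullet above describes.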

A Pythonic API

Bioinformatics coordinates are notoriously messy. pyfgs outputs standard 0-based, half-open intervals ([start, end)), allowing you to slice sequence arrays immediately without wrestling with 1-based GFF3 coordinate math. When you do need standardized files, it includes heavily optimized, native-Rust context managers to stream perfectly compliant VCF, BED, GFF3, and FASTA files directly to disk without bloating your RAM.
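To make the coordinate convention concrete, here is a minimal sketch of converting between 0-based half-open intervals and the 1-based inclusive coordinates that GFF3 uses. The helper names `to_gff3` and `from_gff3` are made up for this example and are not part of pyfgs.

```python
def to_gff3(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start + 1, end


def from_gff3(start: int, end: int) -> tuple[int, int]:
    """Convert 1-based inclusive coordinates to a 0-based half-open interval."""
    return start - 1, end


contig = b"AAATGGCCTAAGG"
start, end = 2, 11  # 0-based, half-open

# Half-open intervals slice directly, and the length is simply end - start.
print(contig[start:end])
print(end - start)

# The same feature expressed in GFF3 coordinates.
print(to_gff3(start, end))
```

Only the start shifts by one; the end value is numerically the same in both systems because the half-open end is exclusive and the GFF3 end is inclusive.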

🔧 Installing

This project is supported on Python 3.10 and later.

pyfgs can be installed directly from PyPI:

pip install pyfgs

💻 Usage

API Usage

For full API usage, please refer to the documentation.

import concurrent.futures
import pyfgs
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio import SeqIO

def process_contig(header_bytes: bytes, seq_bytes: bytes, finder: pyfgs.GeneFinder):
    """
    Worker function to find genes.
    Because find_genes() drops the GIL, this runs in true parallel.
    """
    genes = finder.find_genes(header_bytes, seq_bytes)
    return header_bytes, seq_bytes, genes

def main():
    # 1. Initialize the GeneFinder
    # We use the 'Complete' model for high-quality assemblies, but strictly
    # set whole_genome=False to force the HMM to hunt for frameshifts.
    finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)

    # 2. Stream the genome using the zero-allocation Rust FastaReader
    reader = pyfgs.FastaReader("bacterial_assembly.fasta")

    results = []

    # 3. Process all contigs concurrently
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit all contigs to the thread pool
        futures = [
            executor.submit(process_contig, header, seq, finder)
            for header, seq in reader
        ]

        # Gather results as they complete
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())

    # 4. Format into INSDC-compliant GenBank records
    records = []
    for header_bytes, seq_bytes, genes in results:
        header_str = header_bytes.decode('utf-8')
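        record = SeqRecord(
            Seq(seq_bytes.decode('utf-8')),
            id=header_str,
            name=header_str,
            description="Annotated by pyfgs",
            # Biopython's GenBank writer raises an error without a molecule_type
            annotations={"molecule_type": "DNA"},
        )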

        for i, gene in enumerate(genes):
            # Query the Rust backend for structural variants
            mutations = gene.mutations(seq_bytes)

            # INSDC Standard: Frameshifted ORFs cannot be 'CDS', must be 'pseudogene'
            feature_type = "pseudogene" if mutations else "CDS"

            qualifiers = {
                "source": "pyfgs",
                "inference": "ab initio prediction:pyfgs",
                "ID": f"{header_str}_FGS_{i+1}"
            }

            if mutations:
                qualifiers["pseudogene"] = ["unknown"]
                notes = []
                for mut in mutations:
                    mut_name = "insertion" if mut.mut_type == "ins" else "deletion"
                    # Include our Snippy-style variant notation
                    notes.append(f"Frameshift {mut_name} at pos {mut.pos} (codon {mut.codon_idx}). {mut.annotation}")
                qualifiers["note"] = notes
            else:
                # Only strictly intact CDS features receive a translation qualifier
                qualifiers["translation"] = [gene.translation().decode('utf-8')]

            # Biopython's FeatureLocation is natively 0-based and half-open,
            # mapping perfectly to our Gene.start and Gene.end!
            location = FeatureLocation(gene.start, gene.end, strand=gene.strand)
            feature = SeqFeature(location=location, type=feature_type, qualifiers=qualifiers)
            record.features.append(feature)

        records.append(record)

    # 5. Export to GenBank
    output_file = "annotated_genome.gbk"
    with open(output_file, "w") as out_handle:
        SeqIO.write(records, out_handle, "genbank")

if __name__ == "__main__":
    main()
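One detail of the example above: `as_completed` yields futures in completion order, so the records may not follow the contig order of the input file. If order matters, `executor.map` returns results in input order. The sketch below shows this with a stand-in worker (`fake_find_genes` is hypothetical and simply returns the sequence length) so that it runs without pyfgs installed.

```python
import concurrent.futures


def fake_find_genes(seq: bytes) -> int:
    """Stand-in for a real gene-finding worker; returns the sequence length."""
    return len(seq)


def run_in_order(contigs):
    """Process (header, sequence) pairs in parallel, keeping input order."""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map yields results in the order the inputs were given,
        # regardless of which task finishes first.
        results = executor.map(fake_find_genes, [seq for _, seq in contigs])
        return [(header, result) for (header, _), result in zip(contigs, results)]


contigs = [(b"contig_1", b"ACGTACGT"), (b"contig_2", b"AC"), (b"contig_3", b"ACGT")]
print(run_in_order(contigs))
```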

CLI Usage

For CLI usage, type pyfgs --help

usage: pyfgs <seq> [options]

🔗🐍⏭️	PyO3 bindings and Python interface to FragGeneScanRs,
	a gene prediction model for short and error-prone reads.

Input options 💽:

  seq                 Sequence file (or '-' for stdin)
  -m, --model         Sequence error model (default: complete)
                       - short1: Illumina sequencing reads with about 0.1% error rate
                       - short5: Illumina sequencing reads with about 0.5% error rate
                       - short10: Illumina sequencing reads with about 1% error rate
                       - sanger5: Sanger sequencing reads with about 0.5% error rate
                       - sanger10: Sanger sequencing reads with about 1% error rate
                       - pyro5: 454 pyrosequencing reads with about 0.5% error rate
                       - pyro10: 454 pyrosequencing reads with about 1% error rate
                       - pyro30: 454 pyrosequencing reads with about 3% error rate
                       - complete: Complete genomic sequences or short sequence reads without sequencing error
  -r, --reads         Force FASTQ parsing (Overrides auto-detection)
  -w, --whole-genome  Strict contiguous ORFs. Disables error-tolerant frameshift detection.

Output options ⚙️:
  Provide a PATH to save to a file, or use the flag alone to print to stdout.

  --faa [PATH]        Output protein FASTA
  --fna [PATH]        Output nucleotide FASTA
  --bed [PATH]        Output BED6+1 format
  --gff [PATH]        Output GFF3 format
  --vcf [PATH]        Output VCF v4.2 format

Other options 🚧:

  -t, --threads       Number of threads (default: optimal)
  -v, --version       Print version and exit
  -h, --help          Print help and exit
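If the CLI needs to be driven from a script, an argument list can be assembled from the flags listed in the help text above. This is only a sketch: the file names are hypothetical, the argument order is inferred from the usage line, and the command is run only when a `pyfgs` executable and the input file are actually present.

```python
import os
import shutil
import subprocess


def build_pyfgs_command(seq_path, model="complete", gff_path=None, threads=None):
    """Assemble an argument list using only flags shown in the help text."""
    command = ["pyfgs", seq_path, "-m", model]
    if gff_path is not None:
        command += ["--gff", gff_path]
    if threads is not None:
        command += ["-t", str(threads)]
    return command


# Hypothetical file names, used for illustration only.
command = build_pyfgs_command(
    "reads.fastq", model="short5", gff_path="genes.gff", threads=4
)
print(" ".join(command))

# Run only if the executable is installed and the input file exists.
if shutil.which("pyfgs") and os.path.exists("reads.fastq"):
    subprocess.run(command, check=True)
```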

🔖 Citation

For now, please cite the original FragGeneScanRs paper:

Van der Jeugt, F., Dawyndt, P. & Mesuere, B. FragGeneScanRs: faster gene prediction for short reads. BMC Bioinformatics 23, 198 (2022). https://doi.org/10.1186/s12859-022-04736-5

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License v3.0. The FragGeneScanRs code was written by Peter Dawyndt, Bart Mesuere and Felix Van der Jeugt and is distributed under the terms of the GPLv3 as well. See https://github.com/FragGeneScanRs/LICENSE for more information.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original FragGeneScanRs authors Peter Dawyndt, Bart Mesuere and Felix Van der Jeugt. It was developed by Tom Stanton during his postdoctoral project at Monash University in the Wyres Lab.

Release history

0.0.1

Mar 26, 2026

This version

0.0.1b1 pre-release

Mar 24, 2026

0.0.1a3 pre-release

Mar 23, 2026

0.0.1a2 pre-release

Mar 20, 2026

0.0.1a1 pre-release

Mar 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyfgs-0.0.1b1.tar.gz (95.0 kB)

Uploaded Mar 24, 2026, Source

Built Distributions

pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (4.2 MB)

Uploaded Mar 24, 2026, CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl (2.0 MB)

Uploaded Mar 24, 2026, CPython 3.12, Windows x86-64

pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl (2.1 MB)

Uploaded Mar 24, 2026, CPython 3.12, manylinux: glibc 2.34+ x86-64

File details

Details for the file pyfgs-0.0.1b1.tar.gz.

File metadata

Download URL: pyfgs-0.0.1b1.tar.gz
Upload date: Mar 24, 2026
Size: 95.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyfgs-0.0.1b1.tar.gz
Algorithm	Hash digest
SHA256	`dd590e0c85f346565bb4bae26a7405b6d079069a9c937448a4bab1b1373793c1`
MD5	`1e74f8b22446ce9cac4d7ca9cf8635f2`
BLAKE2b-256	`c210e757edca8e6c94a99a7b1c6a1f6b89788cb06faac8fb517b7b90ea445a05`
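A downloaded file can be checked against a published SHA256 digest with the standard library. The sketch below is a generic helper, not part of pyfgs; `sha256_matches` is a name chosen for this example, and the file path in the usage note is assumed to be wherever the archive was saved.

```python
import hashlib


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, `sha256_matches("pyfgs-0.0.1b1.tar.gz", "dd590e0c85f346565bb4bae26a7405b6d079069a9c937448a4bab1b1373793c1")` should return `True` for an intact copy of the source distribution listed above.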

Provenance

The following attestation bundles were made for pyfgs-0.0.1b1.tar.gz:

Publisher: publish.yml on tomdstanton/pyfgs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyfgs-0.0.1b1.tar.gz
- Subject digest: dd590e0c85f346565bb4bae26a7405b6d079069a9c937448a4bab1b1373793c1
- Sigstore transparency entry: 1170950152
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: tomdstanton/pyfgs@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Branch / Tag: refs/tags/v0.0.1-beta.1
- Owner: https://github.com/tomdstanton
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Trigger Event: push

File details

Details for the file pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

Download URL: pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Upload date: Mar 24, 2026
Size: 4.2 MB
Tags: CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm	Hash digest
SHA256	`d2a2f66b5bb987249296ac942ee67a9e739f82de99e42751a86789445cd940e7`
MD5	`c33c949e3be5c35c87d9172c286a6ecf`
BLAKE2b-256	`825dac902e72cfba54020fd2809d8d1aa880cf08ad1917f4d18d23e4bf05ed42`

Provenance

The following attestation bundles were made for pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

Publisher: publish.yml on tomdstanton/pyfgs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyfgs-0.0.1b1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
- Subject digest: d2a2f66b5bb987249296ac942ee67a9e739f82de99e42751a86789445cd940e7
- Sigstore transparency entry: 1170950250
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: tomdstanton/pyfgs@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Branch / Tag: refs/tags/v0.0.1-beta.1
- Owner: https://github.com/tomdstanton
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Trigger Event: push

File details

Details for the file pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl.

File metadata

Download URL: pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl
Upload date: Mar 24, 2026
Size: 2.0 MB
Tags: CPython 3.12, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl
Algorithm	Hash digest
SHA256	`88312f3c1958ea7a3b2b69f8a42becd6ceb96f1b2d4233568db39470e170f396`
MD5	`23aa436b5c0565db8301d9dafd8a1997`
BLAKE2b-256	`7847e50332c507cd3ac936aba1b0800f4bf3b10235f1c65cf5b0b8484581e3e0`

Provenance

The following attestation bundles were made for pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl:

Publisher: publish.yml on tomdstanton/pyfgs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyfgs-0.0.1b1-cp312-cp312-win_amd64.whl
- Subject digest: 88312f3c1958ea7a3b2b69f8a42becd6ceb96f1b2d4233568db39470e170f396
- Sigstore transparency entry: 1170950208
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: tomdstanton/pyfgs@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Branch / Tag: refs/tags/v0.0.1-beta.1
- Owner: https://github.com/tomdstanton
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Trigger Event: push

File details

Details for the file pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Mar 24, 2026
Size: 2.1 MB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`cb62471edcb25d320aff3db19a288271ad42a319f1523430960421dd22b0d78f`
MD5	`6126d13d008515b77d1c0193fdd1697b`
BLAKE2b-256	`f460b4df32a016e8fae94b5677a6427fd9b932ebc7a3fa999b7056f6719fedb6`

Provenance

The following attestation bundles were made for pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: publish.yml on tomdstanton/pyfgs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyfgs-0.0.1b1-cp312-cp312-manylinux_2_34_x86_64.whl
- Subject digest: cb62471edcb25d320aff3db19a288271ad42a319f1523430960421dd22b0d78f
- Sigstore transparency entry: 1170950309
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: tomdstanton/pyfgs@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Branch / Tag: refs/tags/v0.0.1-beta.1
- Owner: https://github.com/tomdstanton
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@698864b2a4d02e59e3d1f3e1a877c66e851c3c2a
- Trigger Event: push

pyfgs 0.0.1b1
