GbkGen - GenBank File Generator

A comprehensive tool for converting DNA FASTA files to annotated GenBank format with automated gene prediction using Augustus. GbkGen provides a command-line interface for flexible genomic data processing.

Features

  • FASTA to GenBank Conversion: Convert DNA sequences from FASTA format to fully annotated GenBank files
  • Automated Gene Prediction: Integrated Augustus gene prediction with support for multiple species models
  • GFF File Support: Use existing GFF annotations or generate new ones with Augustus
  • Multiprocessing Support: Parallel processing for large datasets with configurable CPU cores
  • Multi-sequence Processing: Handle multiple DNA sequences in a single FASTA file
  • Species-Specific Models: Configurable Augustus species models for accurate gene prediction
  • Robust Error Handling: Comprehensive logging and error reporting
  • File Validation: Automatic validation of input files and compatibility checking
  • Temporary File Management: Automatic cleanup of intermediate files
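
As an illustration of the multi-sequence feature, the sketch below reads every record from a multi-sequence FASTA stream using only the standard library. It is not GbkGen's own reader (the project credits BioPython for sequence handling); the function name `read_fasta` is hypothetical and the parsing rules (ID = first token after `>`, sequence upper-cased) are assumptions made for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA-formatted text stream."""
    record_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line.upper())
    # Emit the final record once the stream is exhausted.
    if record_id is not None:
        yield record_id, "".join(chunks)


if __name__ == "__main__":
    demo = io.StringIO(">contig1 first contig\nACGT\nacgt\n>contig2\nGGCC\n")
    for rid, seq in read_fasta(demo):
        print(rid, len(seq))
```

Because the function is a generator, each record could be handed to a worker as soon as it is read, which is one plausible way to combine multi-sequence input with parallel processing.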

Installation

Prerequisites

  • Python 3.13 or higher
  • Augustus gene prediction tool (installed and available in PATH)
  • pip or uv package manager

Using UV (Recommended)

# Clone the repository
git clone https://github.com/darrengao628/genebank_file_generater
cd genebank_file_generater

# Install with uv
uv sync

Using pip

# Clone the repository
git clone https://github.com/darrengao628/genebank_file_generater
cd genebank_file_generater

# Install dependencies
pip install -r genebank_file_generater/requirements.txt

Augustus Installation

Make sure Augustus is installed and available in your PATH:

# For Ubuntu/Debian
sudo apt-get install augustus

# For macOS with Homebrew
brew install augustus

# Or build from source
# Follow instructions at: http://bioinf.uni-greifswald.de/augustus/
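
To confirm that Augustus is reachable before running a conversion, a quick check along these lines may help. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` is hypothetical and is not part of GbkGen.

```python
import shutil
from typing import Optional


def find_executable(name: str = "augustus") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("augustus was not found on PATH; install it before running GbkGen")
    else:
        print(f"augustus found at {path}")
```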

Usage

Command Line Interface

Basic Usage

If installed from source:

# Convert FASTA to GenBank (automatically creates input.gbk)
python -m genebank_file_generater.genebank_generater input.fasta

# With custom output filename
python -m genebank_file_generater.genebank_generater input.fasta -o output.gbk

# With specific species model
python -m genebank_file_generater.genebank_generater input.fasta -s human

# Using multiple CPU cores for faster processing
python -m genebank_file_generater.genebank_generater input.fasta -c 8

If installed via pip:

# Convert FASTA to GenBank (automatically creates input.gbk)
gbkgen input.fasta

# With custom output filename
gbkgen input.fasta -o output.gbk

# With specific species model
gbkgen input.fasta -s human

# Using multiple CPU cores for faster processing
gbkgen input.fasta -c 8
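
When GbkGen is driven from a Python script, the same commands can be assembled programmatically. The sketch below only builds the argument list from the flags shown above (`-o`, `-s`, `-c`); the helper `build_gbkgen_command` is a hypothetical name for illustration, and the call that would actually launch the tool is left commented out.

```python
import subprocess
from typing import List, Optional


def build_gbkgen_command(
    fasta: str,
    output: Optional[str] = None,
    species: Optional[str] = None,
    cpu: Optional[int] = None,
) -> List[str]:
    """Build a gbkgen argument list; options left as None are omitted."""
    cmd = ["gbkgen", str(fasta)]
    if output is not None:
        cmd += ["-o", str(output)]
    if species is not None:
        cmd += ["-s", species]
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    return cmd


if __name__ == "__main__":
    cmd = build_gbkgen_command("input.fasta", species="human", cpu=8)
    print(" ".join(cmd))
    # Uncomment to run the tool (requires gbkgen and Augustus to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.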

Automatic GFF File Detection

The program automatically detects corresponding GFF files:

  • If input.fasta is provided, it looks for input.gff or input.gff3
  • If found, the GFF file is used automatically (no need for the -g flag)
  • Unless -o is given, the output filename is based on the input FASTA filename

# If 299.fa and 299.gff exist, this automatically uses 299.gff
python -m genebank_file_generater.genebank_generater 299.fa
# Creates 299.gbk as output

# Override automatic GFF detection with explicit GFF file
python -m genebank_file_generater.genebank_generater 299.fa -g custom.gff -o output.gbk
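
The detection rule described above can be sketched with `pathlib`. This is an illustration rather than GbkGen's actual code: the function names are hypothetical, and the order in which `.gff` and `.gff3` are tried is an assumption, since the text does not say which one wins when both exist.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_companion_gff(fasta: Path) -> Optional[Path]:
    """Return a GFF file sharing the FASTA file's base name, if one exists."""
    # Assumption: .gff is tried before .gff3.
    for suffix in (".gff", ".gff3"):
        candidate = fasta.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def default_output(fasta: Path) -> Path:
    """Derive the default GenBank output name, e.g. 299.fa -> 299.gbk."""
    return fasta.with_suffix(".gbk")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "299.fa"
        fasta.write_text(">contig1\nACGT\n")
        (Path(tmp) / "299.gff").write_text("##gff-version 3\n")
        print(find_companion_gff(fasta).name)
        print(default_output(fasta).name)
```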

Advanced Usage

# Use existing GFF file instead of running Augustus
gbkgen input.fasta -g annotations.gff -o output.gbk

# Specify custom working directory
gbkgen input.fasta -w /tmp/augustus -o output.gbk

# Full example with all options
gbkgen input.fasta \
  --output output.gbk \
  --species aspergillus_fumigatus \
  --workdir ./augustus_output \
  --cpu 4
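
Before supplying an existing GFF file with -g, it can be useful to check that its sequence IDs match the FASTA headers. The sketch below is one way to do that under stated assumptions: it relies on the GFF3 convention of nine tab-separated columns with the sequence ID in the first, and all function names are hypothetical. It does not reproduce GbkGen's own validation.

```python
from typing import Iterable, Set


def gff_seqids(gff_lines: Iterable[str]) -> Set[str]:
    """Collect the sequence IDs (column 1) from GFF3 feature lines."""
    seqids = set()
    for raw in gff_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.split("\t")
        if len(columns) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
        seqids.add(columns[0])
    return seqids


def fasta_ids(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect record IDs (first token after '>') from FASTA header lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def unmatched_seqids(gff_lines: Iterable[str], fasta_lines: Iterable[str]) -> Set[str]:
    """Return GFF sequence IDs that have no matching FASTA record."""
    return gff_seqids(gff_lines) - fasta_ids(fasta_lines)


if __name__ == "__main__":
    gff = [
        "##gff-version 3\n",
        "contig1\tAUGUSTUS\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "contig9\tAUGUSTUS\tgene\t5\t50\t.\t-\t.\tID=g2\n",
    ]
    fasta = [">contig1 example\n", "ACGT\n"]
    print(sorted(unmatched_seqids(gff, fasta)))
```

An empty result means every annotated sequence has a FASTA record to attach its features to.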

Command Line Options

Option      Short   Description                        Default
input               Input DNA FASTA file (required)
--output    -o      Output GenBank file                input.gbk
--species   -s      Augustus species model             aspergillus_fumigatus
--workdir   -w      Working directory for Augustus     ./augustus_output
--gff       -g      Pre-existing GFF3 file             None
--cpu       -c      Number of CPU cores                All available
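
For readers who want to see how such an interface could be declared, the options above map naturally onto `argparse`. The following is a sketch that mirrors the table, not GbkGen's actual parser; the function names are hypothetical, and resolving the default output name after parsing is an assumption about how "input.gbk" is derived.

```python
import argparse
import os
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the table, with the same defaults."""
    parser = argparse.ArgumentParser(
        prog="gbkgen",
        description="Convert a DNA FASTA file to annotated GenBank format.",
    )
    parser.add_argument("input", type=Path, help="Input DNA FASTA file")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Output GenBank file (default: input name with .gbk)")
    parser.add_argument("-s", "--species", default="aspergillus_fumigatus",
                        help="Augustus species model")
    parser.add_argument("-w", "--workdir", type=Path, default=Path("./augustus_output"),
                        help="Working directory for Augustus")
    parser.add_argument("-g", "--gff", type=Path, default=None,
                        help="Pre-existing GFF3 file")
    parser.add_argument("-c", "--cpu", type=int, default=os.cpu_count(),
                        help="Number of CPU cores (default: all available)")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and fill in the output name derived from the input."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        args.output = args.input.with_suffix(".gbk")
    return args


if __name__ == "__main__":
    print(parse_args(["input.fasta", "-s", "human", "-c", "8"]))
```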

Supported Species Models

GbkGen accepts any species model available in your Augustus installation. Models used in this document:

  • aspergillus_fumigatus - Aspergillus fumigatus (default)
  • human - used in the usage examples above

For a complete list, run:

augustus --species=help
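
To make the role of the species model concrete, here is a sketch of how an Augustus invocation might be assembled and run from Python. It uses Augustus's `--species=` and `--gff3=on` options and relies on Augustus writing predictions to standard output; whether GbkGen calls Augustus with exactly these options is an assumption, and both function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def augustus_command(fasta: str, species: str = "aspergillus_fumigatus") -> List[str]:
    """Build an Augustus command that requests GFF3-formatted predictions."""
    return ["augustus", f"--species={species}", "--gff3=on", str(fasta)]


def run_augustus(fasta: str, gff_out: str, species: str = "aspergillus_fumigatus") -> Path:
    """Run Augustus and save its standard output to `gff_out`."""
    out_path = Path(gff_out)
    with out_path.open("w") as handle:
        # check=True raises CalledProcessError if Augustus exits with an error.
        subprocess.run(augustus_command(fasta, species), stdout=handle, check=True)
    return out_path


if __name__ == "__main__":
    # Only prints the command; call run_augustus() to execute it.
    print(" ".join(augustus_command("input.fasta", species="human")))
```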

Project Structure

GbkGen/
├── README.md                           # Main project documentation
├── pyproject.toml                      # Project configuration
├── main.py                             # Simple entry point
├── claude.md                           # Technical analysis
├── genebank_file_generater/            # Core conversion library
│   ├── __init__.py
│   ├── genebank_generater.py          # Main conversion logic
│   ├── gff_parser.py                  # GFF file parsing
│   ├── record.py                      # Record and feature management
│   ├── pyproject.toml                 # Package configuration
│   ├── requirements.txt               # Dependencies
│   ├── README.md                      # Package documentation
│   └── ToDO.md                        # Development roadmap
└── augustus_output/                   # Default Augustus output directory

Getting Help

  • Check the Issues page
  • Review the ToDO.md for known limitations
  • Create a new issue with detailed error information

Changelog

Version 0.1.0

  • Initial release
  • Core FASTA to GenBank conversion functionality
  • Augustus integration with multiprocessing support
  • GFF file parsing and validation
  • Comprehensive error handling and logging
  • Package distribution support via PyPI
  • Simplified dependencies for easier installation

Acknowledgments

  • BioPython team for sequence handling libraries
  • Augustus team for gene prediction software
  • antiSMASH project for GFF parsing components

For more information, visit the project repository or contact the development team.
