A comprehensive tool for converting DNA FASTA files to annotated GenBank format with automated gene prediction using Augustus

GbkGen - GenBank File Generator

A comprehensive tool for converting DNA FASTA files to annotated GenBank format with automated gene prediction using Augustus. GbkGen provides a command-line interface for flexible genomic data processing.

Features

  • FASTA to GenBank Conversion: Convert DNA sequences from FASTA format to fully annotated GenBank files
  • Automated Gene Prediction: Integrated Augustus gene prediction with support for multiple species models
  • GFF File Support: Use existing GFF annotations or generate new ones with Augustus
  • Multiprocessing Support: Parallel processing for large datasets with configurable CPU cores
  • Multi-sequence Processing: Handle multiple DNA sequences in a single FASTA file
  • Species-Specific Models: Configurable Augustus species models for accurate gene prediction
  • Robust Error Handling: Comprehensive logging and error reporting
  • File Validation: Automatic validation of input files and compatibility checking
  • Temporary File Management: Automatic cleanup of intermediate files

Installation

Prerequisites

  • Python 3.13 or higher
  • Augustus gene prediction tool (installed and available in PATH)
  • pip or uv package manager

Using UV (Recommended)

# Clone the repository
git clone https://github.com/darrengao628/genebank_file_generater
cd genebank_file_generater

# Install with uv
uv sync

Using pip

# Clone the repository
git clone https://github.com/darrengao628/genebank_file_generater
cd genebank_file_generater

# Install dependencies
pip install -r genebank_file_generater/requirements.txt

Augustus Installation

Make sure Augustus is installed and available in your PATH:

# For Ubuntu/Debian
sudo apt-get install augustus

# For macOS with Homebrew
brew install augustus

# Or build from source
# Follow instructions at: http://bioinf.uni-greifswald.de/augustus/
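The PATH requirement above can be checked from Python before running GbkGen. This is a minimal sketch using only the standard library; the helper name `find_augustus` is hypothetical and is not part of GbkGen.

```python
import shutil
from typing import Optional


def find_augustus(executable: str = "augustus") -> Optional[str]:
    """Return the full path of the Augustus binary, or None if it is not on PATH.

    Hypothetical helper for illustration; GbkGen may perform its own check.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_augustus()
    if path is None:
        print("augustus not found on PATH; install it before running GbkGen")
    else:
        print(f"augustus found at {path}")
```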

Usage

Command Line Interface

Basic Usage

If installed from source:

# Convert FASTA to GenBank (automatically creates input.gbk)
python -m genebank_file_generater.genebank_generater input.fasta

# With custom output filename
python -m genebank_file_generater.genebank_generater input.fasta -o output.gbk

# With specific species model
python -m genebank_file_generater.genebank_generater input.fasta -s human

# Using multiple CPU cores for faster processing
python -m genebank_file_generater.genebank_generater input.fasta -c 8

If installed via pip:

# Convert FASTA to GenBank (automatically creates input.gbk)
gbkgen input.fasta

# With custom output filename
gbkgen input.fasta -o output.gbk

# With specific species model
gbkgen input.fasta -s human

# Using multiple CPU cores for faster processing
gbkgen input.fasta -c 8
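For batch work it can be convenient to drive the CLI from a script. The sketch below only assembles and runs the `gbkgen` commands shown above; it assumes `gbkgen` is on PATH, and the function names `build_gbkgen_command` and `run_gbkgen` are hypothetical wrappers, not part of the package.

```python
import subprocess
from typing import List, Optional


def build_gbkgen_command(
    fasta: str,
    output: Optional[str] = None,
    species: Optional[str] = None,
    cpu: Optional[int] = None,
) -> List[str]:
    """Assemble a gbkgen argument list from the documented -o, -s and -c flags."""
    cmd = ["gbkgen", str(fasta)]
    if output is not None:
        cmd += ["-o", str(output)]
    if species is not None:
        cmd += ["-s", species]
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    return cmd


def run_gbkgen(fasta: str, **options) -> int:
    """Run gbkgen and return its exit code (requires gbkgen on PATH)."""
    completed = subprocess.run(build_gbkgen_command(fasta, **options), check=False)
    return completed.returncode
```

For example, `build_gbkgen_command("input.fasta", species="human", cpu=8)` yields the same arguments as `gbkgen input.fasta -s human -c 8`.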

Automatic GFF File Detection

The program automatically detects corresponding GFF files:

  • If input.fasta is provided, it looks for input.gff or input.gff3
  • If found, the GFF file is used automatically (no need for -g flag)
  • Unless -o is given, the output filename is derived from the input FASTA filename

# If 299.fa and 299.gff exist, this automatically uses 299.gff
python -m genebank_file_generater.genebank_generater 299.fa
# Creates 299.gbk as output

# Override automatic GFF detection with explicit GFF file
python -m genebank_file_generater.genebank_generater 299.fa -g custom.gff -o output.gbk
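The detection rule described in this section can be sketched as follows. This is an illustration, not GbkGen's actual code: it assumes the GFF file sits in the same directory as the FASTA file and that `.gff` is tried before `.gff3`; both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional, Union


def find_matching_gff(fasta_path: Union[str, Path]) -> Optional[Path]:
    """Look for <stem>.gff or <stem>.gff3 next to the FASTA file.

    Assumption: .gff takes priority over .gff3 when both exist.
    """
    fasta = Path(fasta_path)
    for suffix in (".gff", ".gff3"):
        candidate = fasta.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def default_output_path(fasta_path: Union[str, Path]) -> Path:
    """Derive the default GenBank filename, e.g. 299.fa -> 299.gbk."""
    return Path(fasta_path).with_suffix(".gbk")
```

When `find_matching_gff` returns `None`, the documented behaviour is to fall back to running Augustus.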

Advanced Usage

# Use existing GFF file instead of running Augustus
gbkgen input.fasta -g annotations.gff -o output.gbk

# Specify custom working directory
gbkgen input.fasta -w /tmp/augustus -o output.gbk

# Full example with all options
gbkgen input.fasta \
  --output output.gbk \
  --species aspergillus_fumigatus \
  --workdir ./augustus_output \
  --cpu 4

Command Line Options

| Option | Short | Description | Default |
|---|---|---|---|
| input | | Input DNA FASTA file (required) | |
| --output | -o | Output GenBank file | input.gbk |
| --species | -s | Augustus species model | aspergillus_fumigatus |
| --workdir | -w | Working directory for Augustus | ./augustus_output |
| --gff | -g | Pre-existing GFF3 file | None |
| --cpu | -c | Number of CPU cores | All available |
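To make the defaults concrete, here is a minimal `argparse` sketch that mirrors the options listed above. It is not GbkGen's actual parser: the help strings, the use of `os.cpu_count()` for "all available", and the way the output name is derived from the input are assumptions for illustration.

```python
import argparse
import os
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="gbkgen", description="Convert a DNA FASTA file to GenBank format"
    )
    parser.add_argument("input", help="Input DNA FASTA file")
    parser.add_argument("-o", "--output", default=None, help="Output GenBank file")
    parser.add_argument("-s", "--species", default="aspergillus_fumigatus",
                        help="Augustus species model")
    parser.add_argument("-w", "--workdir", default="./augustus_output",
                        help="Working directory for Augustus")
    parser.add_argument("-g", "--gff", default=None, help="Pre-existing GFF3 file")
    # Assumption: "All available" is taken to mean os.cpu_count().
    parser.add_argument("-c", "--cpu", type=int, default=os.cpu_count(),
                        help="Number of CPU cores")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse argv and fill in the output name derived from the input file."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        args.output = str(Path(args.input).with_suffix(".gbk"))
    return args
```

With this sketch, `parse_args(["input.fasta"])` resolves the output to `input.gbk` and the species to `aspergillus_fumigatus`, matching the defaults listed above.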

Supported Species Models

GbkGen supports all Augustus species models. Common models include:

  • aspergillus_fumigatus - Aspergillus fumigatus (default)

For a complete list, run:

augustus --species=help
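A script can also check for a species model without invoking Augustus. The sketch below assumes the standard Augustus layout, in which each model is a directory under `$AUGUSTUS_CONFIG_PATH/species/`; the function name `species_model_installed` is hypothetical, and installations that do not set `AUGUSTUS_CONFIG_PATH` will report `False` here even if the model exists.

```python
import os
from pathlib import Path
from typing import Optional


def species_model_installed(species: str, config_path: Optional[str] = None) -> bool:
    """Return True if <config_path>/species/<species>/ exists.

    Assumption: config_path defaults to the AUGUSTUS_CONFIG_PATH environment
    variable; if neither is available the function returns False.
    """
    config_path = config_path or os.environ.get("AUGUSTUS_CONFIG_PATH")
    if not config_path:
        return False
    return (Path(config_path) / "species" / species).is_dir()
```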

Project Structure

GbkGen/
├── README.md                           # Main project documentation
├── pyproject.toml                      # Project configuration
├── main.py                             # Simple entry point
├── claude.md                           # Technical analysis
├── genebank_file_generater/            # Core conversion library
│   ├── __init__.py
│   ├── genebank_generater.py          # Main conversion logic
│   ├── gff_parser.py                  # GFF file parsing
│   ├── record.py                      # Record and feature management
│   ├── pyproject.toml                 # Package configuration
│   ├── requirements.txt               # Dependencies
│   ├── README.md                      # Package documentation
│   └── ToDO.md                        # Development roadmap
└── augustus_output/                   # Default Augustus output directory



### Getting Help
- Check the [Issues](https://github.com/darrengao628/genebank_file_generater/issues) page
- Review the [ToDO.md](genebank_file_generater/ToDO.md) for known limitations
- Create a new issue with detailed error information


## Changelog

### Version 0.1.0
- Initial release
- Core FASTA to GenBank conversion functionality
- Augustus integration with multiprocessing support
- GFF file parsing and validation
- Comprehensive error handling and logging
- Package distribution support with PyPI
- Simplified dependencies for easier installation


## Acknowledgments

- **BioPython** team for sequence handling libraries
- **Augustus** team for gene prediction software
- **antiSMASH** project for GFF parsing components


---

For more information, visit the [project repository](https://github.com/darrengao628/genebank_file_generater) or contact the development team.
