Genome Format Converters

Author: Benjamin Narh-Madey

Affiliation: Hittinger Lab, Laboratory of Genetics, University of Wisconsin-Madison

Python: 3.6+ | License: MIT

A collection of Python scripts for converting common bioinformatics file formats. Each script follows a simple, uniform interface: you point it to an input directory, and it writes converted files to an output directory.

Features

  • Uniform interface: all scripts accept --input-dir and --output-dir arguments.
  • Batch processing: convert all files of a given type in a directory at once.
  • Lightweight: only requires a few well‑maintained Python libraries.
  • Tested: each script has been run on the small example datasets in tests/test_data/.

Installation

Clone the repository:

    git clone https://github.com/K-nie/genome-format-converters.git
    cd genome-format-converters

Install the required dependencies:

    pip install -r requirements.txt

Note: Scripts that work with BAM or VCF files also need pysam (included in requirements.txt). For BLAST tabular conversion, BLAST+ must be installed separately (optional – only if you generate the input files yourself).

Usage

All scripts are used in the same way. After installing the package (pip install genome-format-converters), run the tool from the command line with the gfc command followed by a subcommand. The general syntax is:

    gfc SUBCOMMAND --input-dir INPUT_DIR --output-dir OUTPUT_DIR [options]

The input directory should contain the files you want to convert. The output directory is created if it doesn't exist. Each subcommand processes all files with recognised extensions in the input directory.
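The directory-in, directory-out behaviour described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the package's actual code: the function name convert_directory, its parameters, and the rule for naming output files are assumptions made for this example.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def convert_directory(
    input_dir: str,
    output_dir: str,
    extensions: Iterable[str],
    out_suffix: str,
    convert_file: Callable[[Path, Path], None],
) -> List[Path]:
    """Apply convert_file to every file in input_dir with a recognised extension.

    Hypothetical helper: the real gfc subcommands may name outputs differently.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    # Create the output directory (and any parents) if it does not exist yet.
    out_path.mkdir(parents=True, exist_ok=True)

    recognised = {ext.lower() for ext in extensions}
    written = []
    for src in sorted(in_path.iterdir()):
        # Skip subdirectories and files whose extension is not recognised.
        if not src.is_file() or src.suffix.lower() not in recognised:
            continue
        dst = out_path / (src.stem + out_suffix)
        convert_file(src, dst)
        written.append(dst)
    return written
```

A converter for a single file then only has to read one path and write another; the loop above handles discovery and output placement.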

Getting help

Run

    gfc --help

to see all available subcommands, or

    gfc SUBCOMMAND --help

for the detailed options of a single subcommand. For example, to get help for gff3-to-gtf:

    gfc gff3-to-gtf --help
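For readers curious how a command with per-subcommand help is typically wired up, here is a minimal argparse sketch. It is an assumption about the general approach, not the package's real entry point; build_parser and the two subcommands registered here are chosen only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a gfc-like parser with shared --input-dir / --output-dir options."""
    parser = argparse.ArgumentParser(
        prog="gfc", description="Convert bioinformatics file formats."
    )
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # a subcommand must always be given

    # Options shared by every directory-based converter.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--input-dir", required=True, help="directory with input files")
    common.add_argument("--output-dir", required=True, help="directory for converted files")

    # Only two subcommands are registered in this sketch.
    for name, summary in [
        ("gff3-to-gtf", "Convert GFF3 to GTF"),
        ("gff3-to-bed", "Convert GFF3 to 6-column BED"),
    ]:
        subparsers.add_parser(name, parents=[common], help=summary)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.input_dir, args.output_dir)
```

With this layout, `--help` at the top level lists the subcommands, and `--help` after a subcommand name lists that subcommand's options.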

Command Reference

Annotation Format Conversions

| Subcommand | Description | Example |
|---|---|---|
| gff3-to-gtf | Convert GFF3 to GTF | gfc gff3-to-gtf --input-dir ./gff_files --output-dir ./gtf_output |
| gff3-to-bed | Convert GFF3 to 6-column BED | gfc gff3-to-bed --input-dir ./gff_files --output-dir ./bed_output |
| genbank-to-gff3 | Convert GenBank to GFF3 | gfc genbank-to-gff3 --input-dir ./gbk_files --output-dir ./gff3_output |
| gff3-to-table | Convert GFF3 to tab-separated feature table | gfc gff3-to-table --input-dir ./gff_files --output-dir ./table_output |
| gff3-to-protein | Extract protein sequences from GFF3 + FASTA | gfc gff3-to-protein --input-dir ./data --output-dir ./proteins |
| fasta-gff-to-gbk | Convert paired FASTA and GFF3 files to GenBank | gfc fasta-gff-to-gbk --input-dir ./data --output-dir ./gbk_output |
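The main subtlety in a GFF3 to BED conversion is the coordinate system: GFF3 uses 1-based, inclusive coordinates, while BED uses 0-based, half-open ones, so the start is shifted by one and the end is kept. The sketch below shows that step for a single line. It is not the gff3-to-bed implementation; the function name, the choice of the ID attribute as the BED name, and the use of 0 for a missing score are assumptions.

```python
from typing import Optional


def gff3_line_to_bed6(line: str) -> Optional[str]:
    """Convert one GFF3 feature line to a BED6 line (hypothetical helper)."""
    if not line.strip() or line.startswith("#"):
        return None  # skip blank lines, comments, and directives
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None  # not a well-formed GFF3 feature line
    seqid, _source, ftype, start, end, score, strand, _phase, attrs = fields

    # Parse "key=value;key=value" attributes into a dictionary.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
    name = attributes.get("ID", ftype)  # assumption: fall back to the feature type

    bed_start = int(start) - 1  # 1-based inclusive -> 0-based half-open
    bed_score = "0" if score == "." else score  # assumption: 0 for missing scores
    bed_strand = strand if strand in ("+", "-") else "."
    return "\t".join([seqid, str(bed_start), end, name, bed_score, bed_strand])
```

For example, a gene spanning positions 100 to 200 in GFF3 becomes the BED interval 99 to 200.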

Sequence Format Conversions

| Subcommand | Description | Example |
|---|---|---|
| fasta-to-fastq | FASTA → FASTQ with default quality (I) | gfc fasta-to-fastq --input-dir ./fasta --output-dir ./fastq |
| fastq-to-fasta | FASTQ → FASTA (drop qualities) | gfc fastq-to-fasta --input-dir ./fastq --output-dir ./fasta |
| fasta-qual-to-fastq | Combine FASTA + QUAL into FASTQ | gfc fasta-qual-to-fastq --input-dir ./data --output-dir ./fastq |
| fastq-to-fasta-qual | Split FASTQ into FASTA and QUAL | gfc fastq-to-fasta-qual --input-dir ./fastq --output-dir ./split |
| fasta-to-table | FASTA → two-column TSV (id, sequence) | gfc fasta-to-table --input-dir ./fasta --output-dir ./tables |
| convert-alignment | Convert alignment formats (fasta, phylip, nexus, clustal) | gfc convert-alignment --input-dir ./aln --output-dir ./phylip --in-format fasta --out-format phylip |
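FASTA files carry no base qualities, so a FASTA to FASTQ conversion has to invent one value per base. The character I corresponds to Phred quality 40 in the common Phred+33 encoding. The following sketch shows the idea with the standard library only; read_fasta and fasta_to_fastq are illustrative names, not functions from this package.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(lines: Iterable[str], quality_char: str = "I") -> str:
    """Return FASTQ text in which every base gets the same placeholder quality."""
    records = []
    for header, seq in read_fasta(lines):
        # Four lines per record: @header, sequence, '+', quality string.
        records.append(f"@{header}\n{seq}\n+\n{quality_char * len(seq)}\n")
    return "".join(records)
```

Because the qualities are placeholders, downstream tools that weigh bases by quality will treat every base as equally reliable.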

Alignment / Mapping Results

| Subcommand | Description | Example |
|---|---|---|
| bam-to-bed | Convert BAM/SAM to BED6 | gfc bam-to-bed --input-dir ./bam_files --output-dir ./bed |
| blast-to-links | Convert BLAST tabular (outfmt 6) to link TSV | gfc blast-to-links --input-dir ./blast_results --output-dir ./links --min-length 100 --min-identity 30 |
| delta-to-tab | Convert MUMmer .delta to tabular coordinates | gfc delta-to-tab --input-dir ./delta_files --output-dir ./tables |
| maf-to-xmfa | Convert MAF to XMFA (progressiveMauve format) | gfc maf-to-xmfa --input-dir ./maf_files --output-dir ./xmfa |
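To illustrate what length and identity thresholds such as --min-length and --min-identity do, the sketch below filters BLAST outfmt 6 lines using the default 12-column layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the inclusive thresholds, and the shape of the returned link tuples are assumptions; the actual link TSV written by blast-to-links may differ.

```python
from typing import Iterable, Iterator, Tuple

Link = Tuple[str, int, int, str, int, int]


def filter_blast_hits(
    lines: Iterable[str], min_length: int = 100, min_identity: float = 30.0
) -> Iterator[Link]:
    """Yield (query, qstart, qend, subject, sstart, send) for hits passing both cutoffs."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a default outfmt 6 row
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = (int(v) for v in fields[6:10])
        # Assumption: a hit is kept when it meets or exceeds both thresholds.
        if length < min_length or pident < min_identity:
            continue
        yield (qseqid, qstart, qend, sseqid, sstart, send)
```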

Variant Formats (VCF)

| Subcommand | Description | Example |
|---|---|---|
| vcf-to-bed | Convert VCF to BED intervals | gfc vcf-to-bed --input-dir ./vcf_files --output-dir ./bed |
| vcf-to-table | Convert VCF to tab-separated table | gfc vcf-to-table --input-dir ./vcf_files --output-dir ./tables |
| vcf-to-consensus | Create consensus FASTA from VCF + reference | gfc vcf-to-consensus --input-dir ./data --output-dir ./consensus |
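As with GFF3, turning a VCF record into a BED interval is mostly a coordinate shift: VCF POS is 1-based, and the interval covered on the reference is as long as the REF allele. A minimal standard-library sketch follows. The package itself uses pysam for VCF files, so this is not its implementation, and the function name and the fallback used for the name column are assumptions.

```python
from typing import Optional


def vcf_line_to_bed(line: str) -> Optional[str]:
    """Convert one VCF data line to a 4-column BED line (hypothetical helper)."""
    if not line.strip() or line.startswith("#"):
        return None  # skip blank lines and header lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return None  # CHROM, POS, ID, REF, ALT are required
    chrom, pos, vid, ref = fields[0], int(fields[1]), fields[2], fields[3]

    start = pos - 1          # 1-based POS -> 0-based BED start
    end = start + len(ref)   # the interval spans the REF allele
    name = vid if vid != "." else f"{chrom}:{pos}"  # assumption: fallback label
    return f"{chrom}\t{start}\t{end}\t{name}"
```

A single-base substitution at position 100 therefore becomes the interval 99 to 100, and a deletion whose REF allele is three bases long covers three reference positions.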

Phylogenetic Tree Formats

| Subcommand | Description | Example |
|---|---|---|
| tree-convert | Convert tree formats (newick, nexus, phyloxml) | gfc tree-convert --input-dir ./trees --output-dir ./converted --in-format newick --out-format nexus |
| annotate-tree | Add alignment sequences to tree (output NEXUS) | gfc annotate-tree --tree tree.nwk --aln alignment.fasta --output annotated.nex |

Testing

All scripts have been tested on small example datasets located in the tests/test_data/ directory. These test files cover the basic functionality of each converter. To run the tests yourself, install the package in development mode (pip install -e .) and execute the example commands from the Command Reference using the provided test data.
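After running a converter on the test data, a quick smoke check is to confirm that each recognised input file produced a non-empty output file. The helper below is a sketch of such a check. It assumes outputs keep the input file's stem and only change the extension, which may not hold for every subcommand, and the name missing_outputs is made up for this example.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(
    input_dir: str, output_dir: str, in_exts: Iterable[str], out_ext: str
) -> List[str]:
    """Return expected output file names that are absent or empty."""
    exts = {e.lower() for e in in_exts}
    out = Path(output_dir)
    missing = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() not in exts:
            continue  # ignore files the converter would not process
        # Assumption: the output keeps the input stem with a new extension.
        expected = out / (src.stem + out_ext)
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(expected.name)
    return missing
```

An empty list means every input has a matching, non-empty output; it does not verify that the converted content is correct.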

License

This project is licensed under the MIT License – see the LICENSE file for details.

Contributing

Contributions are welcome! If you have a new converter or an improvement, please open an issue or submit a pull request.
