
A collection of Python scripts for converting common bioinformatics file formats


Genome Format Converters

Author: Benjamin Narh-Madey

Affiliation: Hittinger Lab, Laboratory of Genetics, University of Wisconsin-Madison

Requires Python 3.6+. License: MIT.

A collection of Python scripts for converting common bioinformatics file formats. Each script follows a simple, uniform interface: you point it to an input directory, and it writes converted files to an output directory.


Features

  • Uniform interface: nearly all scripts accept --input-dir and --output-dir arguments (annotate-tree takes individual files instead).
  • Batch processing: convert all files of a given type in a directory at once.
  • Lightweight: only requires a few well‑maintained Python libraries.
  • Well tested: each script has been tested on small example datasets.

Installation

Clone the repository:

    git clone https://github.com/K-nie/genome-format-converters.git
    cd genome-format-converters

Install the required dependencies:

    pip install -r requirements.txt

Note: Scripts that work with BAM or VCF files also need pysam (included in requirements.txt). For BLAST tabular conversion, BLAST+ must be installed separately (optional; needed only if you generate the input files yourself).

Usage

All subcommands are used in the same way, apart from annotate-tree, which takes individual files. After installing the package (pip install genome-format-converters), run the tool from the command line with the gfc command followed by a subcommand. The general syntax is:

    gfc <subcommand> --input-dir INPUT_DIR --output-dir OUTPUT_DIR [options]

The input directory should contain the files you want to convert. The output directory is created if it doesn't exist. Each subcommand processes all files with recognised extensions in the input directory.
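The directory-in, directory-out behaviour described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function names find_inputs and plan_outputs are hypothetical, and the rule for naming output files (input stem plus new extension) is an assumption.

```python
from pathlib import Path


def find_inputs(input_dir, extensions):
    """Return the files in input_dir whose names end with a recognised extension."""
    exts = tuple(e.lower() for e in extensions)
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.name.lower().endswith(exts)
    )


def plan_outputs(input_dir, output_dir, extensions, out_ext):
    """Create output_dir if it is missing and pair each input with an output path.

    Assumption: the output name is the input stem plus out_ext.
    """
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    return [
        (src, out_root / (src.stem + out_ext))
        for src in find_inputs(input_dir, extensions)
    ]
```

For example, plan_outputs("./gff_files", "./gtf_output", [".gff3", ".gff"], ".gtf") would pair every GFF3 file in ./gff_files with a .gtf path in ./gtf_output and skip files with other extensions.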

Getting help

Run gfc --help to see all available subcommands, or gfc <subcommand> --help for detailed options. For example, to get help for the gff3-to-gtf subcommand:

    gfc gff3-to-gtf --help
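A command-plus-subcommand interface with two levels of help is commonly built with argparse subparsers. The sketch below shows the general pattern only; it is an assumption about how such a CLI could be wired, not the actual gfc implementation, and it registers a single subcommand for illustration.

```python
import argparse


def build_parser():
    """Build a toy parser with one subcommand (illustrative, not the real gfc)."""
    parser = argparse.ArgumentParser(
        prog="gfc", description="Genome format converters (sketch)"
    )
    subparsers = parser.add_subparsers(dest="subcommand")
    subparsers.required = True  # set after creation so this also works on Python 3.6

    # Each subcommand gets its own parser, and therefore its own --help output.
    sub = subparsers.add_parser("gff3-to-gtf", help="Convert GFF3 to GTF")
    sub.add_argument("--input-dir", required=True)
    sub.add_argument("--output-dir", required=True)
    return parser


args = build_parser().parse_args(
    ["gff3-to-gtf", "--input-dir", "./gff_files", "--output-dir", "./gtf_output"]
)
```

With this layout, the top-level --help lists the registered subcommands, while --help after a subcommand name lists that subcommand's own options.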

Command Reference

Annotation Format Conversions

Subcommand | Description | Example

  1. gff3-to-gtf | Convert GFF3 to GTF | gfc gff3-to-gtf --input-dir ./gff_files --output-dir ./gtf_output
  2. gff3-to-bed | Convert GFF3 to 6‑column BED | gfc gff3-to-bed --input-dir ./gff_files --output-dir ./bed_output
  3. genbank-to-gff3 | Convert GenBank to GFF3 | gfc genbank-to-gff3 --input-dir ./gbk_files --output-dir ./gff3_output
  4. gff3-to-table | Convert GFF3 to tab‑separated feature table | gfc gff3-to-table --input-dir ./gff_files --output-dir ./table_output
  5. gff3-to-protein | Extract protein sequences from GFF3 + FASTA | gfc gff3-to-protein --input-dir ./data --output-dir ./proteins
  6. fasta-gff-to-gbk | Convert paired FASTA and GFF3 files to GenBank | gfc fasta-gff-to-gbk --input-dir ./data --output-dir ./gbk_output
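The key step in a GFF3 to BED conversion is the coordinate shift: GFF3 uses 1-based, inclusive coordinates, whereas BED uses 0-based, half-open intervals, so the start is decremented and the end is kept. The sketch below illustrates this on a single line. The function name is hypothetical, and the choices of BED name (the ID attribute, falling back to the feature type) and score ("0") are assumptions, not necessarily what gff3-to-bed writes.

```python
def gff3_line_to_bed6(line):
    """Convert one GFF3 feature line to a BED6 line, or return None to skip it."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line, comment, or directive
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None  # not a well-formed feature line
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols

    # Parse "key=value;key=value" attributes into a dict.
    attr_map = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
    name = attr_map.get("ID", ftype)  # assumption: use ID, else the feature type

    bed_start = int(start) - 1  # 1-based inclusive -> 0-based half-open
    bed_strand = strand if strand in ("+", "-") else "."
    return "\t".join([seqid, str(bed_start), end, name, "0", bed_strand])
```

A gene spanning positions 100 to 200 in GFF3 therefore becomes the BED interval 99 to 200, which still covers 101 bases.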

Sequence Format Conversions

Subcommand | Description | Example

  1. fasta-to-fastq | FASTA → FASTQ with default quality (I) | gfc fasta-to-fastq --input-dir ./fasta --output-dir ./fastq
  2. fastq-to-fasta | FASTQ → FASTA (drop qualities) | gfc fastq-to-fasta --input-dir ./fastq --output-dir ./fasta
  3. fasta-qual-to-fastq | Combine FASTA + QUAL into FASTQ | gfc fasta-qual-to-fastq --input-dir ./data --output-dir ./fastq
  4. fastq-to-fasta-qual | Split FASTQ into FASTA and QUAL | gfc fastq-to-fasta-qual --input-dir ./fastq --output-dir ./split
  5. fasta-to-table | FASTA → two‑column TSV (id, sequence) | gfc fasta-to-table --input-dir ./fasta --output-dir ./tables
  6. convert-alignment | Convert alignment formats (fasta, phylip, nexus, clustal) | gfc convert-alignment --input-dir ./aln --output-dir ./phylip --in-format fasta --out-format phylip
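Because FASTA carries no quality information, a FASTA to FASTQ conversion has to invent a quality string. The default character I corresponds to Phred score 40 in the common Phred+33 encoding (ASCII 73 minus 33). The following standard-library sketch shows the idea; the function names are hypothetical and the real fasta-to-fastq script may be implemented differently.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA text, joining wrapped lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(text, quality_char="I"):
    """Return FASTQ text in which every base gets the same placeholder quality."""
    records = []
    for header, seq in parse_fasta(text):
        # Four lines per record: @header, sequence, '+', quality string.
        records.append("@{}\n{}\n+\n{}\n".format(header, seq, quality_char * len(seq)))
    return "".join(records)
```

The resulting qualities are placeholders only; they satisfy tools that require FASTQ input but say nothing about real base-call accuracy.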

Alignment / Mapping Results

Subcommand | Description | Example

  1. bam-to-bed | Convert BAM/SAM to BED6 | gfc bam-to-bed --input-dir ./bam_files --output-dir ./bed
  2. blast-to-links | Convert BLAST tabular (outfmt 6) to link TSV | gfc blast-to-links --input-dir ./blast_results --output-dir ./links --min-length 100 --min-identity 30
  3. delta-to-tab | Convert MUMmer .delta to tabular coordinates | gfc delta-to-tab --input-dir ./delta_files --output-dir ./tables
  4. maf-to-xmfa | Convert MAF to XMFA (progressiveMauve format) | gfc maf-to-xmfa --input-dir ./maf_files --output-dir ./xmfa
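The --min-length and --min-identity options of blast-to-links suggest a filter over the default BLAST outfmt 6 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). A minimal sketch of such a filter is shown below. The function name, the layout of the returned link tuples, and the treatment of thresholds as inclusive lower bounds are all assumptions rather than the tool's documented behaviour.

```python
def blast_to_links(lines, min_length=100, min_identity=30.0):
    """Keep hits that pass both thresholds and return them as link tuples.

    Each link is (query, qstart, qend, subject, sstart, send); this layout is
    hypothetical and chosen only for illustration.
    """
    links = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a default outfmt 6 row
        pident = float(fields[2])  # percent identity
        length = int(fields[3])    # alignment length
        if length < min_length or pident < min_identity:
            continue
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        links.append((fields[0], qstart, qend, fields[1], sstart, send))
    return links
```

Raising either threshold shrinks the link set, which is usually desirable before plotting synteny or similarity links.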

Variant Formats (VCF)

Subcommand | Description | Example

  1. vcf-to-bed | Convert VCF to BED intervals | gfc vcf-to-bed --input-dir ./vcf_files --output-dir ./bed
  2. vcf-to-table | Convert VCF to tab‑separated table | gfc vcf-to-table --input-dir ./vcf_files --output-dir ./tables
  3. vcf-to-consensus | Create consensus FASTA from VCF + reference | gfc vcf-to-consensus --input-dir ./data --output-dir ./consensus
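A VCF record gives a 1-based position, so the corresponding 1-bp BED interval runs from POS - 1 to POS. The sketch below illustrates this on uncompressed VCF text using only the standard library; the packaged converter relies on pysam and also handles .vcf.gz and .bcf, which this sketch does not. The function name and the fallback name column (chrom:pos when the ID is missing) are assumptions.

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line to a 1-bp BED line, or return None to skip it."""
    if not line.strip() or line.startswith("#"):
        return None  # header or blank line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return None  # CHROM, POS, ID, REF, ALT are the minimum expected here
    chrom, pos, vid = fields[0], int(fields[1]), fields[2]
    # Assumption: name the interval by the VCF ID, or chrom:pos if the ID is ".".
    name = vid if vid != "." else "{}:{}".format(chrom, pos)
    return "\t".join([chrom, str(pos - 1), str(pos), name])
```

A variant at position 100 thus maps to the BED interval 99 to 100.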

Phylogenetic Tree Formats

Subcommand | Description | Example

  1. tree-convert | Convert tree formats (newick, nexus, phyloxml) | gfc tree-convert --input-dir ./trees --output-dir ./converted --in-format newick --out-format nexus
  2. annotate-tree | Add alignment sequences to tree (output NEXUS) | gfc annotate-tree --tree tree.nwk --aln alignment.fasta --output annotated.nex

Scripts Overview

The underlying Python scripts are located in src/genome_format_converters/converters/. Each script can also be run independently (though the gfc interface is recommended). Below is a quick reference of the scripts and their input/output formats.

Script | Description | Input extensions | Output extension

  1. gff3_to_gtf.py | GFF3 → GTF | .gff3, .gff | .gtf
  2. gff3_to_bed.py | GFF3 → 6‑column BED | .gff3, .gff | .bed
  3. genbank_to_gff3.py | GenBank → GFF3 | .gbk, .gb | .gff3
  4. gff3_to_table.py | GFF3 → tab‑separated feature table | .gff3, .gff | .tsv
  5. gff3_to_protein.py | GFF3 + FASTA → protein FASTA | .gff3/.gff + .fasta/.fa | .faa
  6. fasta_to_fastq.py | FASTA → FASTQ (with default quality) | .fasta, .fa, .fna, .fas | .fastq
  7. fastq_to_fasta.py | FASTQ → FASTA | .fastq, .fq | .fasta
  8. fasta_qual_to_fastq.py | Combine FASTA + QUAL → FASTQ | .fasta/.fa + .qual | .fastq
  9. fastq_to_fasta_qual.py | Split FASTQ → FASTA + QUAL | .fastq, .fq | .fasta, .qual
  10. convert_alignment.py | Alignment format converter (FASTA, PHYLIP, NEXUS, CLUSTAL) | any alignment file | user‑specified
  11. fasta_to_table.py | FASTA → two‑column TSV (ID, sequence) | .fasta, .fa, .fna, .fas | .tsv
  12. bam_to_bed.py | BAM/SAM → BED6 | .bam, .sam | .bed
  13. blast_tab_to_links.py | BLAST tabular (outfmt 6) → simplified link TSV | .tab | .links.tsv
  14. delta_to_tab.py | MUMmer .delta → tabular alignment coordinates | .delta | .tsv
  15. maf_to_xmfa.py | MAF → XMFA (progressiveMauve format) | .maf | .xmfa
  16. vcf_to_bed.py | VCF/BCF → 1‑bp BED intervals | .vcf, .vcf.gz, .bcf | .bed
  17. vcf_to_table.py | VCF/BCF → tab‑separated table (TSV) | .vcf, .vcf.gz, .bcf | .tsv
  18. vcf_to_consensus.py | VCF + reference FASTA → consensus FASTA per sample | .vcf/.vcf.gz + .fasta | .fa
  19. tree_convert.py | Newick ↔ NEXUS ↔ PhyloXML | .nwk, .nex, .xml | user‑specified
  20. annotate_tree.py | Add alignment sequences to tree (NEXUS output) | .nwk + .fasta (aligned) | .nex
  21. convert_all_gff_fasta_to_gbk.py | FASTA + GFF → GenBank | .fasta/.fa + .gff3/.gff | .gbk

Testing

All scripts have been tested on small example datasets located in the tests/test_data/ directory. These test files cover the basic functionality of each converter. To run the tests yourself, install the package in development mode (pip install -e .) and execute the example commands from the Command Reference using the provided test data.
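One way to script such a manual check is to assemble a Command Reference example programmatically and run it only when gfc is on the PATH. The helpers below are hypothetical and not part of the package, and the exact layout of files under tests/test_data/ is not documented here, so the input path in the usage comment is an assumption.

```python
import shutil
import subprocess


def gfc_command(subcommand, input_dir, output_dir, *extra):
    """Assemble a gfc command line following the general syntax."""
    return [
        "gfc", subcommand,
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        *extra,
    ]


def run_if_available(cmd):
    """Run cmd and return its exit code, or None if the executable is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


# Example usage (assumed input path):
# run_if_available(gfc_command("gff3-to-gtf", "tests/test_data", "./gtf_output"))
```

An exit code of 0 together with the expected files in the output directory would indicate that the converter ran on the sample data.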

License

This project is licensed under the MIT License – see the LICENSE file for details.

Contributing

Contributions are welcome! If you have a new converter or an improvement, please open an issue or submit a pull request.
