Skip to main content

Gene expression quantification with a pan-transcriptomic gapped k-mer index

Project description

PanXpress: Gene Expression Quantification with a Pan-Transcriptomic Gapped K-mer Index

PanXpress is a unified framework for bacterial pan-transcriptomics that:

  • Corrects ambiguous annotations in GFF files using a two-step algorithm
  • Constructs a pan-transcriptomic reference FASTA file from genomic FASTA and corrected GFF annotation files
  • Builds a gapped k-mer index over the pan-transcriptomic reference
  • Supports alignment-free mapping of reads to genes from FASTQ samples
  • Quantifies gene expression across strains

In case of problems, please file an issue in the issue tracker.

See CHANGELOG.md for recent changes.


Usage Guide

To use PanXpress you will need the following data and files:

  • GFF annotation files and genome FASTA files for the bacterial strains to be included in the reference index. For a given strain, the annotation and genome files must share the same filename (differing only in their extension).
  • FASTQ samples for mapping.

Additionally, for the correction of annotation files where similar proteins are grouped into gene groups, protein FASTA files can be provided by the user. If these are not available, they can be generated by PanXpress using the agat tool.

The typical workflow proceeds in five steps: (1) correct the GFF annotation files, (2) build the pan-transcriptomic reference, (3) construct the index, (4) map reads, and (5) convert the raw reads counts to transcripts per million.


Installation Guide

To run the software, a conda environment with the required libraries needs to be created. A list of needed libraries is provided in the environment.yml file in the repository. You can create the environment with the following command:

conda env create

An environment with the name panxpress will be created. To activate the environment and install the package from the repository, run:

conda activate panxpress
pip install -e .

Examples

To better understand how to run PanXpress, we included into this repository a folder reads with a few simulated reads from a mixture of 3 strains of pseudomonas aerugionosa (both single and paired end) and a folder ref, that contains pa_2_strains, with the GFF annotation files, the genome and protein FASTA files for 2 strains of pseudomonas aerugionsa. With will use these files to exemplify how to use PanXpress.


Step 1 — Correcting the Annotation Files

We recommend applying a correction algorithm to the annotation GFF files using the panxpress correct_gff command.

Proteins are grouped into gene groups via a two-step process:

  1. Jaccard filtering: The Jaccard similarity is computed for all pairs of proteins across bacterial strains. Pairs with a similarity score above a threshold t1 proceed to the next step.
  2. Alignment filtering: Surviving pairs are aligned. Pairs with a normalized alignment score above a threshold t2 are grouped into the same gene group and assigned a shared gene group name. This information is then used to rewrite the annotation files.

Arguments

  • --input_gff_folder | Path to the folder containing the input annotation .gff files.
  • --input_normalized_gff_folder | Path to a folder where normalized GFF files will be written. Two corrections are applied: (1) each protein ID is given a single consistent name across strains; (2) genes from plasmids are appended with the suffix "Plasmid" to distinguish them from chromosomal genes in expression results.
  • --output_gff_folder | Path to the folder where corrected annotation .gff files will be stored. Hypothetical proteins are assigned the gene name "unnamed".
  • --t1 | Similarity threshold for step 1 (Jaccard filtering).
  • --t2 | Score threshold for step 2 (alignment filtering).
  • --threads | Number of threads for parallelized Jaccard similarity calculation and protein pair alignments.
  • --output_plot | Path prefix for plots of the alignment score distribution.
  • --output_folder_data | Path prefix for output data files (Jaccard similarity values, alignment scores, group names, etc.).
  • --input_protein_folder | Path to a folder containing protein .faa files. If not available, provide the two arguments below instead.
  • --input_genome_folder | (Alternative to --input_protein_folder) Path to a folder containing genome .fna files, used to generate protein files.
  • --output_protein_folder | (Alternative to --input_protein_folder) Path to the folder where generated protein .faa files will be stored.

How to run (general input)

With protein FASTA files provided:

panxpress correct_gff \
  --input_protein_folder <folder_with_proteins> \
  --input_gff_folder <folder_with_annotations> \
  --input_normalized_gff_folder <folder_for_normalized_annotations> \
  --output_gff_folder <folder_for_corrected_annotations> \
  --k 7 \
  --t1 0.02 \
  --t2 0.75 \
  --threads 8 \
  --output_folder_data <filename_prefix_for_data>

Without protein FASTA files (generate them from genomes):

panxpress correct_gff \
  --input_genome_folder <folder_with_genomes> \
  --output_protein_folder <folder_with_proteins> \
  --input_gff_folder <folder_with_annotations> \
  --input_normalized_gff_folder <folder_for_normalized_annotations> \
  --output_gff_folder <folder_for_corrected_annotations> \
  --k 7 \
  --t1 0.02 \
  --t2 0.75 \
  --threads 8 \
  --output_folder_data <filename_prefix_for_data>

How to run (provided files)

With protein FASTA files provided:

panxpress correct_gff \
  --input_protein_folder ref/pa_2_strains/protein_ncbi \
  --input_gff_folder ref/pa_2_strains/gff3 \
  --input_normalized_gff_folder ref/pa_2_strains/gff3_normalized \
  --output_gff_folder ref/pa_2_strains/gff3_corrected \
  --k 7 \
  --t1 0.02 \
  --t2 0.75 \
  --threads 8 \
  --output_folder_data ref/pa_2_strains

Without protein FASTA files (generate them from genomes):

panxpress correct_gff \
  --input_genome_folder ref/pa_2_strains/genomes \
  --output_protein_folder ref/pa_2_strains/protein_agat \
  --input_gff_folder ref/pa_2_strains/gff3 \
  --input_normalized_gff_folder ref/pa_2_strains/gff3_normalized \
  --output_gff_folder ref/pa_2_strains/gff3_corrected \
  --k 7 \
  --t1 0.02 \
  --t2 0.75 \
  --threads 8 \
  --output_folder_data ref/pa_2_strains

Step 2 — Building the Reference

The pan-transcriptomic reference is built using the panxpress build_reference command.

The reference is a FASTA file where each header is a unique gene identifier and each entry corresponds to one occurrence of a given gene name in a given strain. Because many gene names are shared across strains, multiple entries may share the same header. Since some tools (e.g., Bowtie2 and Salmon) do not support FASTA files with duplicate headers, PanXpress also generates a version of the reference with fully unique IDs.

Arguments

  • --annotation_dir | Path to the folder containing the corrected annotation .gff files.
  • --genomes_dir | Path to the folder containing the genome .fna files.
  • --output_reference_file | Output FASTA file.
  • --output_reference_file_unique | Output FASTA file — unique headers variant.
  • --output_reference_file_genes | Output FASTA file containing only named genes (hypothetical proteins excluded).
  • --output_reference_file_unique_genes | Output FASTA file containing only named genes — unique headers variant.
  • --output_aux_files | Path prefix for auxiliary data files (e.g., mapping of gene names to gene IDs used during the mapping step).

How to run (general input)

panxpress build_reference \
  --annotation_dir <folder_with_annotations> \
  --genomes_dir <folder_with_genomes> \
  --output_reference_file <filename>.fna \
  --output_reference_file_unique <filename>.fna \
  --output_reference_file_genes <filename>.fna \
  --output_reference_file_unique_genes <filename>.fna \
  --output_aux_files <path_prefix_for_auxiliary_data>

How to run (provided files)

panxpress build_reference \
  --annotation_dir ref/pa_2_strains/gff3_corrected \
  --genomes_dir ref/pa_2_strains/genomes \
  --output_reference_file ref/spliced_genomes/pantranscriptome_2_strains.fna \
  --output_reference_file_unique ref/spliced_genomes/pantranscriptome_unique_headers_2_strains.fna \
  --output_reference_file_genes ref/spliced_genomes/pantranscriptome_genes_2_strains.fna \
  --output_reference_file_unique_genes ref/spliced_genomes/pantranscriptome_genes_unique_headers_2_strains.fna \
  --output_aux_files ref/pa_2_strains/pa_2_strains > ref/build_reference_2_strains.log

Step 3 — Index Construction

The index is built using the panxpress index command.

The index is backed by a cuckoo hash table that stores a mapping of gapped k-mers to the genes in which they appear. The number of genes tracked per k-mer is controlled by the --colorset-size parameter. If you want PanXpress to maximize the number of colors within memory constraints, provide the total number of genes in the pan-transcriptomic reference via --ngenes and the maximum color set size will be calculated automatically.

Arguments

  • --genes | FASTA file of the reference.
  • --index | Output path prefix for the resulting index.
  • -n | Estimated number of k-mers.
  • --ngenes | Number of unique gene IDs.
  • --colorset-size | Number of genes tracked per k-mer entry.
  • --mask | Mask pattern used to compute gapped k-mers (e.g., - "####_###_####_#__#__#_####_###_####").
  • --k | K-mer length.
  • --fill | Fill rate parameter for the cuckoo hash table.

Tip: For pseudomonas aeruginosa, 4 is a good number for the maximum number of colors since only a few k-mers occur in more than 4 different genes. To get a better idea on how to choose n and ngenes you can check the output from the build_reference command. In that output, the maximum gene ID is printed (use this for ngenes) and the length of the pan-transcriptome is an upper bound for n.

How to run (general input)

panxpress index \
  --genes <reference>.fna \
  --index <path_prefix_for_index> \
  -n 20000000 \
  --ngenes 2600 \
  --colorset-size 4 \
  --mask "####_###_####_#__#__#_####_###_####" \
  --fill 0.95

How to run (provided files)

panxpress index \
  --genes ref/spliced_genomes/pantranscriptome_2_strains.fna \
  --index ref/pa_2_strains/spliced_index_2_strains \
  -n 20000000 \
  --ngenes 2600 \
  --colorset-size 4 \
  --mask "####_###_####_#__#__#_####_###_####" \
  --fill 0.95

Step 4 — Mapping

Read mapping is supported by the panxpress pmap command.

Arguments

  • --index | Path prefix of the index.
  • --fastq | Input FASTQ file.
  • --mapping-file | Pickle dictionary mapping gene IDs to gene names, generated during the reference building step.
  • --output-file | Output folder name.
  • --threads-mapping | Number of threads for mapping.
  • --unnamed_gene_id | Gene ID for hypothetical proteins. This can be obtained from the mapping file.

Tip: To get the correct value for unnamed_gene_id, you can use the following bash command:

unnamed_gene_id=$(python3 -c "import pickle; f='<output_aux_files>_gene_name_to_gene_id'; d=pickle.load(open(f,'rb')); print(d['unnamed'])")

Note that output_aux_files should be the parameter you used in the build_reference command.

How to run (general input)

panxpress pmap \
  --index <index_prefix> \
  --fastq <reads_fastq> \
  --output-file <output_folder> \
  --threads-mapping 8 \
  --mapping-file <gene_id_to_gene_name> \
  --unnamed_gene_id 3

How to run (provided files) - SINGLE END READS

panxpress map \
  --index ref/pa_2_strains/spliced_index_2_strains \
  --fastq reads/simulated_regulated_single_reads_pa_3_strains.fq \
  --output-file results/single_end_reads_pa_3_strains \
  --threads-mapping 8 \
  --mapping-file ref/pa_2_strains/pa_2_strains_gene_id_to_gene_name \
  --unnamed_gene_id $unnamed_gene_id

How to run (provided files) - PAIRED END READS

panxpress map \
  --index ref/pa_2_strains/spliced_index_2_strains \
  --fastq reads/simulated_regulated_paired_end_reads_pa_3_strains_1.fq \
  --paired-end reads/simulated_regulated_paired_end_reads_pa_3_strains_2.fq \
  --output-file results/paired_end_reads_pa_3_strains \
  --threads-mapping 8 \
  --mapping-file ref/pa_2_strains/pa_2_strains_gene_id_to_gene_name \
  --unnamed_gene_id $unnamed_gene_id

Step 5 — Gene expression values

You can additionally convert the information of how many reads are mapped to each gene to gene expression quantification in transcripts per million. This is suported by the panxpress convert_TPM command.

Arguments

  • --raw_counts_file | Output .mat file from the mapping step.
  • --genes_info_file_prefix | Path prefix for auxiliary data files (e.g., mapping of gene names to gene IDs used during the mapping step). Identical to output_aux_files in the build reference command.
  • --output_file | Output file name.

How to run (general input)

panxpress convert_TPM \
--raw_counts_file <raw_counts_file> \
--genes_info_file_prefix <genes_info_file_prefix> \
--output_file <output_file>

How to run (provided files)

panxpress convert_TPM \
--raw_counts_file results/single_end_reads_pa_3_strains/count.mat \
--genes_info_file_prefix ref/pa_2_strains/pa_2_strains \
--output_file results/single_end_reads_pa_3_strains/counts_TPM.txt

Full Workflow Scripts

For ready-to-run workflows on both simulated and real read data, please refer to the README in the scripts/ folder.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

panxpress-0.2.1.tar.gz (119.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

panxpress-0.2.1-py3-none-any.whl (134.3 kB view details)

Uploaded Python 3

File details

Details for the file panxpress-0.2.1.tar.gz.

File metadata

  • Download URL: panxpress-0.2.1.tar.gz
  • Upload date:
  • Size: 119.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for panxpress-0.2.1.tar.gz
Algorithm Hash digest
SHA256 819e11bc32a01f25dadeec4c5f8f4343cf0f462736f1817c199c231396d7baa6
MD5 750d44a4cc8a15fc6590f0eacce744dd
BLAKE2b-256 fc8fda178cdd500c51dcdb94fa5885d1a438fe4c814b08582482fdf903c349f2

See more details on using hashes here.

File details

Details for the file panxpress-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: panxpress-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 134.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for panxpress-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 bbad17a9809e425f755746da8296a0e9fa436047319f8daeb8968105686574b8
MD5 f9c6bf1d760d6ad8a53d5a8fc5439aa1
BLAKE2b-256 f6503ccdeff49cebe4830564ee807a5c652680100537eface442f53af3f79a75

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page