Gene expression quantification with a pan-transcriptomic gapped k-mer index
Project description
PanXpress: Gene Expression Quantification with a Pan-Transcriptomic Gapped K-mer Index
PanXpress is a unified framework for bacterial pan-transcriptomics that:
- Corrects ambiguous annotations in GFF files using a two-step algorithm
- Constructs a pan-transcriptomic reference FASTA file from genomic FASTA and corrected GFF annotation files
- Builds a gapped k-mer index over the pan-transcriptomic reference
- Supports alignment-free mapping of reads to genes from FASTQ samples
- Quantifies gene expression across strains
In case of problems, please file an issue in the issue tracker.
See CHANGELOG.md for recent changes.
Usage Guide
To use PanXpress you will need the following data and files:
- GFF annotation files and genome FASTA files for the bacterial strains to be included in the reference index. For a given strain, the annotation and genome files must share the same filename (differing only in their extension).
- FASTQ samples for mapping.
Additionally, for the correction of annotation files where similar proteins are grouped into gene groups, protein FASTA files can be provided by the user. If these are not available, they can be generated by PanXpress using the agat tool.
The typical workflow proceeds in five steps: (1) correct the GFF annotation files, (2) build the pan-transcriptomic reference, (3) construct the index, (4) map reads, and (5) convert the raw reads counts to transcripts per million.
Installation Guide
To run the software, a conda environment with the required libraries needs to be created. A list of needed libraries is provided in the environment.yml file in the repository. You can create the environment with the following command:
conda env create
An environment with the name panxpress will be created. To activate the environment and install the package from the repository, run:
conda activate panxpress
pip install -e .
Examples
To better understand how to run PanXpress, we included into this repository a folder reads with a few simulated reads from a mixture of 3 strains of pseudomonas aerugionosa (both single and paired end) and a folder ref, that contains pa_2_strains, with the GFF annotation files, the genome and protein FASTA files for 2 strains of pseudomonas aerugionsa. With will use these files to exemplify how to use PanXpress.
Step 1 — Correcting the Annotation Files
We recommend applying a correction algorithm to the annotation GFF files using the panxpress correct_gff command.
Proteins are grouped into gene groups via a two-step process:
- Jaccard filtering: The Jaccard similarity is computed for all pairs of proteins across bacterial strains. Pairs with a similarity score above a threshold
t1proceed to the next step. - Alignment filtering: Surviving pairs are aligned. Pairs with a normalized alignment score above a threshold
t2are grouped into the same gene group and assigned a shared gene group name. This information is then used to rewrite the annotation files.
Arguments
--input_gff_folder| Path to the folder containing the input annotation.gfffiles.--input_normalized_gff_folder| Path to a folder where normalized GFF files will be written. Two corrections are applied: (1) each protein ID is given a single consistent name across strains; (2) genes from plasmids are appended with the suffix"Plasmid"to distinguish them from chromosomal genes in expression results.--output_gff_folder| Path to the folder where corrected annotation.gfffiles will be stored. Hypothetical proteins are assigned the gene name"unnamed".--t1| Similarity threshold for step 1 (Jaccard filtering).--t2| Score threshold for step 2 (alignment filtering).--threads| Number of threads for parallelized Jaccard similarity calculation and protein pair alignments.--output_plot| Path prefix for plots of the alignment score distribution.--output_folder_data| Path prefix for output data files (Jaccard similarity values, alignment scores, group names, etc.).--input_protein_folder| Path to a folder containing protein.faafiles. If not available, provide the two arguments below instead.--input_genome_folder| (Alternative to--input_protein_folder) Path to a folder containing genome.fnafiles, used to generate protein files.--output_protein_folder| (Alternative to--input_protein_folder) Path to the folder where generated protein.faafiles will be stored.
How to run (general input)
With protein FASTA files provided:
panxpress correct_gff \
--input_protein_folder <folder_with_proteins> \
--input_gff_folder <folder_with_annotations> \
--input_normalized_gff_folder <folder_for_normalized_annotations> \
--output_gff_folder <folder_for_corrected_annotations> \
--k 7 \
--t1 0.02 \
--t2 0.75 \
--threads 8 \
--output_folder_data <filename_prefix_for_data>
Without protein FASTA files (generate them from genomes):
panxpress correct_gff \
--input_genome_folder <folder_with_genomes> \
--output_protein_folder <folder_with_proteins> \
--input_gff_folder <folder_with_annotations> \
--input_normalized_gff_folder <folder_for_normalized_annotations> \
--output_gff_folder <folder_for_corrected_annotations> \
--k 7 \
--t1 0.02 \
--t2 0.75 \
--threads 8 \
--output_folder_data <filename_prefix_for_data>
How to run (provided files)
With protein FASTA files provided:
panxpress correct_gff \
--input_protein_folder ref/pa_2_strains/protein_ncbi \
--input_gff_folder ref/pa_2_strains/gff3 \
--input_normalized_gff_folder ref/pa_2_strains/gff3_normalized \
--output_gff_folder ref/pa_2_strains/gff3_corrected \
--k 7 \
--t1 0.02 \
--t2 0.75 \
--threads 8 \
--output_folder_data ref/pa_2_strains
Without protein FASTA files (generate them from genomes):
panxpress correct_gff \
--input_genome_folder ref/pa_2_strains/genomes \
--output_protein_folder ref/pa_2_strains/protein_agat \
--input_gff_folder ref/pa_2_strains/gff3 \
--input_normalized_gff_folder ref/pa_2_strains/gff3_normalized \
--output_gff_folder ref/pa_2_strains/gff3_corrected \
--k 7 \
--t1 0.02 \
--t2 0.75 \
--threads 8 \
--output_folder_data ref/pa_2_strains
Step 2 — Building the Reference
The pan-transcriptomic reference is built using the panxpress build_reference command.
The reference is a FASTA file where each header is a unique gene identifier and each entry corresponds to one occurrence of a given gene name in a given strain. Because many gene names are shared across strains, multiple entries may share the same header. Since some tools (e.g., Bowtie2 and Salmon) do not support FASTA files with duplicate headers, PanXpress also generates a version of the reference with fully unique IDs.
Arguments
--annotation_dir| Path to the folder containing the corrected annotation.gfffiles.--genomes_dir| Path to the folder containing the genome.fnafiles.--output_reference_file| Output FASTA file.--output_reference_file_unique| Output FASTA file — unique headers variant.--output_reference_file_genes| Output FASTA file containing only named genes (hypothetical proteins excluded).--output_reference_file_unique_genes| Output FASTA file containing only named genes — unique headers variant.--output_aux_files| Path prefix for auxiliary data files (e.g., mapping of gene names to gene IDs used during the mapping step).
How to run (general input)
panxpress build_reference \
--annotation_dir <folder_with_annotations> \
--genomes_dir <folder_with_genomes> \
--output_reference_file <filename>.fna \
--output_reference_file_unique <filename>.fna \
--output_reference_file_genes <filename>.fna \
--output_reference_file_unique_genes <filename>.fna \
--output_aux_files <path_prefix_for_auxiliary_data>
How to run (provided files)
panxpress build_reference \
--annotation_dir ref/pa_2_strains/gff3_corrected \
--genomes_dir ref/pa_2_strains/genomes \
--output_reference_file ref/spliced_genomes/pantranscriptome_2_strains.fna \
--output_reference_file_unique ref/spliced_genomes/pantranscriptome_unique_headers_2_strains.fna \
--output_reference_file_genes ref/spliced_genomes/pantranscriptome_genes_2_strains.fna \
--output_reference_file_unique_genes ref/spliced_genomes/pantranscriptome_genes_unique_headers_2_strains.fna \
--output_aux_files ref/pa_2_strains/pa_2_strains > ref/build_reference_2_strains.log
Step 3 — Index Construction
The index is built using the panxpress index command.
The index is backed by a cuckoo hash table that stores a mapping of gapped k-mers to the genes in which they appear. The number of genes tracked per k-mer is controlled by the --colorset-size parameter. If you want PanXpress to maximize the number of colors within memory constraints, provide the total number of genes in the pan-transcriptomic reference via --ngenes and the maximum color set size will be calculated automatically.
Arguments
--genes| FASTA file of the reference.--index| Output path prefix for the resulting index.-n| Estimated number of k-mers.--ngenes| Number of unique gene IDs.--colorset-size| Number of genes tracked per k-mer entry.--mask| Mask pattern used to compute gapped k-mers (e.g., -"####_###_####_#__#__#_####_###_####").--k| K-mer length.--fill| Fill rate parameter for the cuckoo hash table.
Tip: For pseudomonas aeruginosa, 4 is a good number for the maximum number of colors since only a few k-mers occur in more than 4 different genes. To get a better idea on how to choose n and ngenes you can check the output from the build_reference command. In that output, the maximum gene ID is printed (use this for ngenes) and the length of the pan-transcriptome is an upper bound for n.
How to run (general input)
panxpress index \
--genes <reference>.fna \
--index <path_prefix_for_index> \
-n 20000000 \
--ngenes 2600 \
--colorset-size 4 \
--mask "####_###_####_#__#__#_####_###_####" \
--fill 0.95
How to run (provided files)
panxpress index \
--genes ref/spliced_genomes/pantranscriptome_2_strains.fna \
--index ref/pa_2_strains/spliced_index_2_strains \
-n 20000000 \
--ngenes 2600 \
--colorset-size 4 \
--mask "####_###_####_#__#__#_####_###_####" \
--fill 0.95
Step 4 — Mapping
Read mapping is supported by the panxpress pmap command.
Arguments
--index| Path prefix of the index.--fastq| Input FASTQ file.--mapping-file| Pickle dictionary mapping gene IDs to gene names, generated during the reference building step.--output-file| Output folder name.--threads-mapping| Number of threads for mapping.--unnamed_gene_id| Gene ID for hypothetical proteins. This can be obtained from the mapping file.
Tip: To get the correct value for unnamed_gene_id, you can use the following bash command:
unnamed_gene_id=$(python3 -c "import pickle; f='<output_aux_files>_gene_name_to_gene_id'; d=pickle.load(open(f,'rb')); print(d['unnamed'])")
Note that output_aux_files should be the parameter you used in the build_reference command.
How to run (general input)
panxpress pmap \
--index <index_prefix> \
--fastq <reads_fastq> \
--output-file <output_folder> \
--threads-mapping 8 \
--mapping-file <gene_id_to_gene_name> \
--unnamed_gene_id 3
How to run (provided files) - SINGLE END READS
panxpress map \
--index ref/pa_2_strains/spliced_index_2_strains \
--fastq reads/simulated_regulated_single_reads_pa_3_strains.fq \
--output-file results/single_end_reads_pa_3_strains \
--threads-mapping 8 \
--mapping-file ref/pa_2_strains/pa_2_strains_gene_id_to_gene_name \
--unnamed_gene_id $unnamed_gene_id
How to run (provided files) - PAIRED END READS
panxpress map \
--index ref/pa_2_strains/spliced_index_2_strains \
--fastq reads/simulated_regulated_paired_end_reads_pa_3_strains_1.fq \
--paired-end reads/simulated_regulated_paired_end_reads_pa_3_strains_2.fq \
--output-file results/paired_end_reads_pa_3_strains \
--threads-mapping 8 \
--mapping-file ref/pa_2_strains/pa_2_strains_gene_id_to_gene_name \
--unnamed_gene_id $unnamed_gene_id
Step 5 — Gene expression values
You can additionally convert the information of how many reads are mapped to each gene to gene expression quantification in transcripts per million. This is suported by the panxpress convert_TPM command.
Arguments
--raw_counts_file| Output.matfile from the mapping step.--genes_info_file_prefix| Path prefix for auxiliary data files (e.g., mapping of gene names to gene IDs used during the mapping step). Identical tooutput_aux_filesin the build reference command.--output_file| Output file name.
How to run (general input)
panxpress convert_TPM \
--raw_counts_file <raw_counts_file> \
--genes_info_file_prefix <genes_info_file_prefix> \
--output_file <output_file>
How to run (provided files)
panxpress convert_TPM \
--raw_counts_file results/single_end_reads_pa_3_strains/count.mat \
--genes_info_file_prefix ref/pa_2_strains/pa_2_strains \
--output_file results/single_end_reads_pa_3_strains/counts_TPM.txt
Full Workflow Scripts
For ready-to-run workflows on both simulated and real read data, please refer to the README in the scripts/ folder.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file panxpress-0.2.1.tar.gz.
File metadata
- Download URL: panxpress-0.2.1.tar.gz
- Upload date:
- Size: 119.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
819e11bc32a01f25dadeec4c5f8f4343cf0f462736f1817c199c231396d7baa6
|
|
| MD5 |
750d44a4cc8a15fc6590f0eacce744dd
|
|
| BLAKE2b-256 |
fc8fda178cdd500c51dcdb94fa5885d1a438fe4c814b08582482fdf903c349f2
|
File details
Details for the file panxpress-0.2.1-py3-none-any.whl.
File metadata
- Download URL: panxpress-0.2.1-py3-none-any.whl
- Upload date:
- Size: 134.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bbad17a9809e425f755746da8296a0e9fa436047319f8daeb8968105686574b8
|
|
| MD5 |
f9c6bf1d760d6ad8a53d5a8fc5439aa1
|
|
| BLAKE2b-256 |
f6503ccdeff49cebe4830564ee807a5c652680100537eface442f53af3f79a75
|