Foli-seq data processing and analysis tools

Project description

folitools

folitools is a modular CLI toolkit and Python library for Foli-seq data processing.

Foli-seq is an amplicon-based high-throughput sequencing method for profiling host gene expression from feces. It is designed to probe amplicons of 320-380 bp and to be sequenced on PE150 Illumina sequencers.


Installation

Install from PyPI:

pip install folitools

The following dependencies must also be installed from conda (note: cutadapt is intentionally not in this list — see below):

conda install -c bioconda -c conda-forge \
  fastp samtools bwa-mem2 star seqkit fastqc subread sambamba pigz gcc

Cutadapt fork

Starting with folitools 0.4.0, cutadapt is installed from our fork, published on PyPI as cutadapt-folitools (source: whatever60/cutadapt@folitools-perf). The fork installs the same cutadapt Python module and cutadapt console script as upstream, so nothing in your own tooling needs to change. It adds performance improvements that matter for the foli assign-probes workload (hundreds of i5/i7 primers, paired-end gzipped FASTQ inputs):

  • SeedMultiAdapterFilter — a seed-and-extend pre-filter for large non-anchored multi-adapter groups. Match objects are bit-identical to upstream cutadapt's per-adapter iteration; only the speed changes. Auto-activates above 8 eligible adapters in a group.
  • --input-compression-threads N — opt-in parallel decompression of gzipped inputs via pigz.
  • NPrefixIndexedAdapters — a dormant class that offers a fast path for ^N{k}<body> anchored adapters. Not wired into the auto path; reserved for future opt-in use.
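The seed-and-extend idea behind SeedMultiAdapterFilter can be illustrated with a toy sketch (this is not the fork's actual code; the function names and k-mer size are made up for illustration): index k-mers from every adapter, then run the full, slower alignment only against adapters that share at least one seed with the read.

```python
# Toy illustration of a seed-and-extend adapter pre-filter (illustrative
# only, not the fork's actual data structures): index k-mers from each
# adapter, then shortlist only adapters sharing at least one k-mer with
# the read before doing any full alignment.
from collections import defaultdict

def build_seed_index(adapters: dict[str, str], k: int = 8) -> dict[str, set[str]]:
    """Map each k-mer to the set of adapter names containing it."""
    index = defaultdict(set)
    for name, seq in adapters.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_adapters(read: str, index: dict[str, set[str]], k: int = 8) -> set[str]:
    """Shortlist adapters that share a k-mer with the read; only these
    would get the full (slow) alignment pass."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits |= index.get(read[i:i + k], set())
    return hits

adapters = {"i5_001": "ACGTACGTACGTACGT", "i5_002": "TTTTCCCCAAAAGGGG"}
index = build_seed_index(adapters)
print(candidate_adapters("NNNACGTACGTACGTNNN", index))  # only i5_001 survives
```

With hundreds of primers, most reads share a seed with only a handful of adapters, which is why a pre-filter of this kind pays off for the assign-probes workload.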

End-to-end on a representative 3.1 M read-pair sample with 542×2 primers (-j 16), the combined effect of the fork plus the xopen/pigz-backed writer in add_umi cuts one sample from ~7:30 wall time to ~2:45 (~2.7× faster).

cutadapt-folitools is published as an sdist, so pip install folitools will compile cutadapt from source. A C compiler is required — the gcc entry in the conda command above covers it on Linux. On macOS the Xcode command-line tools suffice.

If you already have the upstream cutadapt wheel installed from PyPI it will be replaced by cutadapt-folitools when you install folitools, because the two distributions install the same cutadapt module and console script. Don't install both side-by-side.

If you would prefer the upstream cutadapt wheel on PyPI, you can skip the fork with:

pip install --no-deps folitools
pip install "cutadapt>=5.1" <other-deps>

but the foli assign-probes stage will be slower.

Usage

Each stage of the pipeline is exposed as a command via foli <subcommand>.

A note on input paths: All commands that take input paths through the --input parameter accept absolute paths, relative paths, or glob patterns. Multiple paths/patterns may also be given.

A note on paired-end files: Foli-seq uses paired-end sequencing. When providing FASTQ files, only the R1 path is given to the command; the corresponding R2 files are derived automatically by replacing the R1 pattern with the R2 pattern (e.g., _R1_ → _R2_, _1 → _2). Both R1 and R2 files are required for the pipeline to work.
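The R1 → R2 derivation can be sketched in a few lines of Python (illustrative only; `derive_r2` is a hypothetical helper, not part of the folitools API, and the exact patterns folitools recognizes may differ slightly):

```python
# Sketch of deriving an R2 path from an R1 path by pattern substitution.
from pathlib import Path

def derive_r2(r1_path: str) -> str:
    name = Path(r1_path).name
    if "_R1_" in name:
        r2_name = name.replace("_R1_", "_R2_")
    elif "_1." in name:
        r2_name = name.replace("_1.", "_2.")
    else:
        raise ValueError(f"Cannot derive R2 from {r1_path!r}")
    return str(Path(r1_path).with_name(r2_name))

print(derive_r2("/data/sampleA_R1_001.fastq.gz"))  # /data/sampleA_R2_001.fastq.gz
```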

Expected Inputs

  • Raw paired-end FASTQ files (following the naming pattern *_R1_*.fastq.gz and *_R2_*.fastq.gz)
  • Adapter FASTA files
  • STAR genome index directory
  • GTF annotation file

Output directories are automatically created if they don't exist.

Step 1. Preprocessing

foli qc \
    --input "/path/to/fastq/*_R1_*.fastq.gz" \
    --output-dir "/path/to/trimmed_reads"

This command performs read trimming using fastp and runs quality checks on the output files with fastqc and seqkit.

What happens under the hood:

  1. fastp performs quality-based trimming with tail trimming and error correction
  2. fastqc generates quality control reports for the trimmed reads
  3. seqkit stats compiles summary statistics for all output files

You can specify input FASTQ files from any location using absolute paths, relative paths, or glob patterns. The paired-end R2 files are automatically derived from R1 files by replacing _R1_ with _R2_ (or _1 with _2).

Output files (in the specified output directory):

  • {sample_name}_1.fq.gz and {sample_name}_2.fq.gz: Trimmed paired-end reads
  • {sample_name}.fastp.html and {sample_name}.fastp.json: fastp QC reports
  • FastQC reports in {output_dir}_fastqc/
  • {output_dir}.stats: Summary statistics table generated by seqkit

Step 2. Probe assignment

foli assign-probes \
    --input "/path/to/trimmed_reads/*_1.fq.gz" \
    --output-dir "/path/to/umi_tagged" \
    --i5 "/path/to/i5_short.fasta" \
    --i7 "/path/to/i7_short.fasta"

This command performs gene probe primer assignment and trimming using cutadapt. This step extracts UMI sequences from adapter matches and embeds them in read names for downstream processing.

What happens under the hood:

  1. cutadapt identifies i5 and i7 adapters in the reads (minimum 20bp overlap, 10% error rate)
  2. Adapter sequences and primer names are added to read headers
  3. A custom Python script (folitools.add_umi) parses the adapter information and embeds UMI sequences into read names in a structured format
  4. seqkit stats generates summary statistics for the output files
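A minimal sketch of the name-rewriting part of step 3 (the real folitools.add_umi parses cutadapt's header annotations; this toy only shows the final rewrite, mirroring the @READ_ID_UMI1_UMI2_PRIMERPAIR shape that shows up in downstream read names):

```python
# Toy sketch of UMI embedding: append the UMIs observed next to each
# primer match, plus the primer pair name, to the read name so downstream
# tools can group reads by UMI. The exact add_umi format may differ.
def embed_umi(read_name: str, umi_r1: str, umi_r2: str, primer_pair: str) -> str:
    base = read_name.split()[0].lstrip("@")  # drop the comment field and '@'
    return f"@{base}_{umi_r1}_{umi_r2}_{primer_pair}"

print(embed_umi("@M00001:1:1101:1000:2000 1:N:0:1", "ACGTAC", "TGCATG", "FGR+FGR"))
```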

After probe assignment, you can get read statistics:

foli get-read-stats \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-dir "/path/to/read_stats" \
    --overwrite

This analyzes the UMI-tagged reads to generate detailed statistics about primer assignment and read characteristics.

Output files:

  • {sample_name}_1.fq.gz and {sample_name}_2.fq.gz: UMI-tagged, adapter-trimmed reads with UMI information embedded in read names
  • {output_dir}.stats: Summary statistics table generated by seqkit
  • Read statistics: {sample_name}.parquet files in the stats output directory (when using foli get-read-stats)

Step 3. Mapping and Feature Counting

foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf"

This command performs streamlined alignment and feature counting using STAR and featureCounts.

What happens under the hood:

  1. STAR index preloading: The genome index is loaded into shared memory once for all samples to improve processing speed
  2. Read filtering: Short reads (< 60bp, considered primer dimers) are removed using cutadapt
  3. Alignment: Filtered reads are aligned to the genome using STAR with:
    • Support for up to 1000 multi-mapping locations
    • Unmapped reads included in output (--outSAMunmapped Within)
    • Chimeric alignments detected and marked (--chimOutType WithinBAM)
  4. Feature assignment: featureCounts assigns aligned reads to genomic features from the GTF file
  5. BAM tagging and sorting:
    • UMI sequences are added as BAM tags (US for raw UMI, UC for filtered UMI)
    • Cell barcodes added as CB tag
    • Gene assignments added as XF tag
    • BAM files are coordinate-sorted using sambamba

Core allocation: The --cores parameter is automatically split between STAR (70%) and featureCounts (30%) to optimize pipeline throughput.
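The 70/30 split can be sketched as simple arithmetic (the exact rounding folitools uses may differ):

```python
# Split a core budget 70/30 between STAR and featureCounts, keeping at
# least one core for each tool.
def split_cores(cores: int) -> tuple[int, int]:
    star = max(1, round(cores * 0.7))
    fc = max(1, cores - star)
    return star, fc

print(split_cores(16))  # (11, 5)
```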

For fractional counting of reads that overlap multiple features, you can use the --allow-overlap and --allow-multimapping flags:

foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf" \
    --allow-overlap \
    --allow-multimapping

These flags enable fractional counting where reads overlapping multiple features or mapping to multiple locations contribute fractionally to each feature.
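Fractional counting in this sense can be sketched as follows (a toy illustration of the counting rule, not featureCounts' implementation):

```python
# Fractional counting: a read assigned to N features contributes 1/N to
# each, instead of being discarded or double-counted.
from collections import defaultdict

def fractional_counts(read_assignments: list[list[str]]) -> dict[str, float]:
    counts = defaultdict(float)
    for features in read_assignments:
        for f in features:
            counts[f] += 1.0 / len(features)
    return dict(counts)

# One read unique to geneA, one read overlapping geneA and geneB.
print(fractional_counts([["geneA"], ["geneA", "geneB"]]))
```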

Output files:

  • STAR alignment files in {output_star}/{sample_name}/:
    • Aligned.out.bam: Unsorted alignment file
    • Log.final.out, Log.out, Log.progress.out: STAR log files
    • ReadsPerGene.out.tab: Gene counts from STAR
  • Sorted, tagged BAM files in {output_bam}/:
    • {sample_name}.sorted.bam: Final sorted BAM with UMI and gene tags
    • {sample_name}.sorted.bam.bai: BAM index file
  • featureCounts results in {output_bam}/:
    • {sample_name}.featureCounts: Read-feature assignments
    • {sample_name}.featureCounts.summary: Assignment statistics

Step 4. UMI-based Gene Counting

foli count \
    --input "/path/to/mapped_reads/*.sorted.bam" \
    --output-dir "/path/to/count_results"

This command processes BAM files using umi_tools group to generate UMI-deduplicated count data.

What happens under the hood:

  1. UMI grouping: umi_tools group identifies and groups reads with:
    • The same UMI sequence (extracted from UC tag)
    • The same cell barcode (CB tag)
    • The same gene assignment (XF tag)
  2. Read handling:
    • Paired-end reads are processed together
    • Unmapped reads, chimeric pairs, and unpaired reads are all retained in the output
    • Only primary alignments are used for grouping (secondary/supplementary alignments are ignored)
  3. Deduplication method: Uses the "unique" method where only reads with identical UMI sequences are grouped together
  4. Output generation: Produces detailed grouping information showing which reads were collapsed into each UMI group

Input BAM files should contain UMI tags (UC), cell tags (CB), and gene assignment tags (XF) from the previous mapping step.
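The "unique" grouping logic can be sketched as a dictionary keyed on the three tags (illustrative; umi_tools' actual implementation differs):

```python
# Reads are grouped only when UMI (UC), cell barcode (CB), and gene
# assignment (XF) are all identical; each distinct triple is one UMI group.
from collections import defaultdict

def group_unique(reads: list[dict]) -> dict[tuple, list[str]]:
    groups = defaultdict(list)
    for r in reads:
        groups[(r["UC"], r["CB"], r["XF"])].append(r["name"])
    return dict(groups)

reads = [
    {"name": "r1", "UC": "ACGT", "CB": "S1", "XF": "geneA"},
    {"name": "r2", "UC": "ACGT", "CB": "S1", "XF": "geneA"},  # duplicate of r1
    {"name": "r3", "UC": "ACGG", "CB": "S1", "XF": "geneA"},  # different UMI
]
print(len(group_unique(reads)))  # 2 UMI groups
```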

After counting, generate the final count matrix:

foli get-count-mtx \
    --input "/path/to/count_results/*.group.tsv.gz" \
    --output "/path/to/final_counts.tsv" \
    --gtf "/path/to/annotation.gtf"

What happens under the hood:

  1. Reads all UMI grouping tables from umi_tools group output
  2. Counts unique UMI groups per gene per sample
  3. If GTF is provided, maps gene IDs to gene symbols
  4. Generates a gene × sample count matrix
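The matrix-building steps can be sketched as follows (the row shapes and group IDs here are illustrative, not the exact umi_tools group output schema):

```python
# Count unique UMI groups per gene per sample, then pivot into a
# gene x sample matrix.
from collections import defaultdict

def count_matrix(samples: dict[str, list[tuple[str, int]]]) -> dict[str, dict[str, int]]:
    """samples maps sample name -> list of (gene, umi_group_id) rows."""
    mtx = defaultdict(dict)
    for sample, rows in samples.items():
        per_gene = defaultdict(set)
        for gene, group_id in rows:
            per_gene[gene].add(group_id)  # duplicates collapse into one group
        for gene, groups in per_gene.items():
            mtx[gene][sample] = len(groups)
    return dict(mtx)

data = {"s1": [("geneA", 0), ("geneA", 0), ("geneA", 1), ("geneB", 2)]}
print(count_matrix(data))  # geneA has 2 UMI groups in s1, geneB has 1
```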

You can also use the Python function directly:

from folitools.get_matrix import read_counts
df = read_counts("/path/to/input_files", "/path/to/annotation.gtf")
df.to_csv("/path/to/output.tsv", sep="\t", index_label="gene")

Output files:

  • {sample_name}.group.tsv.gz: UMI grouping tables with detailed information about which reads were grouped together
  • {sample_name}.group.log: Processing logs showing statistics for each sample
  • Final count matrix: Gene × sample matrix in TSV format with gene symbols (if GTF provided) or gene IDs

Primer Selection Functionality

The primer selection module provides tools for designing and recovering PCR primer sets for targeted amplicon sequencing experiments.

Workflow

The primer selection workflow provides a complete pipeline for primer design:

# Run complete primer design workflow using built-in reference
foli-primer workflow \
    --genes "genes.tsv" \
    --species "mouse" \
    --output-dir "primer_design/"

# Or use custom transcriptome FASTA for better coverage
foli-primer workflow \
    --genes "genes.tsv" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --output-dir "primer_design/"

Input: Gene table TSV with columns gene and group
Output: Complete primer design including optimized primer sets, amplicon sequences, and IDT-compatible ordering files

Note: You can use either --species (mouse/human) for built-in references or --txome-fasta for custom transcriptome files. Custom FASTA files often provide better gene coverage than the built-in references.

Recover

The recover functionality helps validate and analyze primer sets from IDT order files:

# Recover primer information using built-in reference
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --species "human"

# Or use custom transcriptome FASTA (recommended for better coverage)
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta"

This command analyzes primer sequences from an IDT order file, validates them against a reference transcriptome, and generates output files for downstream analysis. The output FASTA files (i5_short.fasta and i7_short.fasta) will contain the same number of unique primer sequences as found in the input Excel file, ensuring all original primers are represented.

Note: Using --txome-fasta with a custom transcriptome file is recommended over the built-in --species references as it typically provides better gene coverage than the packaged Gencode references.

You can also specify the amplicon length range, and whether the primers have linkers:

foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --has-linker \
    --amplicon-length-range 300 400

Input: IDT order Excel file with primer sequences
Output:

  • summary_primer_to_order.xlsx: Validated primer summary with amplicon information
  • primer_diagnose.pdf: PDF report with analysis plots and statistics
  • i5_short.fasta and i7_short.fasta: FASTA files for sequencing adapters

Dimer Evaluation (through Python)

A primer set can be evaluated by calculating the thermodynamic properties of potential primer dimers for each pair of primers.

from folitools.primer_selection.eval_dimer import dimer_thermo_property

# Evaluate primer-dimer interactions
result_matrix = dimer_thermo_property(
    primer_fwd_fasta="forward_primers.fasta",
    primer_rev_fasta="reverse_primers.fasta", 
    output_dir="dimer_analysis/",
    output_suffix="_analysis"
)

Input: FASTA files containing forward and reverse primer sequences
Output:

  • Pairwise interaction matrix with thermodynamic properties
  • Detailed analysis files in the output directory

Testing

Run all tests:

pytest

Run only shell script tests:

pytest tests/test_shell_scripts.py -v

Run tests with coverage:

pytest --cov=src/folitools --cov-report=html

In silico amplification for primer pool evaluation

Given a list of primer sequences used in Foli-seq (with transcript-specific sequences, TruSeq R1/R2 overhangs, and with or without linker sequences), we validate the primer pool with de novo in silico PCR, without prior knowledge of the target genes, so that upstream gene selection and panel design are decoupled from this evaluation.

This functionality helps you understand which genes are actually targeted by your primer pool and identify potential issues like off-target binding or missing amplicons.

The workflow consists of the following steps:

  1. Primer parsing and classification: The input IDT order Excel file (first two columns: pool name and primer sequence) is parsed to extract primer sequences. Each primer is classified as forward or reverse based on matching against known universal prefixes:

    • Forward: ACACTCTTTCCCTACACGACGCTCTTCCGATCT (TruSeq R1 adapter) + NNNNNN (UMI) + optional ACATCA linker
    • Reverse: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (TruSeq R2 adapter) + NNNNNN (UMI) + optional ATAGTT linker

    After classification, the universal prefix and UMI are stripped to extract the gene-specific targeting sequence for each primer.

  2. Transcriptome mapping: Using seqkit locate, all gene-specific primer sequences are mapped against the provided transcriptome FASTA to identify where each primer can bind. This uses exact matching (no mismatches) to find all potential binding sites across all transcripts.

  3. Amplicon enumeration: Forward and reverse primer hits are systematically paired on each transcript to enumerate all possible amplicons. For a valid amplicon:

    • Forward primer must match the transcript sequence (sense strand)
    • Reverse primer must match the reverse complement (antisense strand)
    • Primers must be on opposite strands with proper orientation
    • Forward primer position must be upstream of reverse primer position
  4. Amplicon filtering: The enumerated amplicons are filtered based on:

    • Length constraints (default 320-380 bp, configurable via --amplicon-length-range)
    • Primer type validation (must be complementary R1 + R2 pair)

    Amplicons outside the target length range or with incorrect primer type combinations are excluded.

  5. Grouping and summarization: Valid amplicons are grouped by unique primer pairs. For each pair, the function:

    • Aggregates all transcripts and genes that can be amplified
    • Detects multi-mapping primers (primers binding to multiple genes)
    • Checks for cross-pool amplicons (primer pair from different pools)
    • Creates a semicolon-separated list of all transcript mappings
  6. Gene name simplification (optional, enabled by default): Gene symbols are processed to remove redundant information and improve readability. For example, gene families like "Gm12345" or immunoglobulin segments are consolidated to avoid cluttered names.
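Steps 3-4 above can be sketched for a single transcript (toy positions; `enumerate_amplicons` is a hypothetical helper, not the folitools implementation):

```python
# Pair forward (sense-strand) and reverse (antisense-strand) primer hits
# on one transcript: the forward site must be upstream of the reverse
# site, and the amplicon length (end_down - start_up + 1) must fall in
# the target range.
def enumerate_amplicons(fwd_hits, rev_hits, length_range=(320, 380)):
    """Hits are (start, end) 1-based inclusive positions on the transcript."""
    amplicons = []
    for f_start, f_end in fwd_hits:
        for r_start, r_end in rev_hits:
            if f_start < r_start:  # forward primer upstream of reverse primer
                length = r_end - f_start + 1
                ok = length_range[0] <= length <= length_range[1]
                amplicons.append((f_start, r_end, length, ok))
    return amplicons

# One forward hit, two reverse hits: only the first pairing is in range.
print(enumerate_amplicons([(100, 120)], [(430, 450), (900, 920)]))
```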

Output Files

The command generates several output files in the specified output directory, saved in the order they are produced during the workflow:

  1. locate_df.csv: Raw output from seqkit locate. Contains all matches of gene-specific primer sequences in the transcriptome. This is the unprocessed mapping result before any processing. Columns:

    • seqID: Transcript identifier from FASTA header (pipe-separated format)
    • pattern: The gene-specific primer sequence that was searched
    • strand: Strand orientation (+ or -)
    • start, end: Starting and ending positions of match on transcript (1-based, inclusive)
    • matched: The actual matched sequence from the transcript
  2. locate_df_final.csv: Processed version of locate_df with additional metadata joined from the primer info and parsed transcript/gene information. Columns are:

    • primer_seq, primer_seq_full: Gene-specific targeting sequence and complete primer (with universal prefix and UMI)
    • primer_type: Forward (fwd) or reverse (rev) classification
    • transcript_id, gene_id, gene_symbol: Transcript/gene identifiers parsed from seqID (indices 0, 1, and 5)
    • start, end: Starting and ending positions of match on transcript (1-based, inclusive)
    • strand: Strand orientation (+ or -)
    • pool: Pool assignment from IDT order file

    This file shows where each primer can bind across the transcriptome with full biological context.

  3. amplicon_all.csv: Complete enumeration of all possible amplicons formed by pairing forward and reverse primers on the same transcript. Each row represents one potential amplicon. Columns are:

    • primer_seq_fwd, primer_seq_rev: Forward and reverse primer gene-specific sequences
    • primer_seq_fwd_type, primer_seq_rev_type: Primer type classifications (should be fwd and rev)
    • transcript_id, gene_id, gene_symbol: Transcript and gene identifiers
    • start_up, end_up: Upstream (forward) primer binding site positions on transcript (1-based, inclusive)
    • start_down, end_down: Downstream (reverse) primer binding site positions on transcript (1-based, inclusive)
    • pool_fwd, pool_rev: Pool assignments for forward and reverse primers
    • flipped: Boolean indicating if primer pair orientation was flipped during enumeration
    • amplicon_length: Length of amplicon in base pairs (calculated as end_down - start_up + 1)
    • pass: Boolean indicating whether the amplicon passes filtering criteria (length range and complementary primer types)

    This includes all enumerated amplicons. Filter by pass == True to get only valid amplicons that would be expected in the actual experiment.

  4. grouped.csv: Summary table where amplicons are grouped by unique primer pairs. Each row represents one primer pair with aggregated information across all its amplicons. Columns are:

    • primer_seq_fwd, primer_seq_rev: Forward and reverse primer gene-specific sequences
    • transcript_id, gene_id, gene_symbol: Representative transcript/gene identifiers (from first amplicon in group sorted by gene_id and transcript_id)
    • start_up, end_up: Forward primer binding site positions on representative transcript (1-based, inclusive)
    • start_down, end_down: Reverse primer binding site positions on representative transcript (1-based, inclusive)
    • pool_fwd, pool_rev: Pool assignments for forward and reverse primers
    • num_transcripts, num_genes: Number of unique transcripts and genes amplified by this primer pair
    • transcript_id_all: Semicolon-separated list of all transcript-gene mappings for this pair (format: transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)

    This shows the specificity of each primer pair, reveals multi-gene amplification, and provides detailed binding site information for each transcript target.

  5. primer_diagnose.pdf: A diagnostic PDF report containing a log-log histogram of amplicon lengths across all enumerated amplicons, with vertical dashed lines indicating the target length range. This visualization helps identify the distribution of amplicon sizes and assess whether the specified range captures the intended products.

  6. summary_primer_to_order.xlsx: Final Excel summary compatible with the primer design workflow format. Contains columns:

    • Group, geneSymbol, geneID: Default group assignment, gene symbol(s), and gene ID (without version)
    • Chosen Index, amplicon_index: Design index (default 0) and unique identifier (transcript ID without version + chosen index)
    • L_seq, R_seq: Forward/reverse primer display sequences (with linker, excluding universal prefix and UMI)
    • primer_sequence_to_order_forward, primer_sequence_to_order_reverse: Complete primer sequences as they appear in the order file (with universal prefix, UMI, and linker)
    • L_pool, R_pool: Pool assignments for forward and reverse primers
    • multi_mapping: Semicolon-separated list of all transcript-gene mappings (format: transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)
  7. i5_short.fasta & i7_short.fasta: FASTA files containing gene-specific sequences for forward (i5) and reverse (i7) primers respectively. These sequences include any linker sequences but exclude the universal TruSeq adapters and UMIs. The number of records matches the number of unique primer sequences in the input order file. These files are suitable for use in downstream SADDLE analysis or dimer prediction.

  8. recover.log: Detailed log file capturing all processing steps, sanity checks, and warnings. The log includes both truncated output (for readability) and full details for comprehensive debugging.

Primer Naming Convention

In the output FASTA files (i5_short.fasta and i7_short.fasta), primer names (record IDs) are constructed based on the genes they amplify:

  • Single Gene Mapping: If a primer sequence maps to only one gene (e.g., GAPDH), the record ID is simply the gene symbol: GAPDH.

  • Multi-Gene Mapping: If a primer sequence maps to multiple genes, the record ID is a pipe-separated list of all unique gene symbols in sorted order. For example, a primer targeting both GENE_A and GENE_B would have ID: GENE_A|GENE_B.

  • Gene Name Simplification: When --simplify-gene-name is enabled (default), redundant tokens in gene symbols are removed. For instance, if a primer maps to both Ighv1-1 and Ighv1-2, these might be collapsed to a family representative like Ighv1 to avoid overly long names.

  • ID Collision Resolution: If multiple unique primer sequences map to the exact same set of genes (resulting in identical record IDs), a numeric suffix is automatically appended to distinguish them. For example: GAPDH, GAPDH.1, GAPDH.2. This ensures all primers are represented even when they target the same genes.

  • Missing Mappings: Primers from the input Excel file that don't appear in any valid amplicon after filtering (e.g., no transcriptome match, or all amplicons fail length/type filters) are still included in the output FASTA files with ID _NA to ensure completeness. This guarantees that the output FASTA files contain exactly the same number of unique primer sequences as in the input order file.
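The collision-resolution scheme above can be sketched as follows (an illustrative helper, not the folitools internal function):

```python
# Append numeric suffixes .1, .2, ... to repeated record IDs so every
# primer keeps a unique FASTA ID.
def resolve_collisions(ids: list[str]) -> list[str]:
    seen: dict[str, int] = {}
    out = []
    for i in ids:
        n = seen.get(i, 0)
        out.append(i if n == 0 else f"{i}.{n}")
        seen[i] = n + 1
    return out

print(resolve_collisions(["GAPDH", "GAPDH", "ACTB", "GAPDH"]))
# ['GAPDH', 'GAPDH.1', 'ACTB', 'GAPDH.2']
```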

Gene Name List Consolidation

When a primer maps to multiple genes, the gene name simplification feature (enabled by default with --simplify-gene-name) applies intelligent filtering rules to produce clean, informative primer names. The consolidation follows a two-stage process, controlled by the collapse_families argument (default: True). When collapse_families=False, only Stage 1 (pseudo-like filtering) is applied; otherwise, both stages are applied.

Stage 1: Pseudo-like Gene Filtering

The first stage identifies and removes "pseudo-like" gene symbols—names that are typically uninformative placeholders or annotation artifacts. A gene symbol is considered pseudo-like if it matches any of these patterns:

  • Database identifiers: Ensembl gene IDs (e.g., ENSG00000173366, ENSMUSG00000026193)
  • Mouse gene models: Placeholder names like Gm10358, Gm3839
  • RIKEN cDNA clones: Mouse placeholders ending in Rik (e.g., 1700003D09Rik)
  • Long non-coding RNA placeholders: Patterns like MMP24OS, FOXP3-AS1, LINC01234, AC123456.1, AL123456.1, RP11-...
  • Explicit pseudogene markers: Suffixes like -ps, -rs, or RNAxxSPx (e.g., Clca4c-ps, Rn18s-rs5, RNA18SP5)
  • Pseudogene patterns: Genes ending in digits followed by P (e.g., DEFA10P, CLCA3P, KRT8P1)
    • Exception list protects real genes: FOXP1-4, GBP1, ZBP1, GTPBP1, SPP1, DUSP1, AKAP1, TRAP1, RAMP1
  • Paralog-like variants: Patterns like Klk1b1 where a base ending in a digit is followed by lowercase letters and more digits

Filtering rule: If at least one non-pseudo-like gene symbol exists in the list, all pseudo-like symbols are dropped. If all symbols are pseudo-like, they are all retained to avoid losing all information.

Special cases:

  • Mitochondrial genes (prefixed with MT- or mt-) are always considered informative and kept
  • Readthrough genes (format GENE_A-GENE_B) are dropped only if both component genes (GENE_A and GENE_B) also appear separately in the list; otherwise the readthrough symbol is kept as informative

Stage 2: Family Collapsing

After pseudo-like filtering, the second stage collapses gene family members with trivial suffix variants—but only when a clear base gene exists. This prevents overly verbose names while avoiding arbitrary choices:

  • Paralog families: Patterns like Klk1b1, Klk1b3, Klk1b11, Klk1b14-ps are recognized as family members of base Klk1
    • Trivial tails must have the form <base><lowercase_letters><digits> or <base><digits><lowercase_letters>, optionally with a -ps suffix
    • Base must be at least 4 characters long
  • Pseudogene families: Patterns like KRT8P1, KRT8P2 are recognized as pseudogene variants of base KRT8
    • Pattern: <base_ending_in_digit>P<digits>

Collapsing rule: Family members are collapsed only if the base gene is present in the list. For example:

  • ["Klk1", "Klk1b1", "Klk1b3"] → ["Klk1"] (base present, collapse all variants)
  • ["Klk1b1", "Klk1b3"] → ["Klk1b1", "Klk1b3"] (base absent, keep both variants separately)
  • ["Selenbp1", "Selenbp2"] → ["Selenbp1", "Selenbp2"] (these are distinct paralogs, not trivial variants)

This conservative approach ensures that family members are only merged when there's clear evidence (presence of the base gene) that they represent variants of the same functional unit, avoiding silent loss of information.
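The collapsing rule can be sketched with a simplified trivial-tail pattern (the regex below is a stand-in for illustration; the real pattern described above is stricter):

```python
# Collapse a family member into its base gene only when the base itself
# appears in the list; otherwise keep the member as-is.
import re

def collapse_families(symbols: list[str]) -> list[str]:
    bases = set(symbols)
    kept = []
    for s in symbols:
        # Simplified trivial tail: base (>=4 chars) + lowercase letters +
        # digits, with an optional -ps suffix.
        m = re.fullmatch(r"(.{4,}?)[a-z]+\d+(?:-ps)?", s)
        if m and m.group(1) in bases and m.group(1) != s:
            continue  # collapsed into the base gene
        kept.append(s)
    return kept

print(collapse_families(["Klk1", "Klk1b1", "Klk1b3"]))  # base present: collapse
print(collapse_families(["Klk1b1", "Klk1b3"]))          # base absent: keep both
```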

Examples

  • ["GAPDH", "ENSG00000111640"] → ["GAPDH"] (Ensembl ID dropped)
  • ["Ppp1r1b", "1700003D09Rik"] → ["Ppp1r1b"] (RIKEN clone dropped)
  • ["DEFA1", "DEFA10P"] → ["DEFA1"] (pseudogene dropped)
  • ["Klk1", "Klk1b1", "Klk1b11", "Klk1b3"] → ["Klk1"] (family collapsed to base)
  • ["KRT8P1", "KRT8P2"] → ["KRT8P1", "KRT8P2"] (no base KRT8, so keep both)
  • ["IFNAR2-IL10RB"] → ["IFNAR2-IL10RB"] (readthrough kept when components absent)
  • ["KLRC4-KLRK1", "KLRK1"] → ["KLRK1"] (readthrough dropped when component present)

This simplification is applied when constructing FASTA record IDs from multi-gene mappings, making primer names more concise and biologically meaningful while preserving essential information.

Understanding Amplicon Counting Output

Foli-seq's amplicon-based design produces count matrices with feature names that differ from traditional bulk RNA-seq or scRNA-seq outputs. Because the protocol uses targeted primers that can amplify multiple amplicons (due to gene family similarity) and multiple primers can target the same transcript (or transcripts of the same gene), the feature naming reflects both gene identity and multi-mapping relationships.

During the pipeline, feature names evolve through several stages. When cutadapt identifies primer adapters in step 2, primer names are embedded into read names (e.g., @READ_ID_ACGTAC_TGCATG_FGR+FGR), creating amplicon-specific identifiers. In step 3, featureCounts assigns reads to gene features and stores results in the XT tag—either single genes (XT:Z:ENSG00000010030) or comma-separated lists for multi-mapping reads (XT:Z:ENSG00000010030,ENSG00000285441). The pipeline then combines these gene assignments with primer information to create composite features in the XF tag, such as ENSG00000010030,FGR+FGR (single gene amplified by FGR primers) or ENSG00000010030,ENSG00000285441,FGR+FGR (multi-mapping).

umi_tools group deduplicates reads by grouping on UMI sequence, cell barcode, and the gene-with-primer assignment from the XF tag. Finally, the read_counts function aggregates these amplicon-level counts to gene-level by stripping primer suffixes from feature names. When multiple primer pairs target the same transcript or transcripts of the same gene, their counts are averaged. Multi-mapping reads remain as pipe-separated features (e.g., GENE_A|GENE_B). If a GTF file is provided, gene IDs are converted to symbols while preserving multi-mapping relationships (e.g., ENSG00000010030 → FGR, or ENSG00000111|ENSG00000222 → Klk1b1|Klk1b3). For multi-gene features, the same gene name simplification logic described in the "Gene Name List Consolidation" section above is applied to remove pseudo-like genes and collapse family variants, ensuring consistent naming between primer pool validation and count matrix generation.

In summary, the resulting count matrix has samples as rows and "enhanced" gene features as columns. Single-mapping genes appear as simple gene symbols (FGR, GAPDH), while multi-mapping gene families are indicated with pipe-separated names (BCL9L|CXCR5). Thus, the values represent UMI-deduplicated read counts where each UMI corresponds to one original mRNA molecule, and multiple amplicons targeting the same gene are averaged. This design handles amplicon multiplicity through count aggregation and preserves biological ambiguity due to sequence similarity, providing gene-level quantification suitable for differential expression analysis.
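The gene-level aggregation can be sketched as follows (the feature strings follow the XF format described above; the second primer pair "FGR2+FGR2" is hypothetical, invented for illustration):

```python
# Strip the trailing "PRIMER+PRIMER" suffix from each XF-style feature,
# then average the counts of amplicons that collapse to the same gene
# feature. Multi-gene features keep their comma-separated gene part.
from collections import defaultdict

def aggregate_to_genes(amplicon_counts: dict[str, float]) -> dict[str, float]:
    per_gene = defaultdict(list)
    for feature, count in amplicon_counts.items():
        gene = feature.rsplit(",", 1)[0]  # drop trailing primer-pair suffix
        per_gene[gene].append(count)
    return {g: sum(v) / len(v) for g, v in per_gene.items()}

counts = {
    "ENSG00000010030,FGR+FGR": 100.0,
    "ENSG00000010030,FGR2+FGR2": 80.0,  # second primer pair, same gene
    "ENSG00000010030,ENSG00000285441,FGR+FGR": 10.0,
}
print(aggregate_to_genes(counts))
```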

TODOs

  • Add QC report code that summarizes all metrics into a table (need refactoring)

Download files

Download the file for your platform.

Source Distribution

folitools-0.5.0.tar.gz (135.8 kB)

Uploaded: Source

Built Distributions

folitools-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl (1.0 MB)

Uploaded: CPython 3.10+, manylinux: glibc 2.28+, x86-64

folitools-0.5.0-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (1.6 MB)

Uploaded: CPython 3.10+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file folitools-0.5.0.tar.gz.

File metadata

  • Download URL: folitools-0.5.0.tar.gz
  • Upload date:
  • Size: 135.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for folitools-0.5.0.tar.gz
Algorithm Hash digest
SHA256 ba548579fadbb94ac8ca93743259378ec377b689c5e1605a5783ee8e82236b4a
MD5 77db9b63bb4736b695e91ddeeb650a22
BLAKE2b-256 23f7384ecdcad1dd65db35673ef9475aaea63fabd0f96370b02c3f46bccc80bb

See more details on using hashes here.

Provenance

The following attestation bundles were made for folitools-0.5.0.tar.gz:

Publisher: publish.yml on whatever60/folitools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file folitools-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for folitools-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d17fde9d3f53423d634206abd121a5d02b59a5cf7e51ad7ed225efed02c27d0c
MD5 d20ecd240a6ccfe30468123a21e79e57
BLAKE2b-256 7889ff15abb2417a6998dc3ed25b7ba9860743cd7b98af995cdd608493760f7a


Provenance

The following attestation bundles were made for folitools-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl:

Publisher: publish.yml on whatever60/folitools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file folitools-0.5.0-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

File hashes

Hashes for folitools-0.5.0-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 090b6e803bc76e9e3cae7b78ac7d64db25121063c742baaecf329d08c584ed7a
MD5 98d2e32d7a4ea03631363f578dfe7f64
BLAKE2b-256 1a17eb2ec5c90bbe1c94da587161626b4eee720ca8a9146f99c9153bb1d75eb7


Provenance

The following attestation bundles were made for folitools-0.5.0-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

Publisher: publish.yml on whatever60/folitools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
