Foli-seq data processing and analysis tools

Project description

folitools

folitools is a modular CLI toolkit and python library for Foli-seq data processing.

Foli-seq is a amplicon-based high-throughput sequencing method for profiling host gene expression from feces. Foli-seq is developed to probe amplicons of 320-380bp and sequence on PE150 Illumina sequencers.

Installation

Install from PyPI:

pip install folitools

The dependencies from conda are also required (note: cutadapt is intentionally not in this list — see below):

conda install -c bioconda -c conda-forge \
  fastp samtools bwa-mem2 star seqkit fastqc subread sambamba pigz gcc

Cutadapt fork

Starting with folitools 0.4.0, cutadapt is installed from our fork, published on PyPI as cutadapt-folitools (source: whatever60/cutadapt@folitools-perf). The fork installs the same cutadapt Python module and cutadapt console script as upstream, so nothing in your own tooling needs to change. It adds performance improvements that matter for the foli assign-probes workload (hundreds of i5/i7 primers, paired-end gzipped FASTQ inputs):

SeedMultiAdapterFilter — a seed-and-extend pre-filter for large non-anchored multi-adapter groups. Match objects are bit-identical to upstream cutadapt's per-adapter iteration; only the speed changes. Auto-activates above 8 eligible adapters in a group.
--input-compression-threads N — opt-in parallel decompression of gzipped inputs via pigz.
NPrefixIndexedAdapters — a dormant class that offers a fast path for ^N{k}<body> anchored adapters. Not wired into the auto path; reserved for future opt-in use.

End-to-end on a representative 3.1 M read-pair sample with 542×2 primers (-j 16), the combined effect of the fork plus the xopen/pigz-backed writer in add_umi cuts one sample from ~7:30 wall time to ~2:45 (~2.7× faster).

cutadapt-folitools is published as an sdist, so pip install folitools will compile cutadapt from source. A C compiler is required — the gcc entry in the conda command above covers it on Linux. On macOS the Xcode command-line tools suffice.

If you already have the upstream cutadapt wheel installed from PyPI it will be replaced by cutadapt-folitools when you install folitools, because the two distributions install the same cutadapt module and console script. Don't install both side-by-side.

If you would prefer the upstream cutadapt wheel on PyPI, you can skip the fork with:

pip install --no-deps folitools
pip install "cutadapt>=5.1" <other-deps>

but the foli assign-probes stage will be slower.

Usage

Each stage of the pipeline is exposed as a command via foli <subcommand>.

A note on input paths: All commands that accept input paths through the --input parameter allow files locations of absolute paths, relative paths, or glob patterns. Multiple paths/patterns are also allowed.

A note on paired-end files: Foli-seq uses paired-end sequencing. When inputing FASTQ files, only R1 path is given to the command and the corresponding R2 files are automatically derived by replacing the R1 pattern with R2 pattern (e.g., _R1_ → _R2_, _1 → _2). Both R1 and R2 files are required for the pipeline to work.

Expected Inputs

Raw paired-end FASTQ files (following the naming pattern *_R1_*.fastq.gz and *_R2_*.fastq.gz)
Adapter FASTA files
STAR genome index directory
GTF annotation file

Output directories are automatically created if they don't exist.

Step 1. Preprocessing

foli qc \
    --input "/path/to/fastq/*_R1_*.fastq.gz" \
    --output-dir "/path/to/trimmed_reads"

This command performs read trimming using fastp and runs quality check on output files with fastqc and seqkit.

What happens under the hood:

fastp performs quality-based trimming with tail trimming and error correction
fastqc generates quality control reports for the trimmed reads
seqkit stats compiles summary statistics for all output files

You can specify input FASTQ files from any location using absolute paths, relative paths, or glob patterns. The paired-end R2 files are automatically derived from R1 files by replacing _R1_ with _R2_ (or _1 with _2).

Output files (in the specified output directory):

{sample_name}_1.fq.gz and {sample_name}_2.fq.gz: Trimmed paired-end reads
{sample_name}.fastp.html and {sample_name}.fastp.json: fastp QC reports
FastQC reports in {output_dir}_fastqc/
{output_dir}.stats: Summary statistics table generated by seqkit

Step 2. Probe assignment

foli assign-probes \
    --input "/path/to/trimmed_reads/*_1.fq.gz" \
    --output-dir "/path/to/umi_tagged" \
    --i5 "/path/to/i5_short.fasta" \
    --i7 "/path/to/i7_short.fasta"

This command performs gene probe primer assignment and trimming using cutadapt. This step extracts UMI sequences from adapter matches and embeds them in read names for downstream processing.

What happens under the hood:

cutadapt identifies i5 and i7 adapters in the reads (minimum 20bp overlap, 10% error rate)
Adapter sequences and primer names are added to read headers
A custom Python script (folitools.add_umi) parses the adapter information and embeds UMI sequences into read names in a structured format
seqkit stats generates summary statistics for the output files

After probe assignment, you can get read statistics:

foli get-read-stats \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-dir "/path/to/read_stats" \
    --overwrite

This analyzes the UMI-tagged reads to generate detailed statistics about primer assignment and read characteristics.

Output files:

{sample_name}_1.fq.gz and {sample_name}_2.fq.gz: UMI-tagged, adapter-trimmed reads with UMI information embedded in read names
{output_dir}.stats: Summary statistics table generated by seqkit
Read statistics: {sample_name}.parquet files in the stats output directory (when using foli get-read-stats)

Step 3. Mapping and Feature Counting

foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf"

This command performs streamlined alignment and feature counting using STAR and featureCounts.

What happens under the hood:

STAR index preloading: The genome index is loaded into shared memory once for all samples to improve processing speed
Read filtering: Short reads (< 60bp, considered primer dimers) are removed using cutadapt
Alignment: Filtered reads are aligned to the genome using STAR with:
- Support for up to 1000 multi-mapping locations
- Unmapped reads included in output (--outSAMunmapped Within)
- Chimeric alignments detected and marked (--chimOutType WithinBAM)
Feature assignment: featureCounts assigns aligned reads to genomic features from the GTF file
BAM tagging and sorting:
- UMI sequences are added as BAM tags (US for raw UMI, UC for filtered UMI)
- Cell barcodes added as CB tag
- Gene assignments added as XF tag
- BAM files are coordinate-sorted using sambamba

Core allocation: The --cores parameter is automatically split between STAR (70%) and featureCounts (30%) to optimize pipeline throughput.

For fractional counting of reads that overlap multiple features, you can use the --allow-overlap and --allow-multimapping flags:

foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf" \
    --allow-overlap \
    --allow-multimapping

These flags enable fractional counting where reads overlapping multiple features or mapping to multiple locations contribute fractionally to each feature.

Output files:

STAR alignment files in {output_star}/{sample_name}/:
- Aligned.out.bam: Unsorted alignment file
- Log.final.out, Log.out, Log.progress.out: STAR log files
- ReadsPerGene.out.tab: Gene counts from STAR
Sorted, tagged BAM files in {output_bam}/:
- {sample_name}.sorted.bam: Final sorted BAM with UMI and gene tags
- {sample_name}.sorted.bam.bai: BAM index file
featureCounts results in {output_bam}/:
- {sample_name}.featureCounts: Read-feature assignments
- {sample_name}.featureCounts.summary: Assignment statistics

Step 4. UMI-based Gene Counting

foli count \
    --input "/path/to/mapped_reads/*.sorted.bam" \
    --output-dir "/path/to/count_results"

This command processes BAM files using umi_tools group to generate UMI-deduplicated count data.

What happens under the hood:

UMI grouping: umi_tools group identifies and groups reads with:
- The same UMI sequence (extracted from UC tag)
- The same cell barcode (CB tag)
- The same gene assignment (XF tag)
Read handling:
- Paired-end reads are processed together
- Unmapped reads, chimeric pairs, and unpaired reads are all retained in the output
- Only primary alignments are used for grouping (secondary/supplementary alignments are ignored)
Deduplication method: Uses the "unique" method where only reads with identical UMI sequences are grouped together
Output generation: Produces detailed grouping information showing which reads were collapsed into each UMI group

Input BAM files should contain UMI tags (UC), cell tags (CB), and gene assignment tags (XF) from the previous mapping step.

After counting, generate the final count matrix:

foli get-count-mtx \
    --input "/path/to/count_results/*.group.tsv.gz" \
    --output "/path/to/final_counts.tsv" \
    --gtf "/path/to/annotation.gtf"

What happens under the hood:

Reads all UMI grouping tables from umi_tools group output
Counts unique UMI groups per gene per sample
If GTF is provided, maps gene IDs to gene symbols
Generates a gene × sample count matrix

You can also use the Python function directly:

from folitools.get_matrix import read_counts
df = read_counts("/path/to/input_files", "/path/to/annotation.gtf")
df.to_csv("/path/to/output.tsv", sep="\t", index_label="gene")

Output files:

{sample_name}.group.tsv.gz: UMI grouping tables with detailed information about which reads were grouped together
{sample_name}.group.log: Processing logs showing statistics for each sample
Final count matrix: Gene × sample matrix in TSV format with gene symbols (if GTF provided) or gene IDs

Primer Selection Functionality

The primer selection module provides tools for designing and recovering PCR primer sets for targeted amplicon sequencing experiments.

Workflow

The primer selection workflow provides a complete pipeline for primer design:

# Run complete primer design workflow using built-in reference
foli-primer workflow \
    --genes "genes.tsv" \
    --species "mouse" \
    --output-dir "primer_design/"

# Or use custom transcriptome FASTA for better coverage
foli-primer workflow \
    --genes "genes.tsv" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --output-dir "primer_design/"

Input: Gene table TSV with columns gene and group
Output: Complete primer design including optimized primer sets, amplicon sequences, and IDT-compatible ordering files

Note: You can use either --species (mouse/human) for built-in references or --txome-fasta for custom transcriptome files. Custom FASTA files often provide better gene coverage than the built-in references.

Recover

The recover functionality helps validate and analyze primer sets from IDT order files:

# Recover primer information using built-in reference
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --species "human"

# Or use custom transcriptome FASTA (recommended for better coverage)
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta"

This command analyzes primer sequences from an IDT order file, validates them against a reference transcriptome, and generates output files for downstream analysis. The output FASTA files (i5_short.fasta and i7_short.fasta) will contain the same number of unique primer sequences as found in the input Excel file, ensuring all original primers are represented.

Note: Using --txome-fasta with a custom transcriptome file is recommended over the built-in --species references as it typically provides better gene coverage than the packaged Gencode references.

You can also specify the amplicon length range, and whether the primers have linkers:

foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --has-linker \
    --amplicon-length-range 300 400

Input: IDT order Excel file with primer sequences
Output:

summary_primer_to_order.xlsx: Validated primer summary with amplicon information
primer_diagnose.pdf: PDF report with analysis plots and statistics
i5_short.fasta and i7_short.fasta: FASTA files for sequencing adapters

Dimer Evaluation (through Python)

A primer set can be evaluated by calculating the thermodynamic properties of potential primer dimers for each pair of primers.

from folitools.primer_selection.eval_dimer import dimer_thermo_property

# Evaluate primer-dimer interactions
result_matrix = dimer_thermo_property(
    primer_fwd_fasta="forward_primers.fasta",
    primer_rev_fasta="reverse_primers.fasta", 
    output_dir="dimer_analysis/",
    output_suffix="_analysis"
)

Input: FASTA files containing forward and reverse primer sequences
Output:

Pairwise interaction matrix with thermodynamic properties
Detailed analysis files in the output directory

Testing

Run all tests:

pytest

Run only shell script tests:

pytest tests/test_shell_scripts.py -v

Run tests with coverage:

pytest --cov=src/folitools --cov-report=html

In silico amplification for primer pool evaluation

Given a list of primer sequences used in foli-seq (with transcript-specific sequences, TruSeq R1/R2 overhang and with or without linker sequences), we validate the primer pool with de novo in silico PCR, without prior knowledge of the target genes, so that the upstream gene selection & panel design is decoupled from the evaluation here.

This functionality helps you understand which genes are actually targeted by your primer pool and identify potential issues like off-target binding or missing amplicons.

The workflow consists of the following steps:

Primer parsing and classification: The input IDT order Excel file (first two columns: pool name and primer sequence) is parsed to extract primer sequences. Each primer is classified as forward or reverse based on matching against known universal prefixes:
- Forward: ACACTCTTTCCCTACACGACGCTCTTCCGATCT (TruSeq R1 adapter) + NNNNNN (UMI) + optional ACATCA linker
- Reverse: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (TruSeq R2 adapter) + NNNNNN (UMI) + optional ATAGTT linker
After classification, the universal prefix and UMI are stripped to extract the gene-specific targeting sequence for each primer.
Transcriptome mapping: Using seqkit locate, all gene-specific primer sequences are mapped against the provided transcriptome FASTA to identify where each primer can bind. This uses exact matching (no mismatches) to find all potential binding sites across all transcripts.
Amplicon enumeration: Forward and reverse primer hits are systematically paired on each transcript to enumerate all possible amplicons. For a valid amplicon:
- Forward primer must match the transcript sequence (sense strand)
- Reverse primer must match the reverse complement (antisense strand)
- Primers must be on opposite strands with proper orientation
- Forward primer position must be upstream of reverse primer position
Amplicon filtering: The enumerated amplicons are filtered based on:
- Length constraints (default 320-380 bp, configurable via --amplicon-length-range)
- Primer type validation (must be complementary R1 + R2 pair)
Amplicons outside the target length range or with incorrect primer type combinations are excluded.
Grouping and summarization: Valid amplicons are grouped by unique primer pairs. For each pair, the function:
- Aggregates all transcripts and genes that can be amplified
- Detects multi-mapping primers (primers binding to multiple genes)
- Checks for cross-pool amplicons (primer pair from different pools)
- Creates a semicolon-separated list of all transcript mappings
Gene name simplification (optional, enabled by default): Gene symbols are processed to remove redundant information and improve readability. For example, gene families like "Gm12345" or immunoglobulin segments are consolidated to avoid cluttered names.

Output Files

The command generates several output files in the specified output directory, saved in the order they are produced during the workflow:

locate_df.csv: Raw output from seqkit locate. Contains all matches of gene-specific primer sequences in the transcriptome. This is the unprocessed mapping result before any processing. Columns:
- seqID: Transcript identifier from FASTA header (pipe-separated format)
- pattern: The gene-specific primer sequence that was searched
- strand: Strand orientation (+ or -)
- start, end: Starting and ending positions of match on transcript (1-based, inclusive)
- matched: The actual matched sequence from the transcript
locate_df_final.csv: Processed version of locate_df with additional metadata joined from the primer info and parsed transcript/gene information. Columns are:
- primer_seq, primer_seq_full: Gene-specific targeting sequence and complete primer (with universal prefix and UMI)
- primer_type: Forward (fwd) or reverse (rev) classification
- transcript_id, gene_id, gene_symbol: Transcript/gene identifiers parsed from seqID (indices 0, 1, and 5)
- start, end: Starting and ending positions of match on transcript (1-based, inclusive)
- strand: Strand orientation (+ or -)
- pool: Pool assignment from IDT order file
This file shows where each primer can bind across the transcriptome with full biological context.
amplicon_all.csv: Complete enumeration of all possible amplicons formed by pairing forward and reverse primers on the same transcript. Each row represents one potential amplicon. Columns are:
- primer_seq_fwd, primer_seq_rev: Forward and reverse primer gene-specific sequences
- primer_seq_fwd_type, primer_seq_rev_type: Primer type classifications (should be fwd and rev)
- transcript_id, gene_id, gene_symbol: Transcript and gene identifiers
- start_up, end_up: Upstream (forward) primer binding site positions on transcript (1-based, inclusive)
- start_down, end_down: Downstream (reverse) primer binding site positions on transcript (1-based, inclusive)
- pool_fwd, pool_rev: Pool assignments for forward and reverse primers
- flipped: Boolean indicating if primer pair orientation was flipped during enumeration
- amplicon_length: Length of amplicon in base pairs (calculated as end_down - start_up + 1)
- pass: Boolean indicating whether the amplicon passes filtering criteria (length range and complementary primer types)
This includes all enumerated amplicons. Filter by pass == True to get only valid amplicons that would be expected in the actual experiment.
grouped.csv: Summary table where amplicons are grouped by unique primer pairs. Each row represents one primer pair with aggregated information across all its amplicons. Columns are:
- primer_seq_fwd, primer_seq_rev: Forward and reverse primer gene-specific sequences
- transcript_id, gene_id, gene_symbol: Representative transcript/gene identifiers (from first amplicon in group sorted by gene_id and transcript_id)
- start_up, end_up: Forward primer binding site positions on representative transcript (1-based, inclusive)
- start_down, end_down: Reverse primer binding site positions on representative transcript (1-based, inclusive)
- pool_fwd, pool_rev: Pool assignments for forward and reverse primers
- num_transcripts, num_genes: Number of unique transcripts and genes amplified by this primer pair
- transcript_id_all: Semicolon-separated list of all transcript-gene mappings for this pair (format: transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)
This shows the specificity of each primer pair, reveals multi-gene amplification, and provides detailed binding site information for each transcript target.
primer_diagnose.pdf: A diagnostic PDF report containing a log-log histogram of amplicon lengths across all enumerated amplicons, with vertical dashed lines indicating the target length range. This visualization helps identify the distribution of amplicon sizes and assess whether the specified range captures the intended products.
summary_primer_to_order.xlsx: Final Excel summary compatible with the primer design workflow format. Contains columns:
- Group, geneSymbol, geneID: Default group assignment, gene symbol(s), and gene ID (without version)
- Chosen Index, amplicon_index: Design index (default 0) and unique identifier (transcript ID without version + chosen index)
- L_seq, R_seq: Forward/reverse primer display sequences (with linker, excluding universal prefix and UMI)
- primer_sequence_to_order_forward, primer_sequence_to_order_reverse: Complete primer sequences as they appear in the order file (with universal prefix, UMI, and linker)
- L_pool, R_pool: Pool assignments for forward and reverse primers
- multi_mapping: Semicolon-separated list of all transcript-gene mappings (format: transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)
i5_short.fasta & i7_short.fasta: FASTA files containing gene-specific sequences for forward (i5) and reverse (i7) primers respectively. These sequences include any linker sequences but exclude the universal TruSeq adapters and UMIs. The number of records matches the number of unique primer sequences in the input order file. These files are suitable for use in downstream SADDLE analysis or dimer prediction.
recover.log: Detailed log file capturing all processing steps, sanity checks, and warnings. The log includes both truncated output (for readability) and full details for comprehensive debugging.

Primer Naming Convention

In the output FASTA files (i5_short.fasta and i7_short.fasta), primer names (record IDs) are constructed based on the genes they amplify:

Single Gene Mapping: If a primer sequence maps to only one gene (e.g., GAPDH), the record ID is simply the gene symbol: GAPDH.
Multi-Gene Mapping: If a primer sequence maps to multiple genes, the record ID is a pipe-separated list of all unique gene symbols in sorted order. For example, a primer targeting both GENE_A and GENE_B would have ID: GENE_A|GENE_B.
Gene Name Simplification: When --simplify-gene-name is enabled (default), redundant tokens in gene symbols are removed. For instance, if a primer maps to both Ighv1-1 and Ighv1-2, these might be collapsed to a family representative like Ighv1 to avoid overly long names.
ID Collision Resolution: If multiple unique primer sequences map to the exact same set of genes (resulting in identical record IDs), a numeric suffix is automatically appended to distinguish them. For example: GAPDH, GAPDH.1, GAPDH.2. This ensures all primers are represented even when they target the same genes.
Missing Mappings: Primers from the input excel file that don't appear in any valid amplicon after filtering (e.g., no transcriptome match, or all amplicons fail length/type filters) are still included in the output FASTA files with ID _NA to ensure completeness. This guarantees that the output FASTA files contain exactly the same number of unique primer sequences as in the input order file.

Gene Name List Consolidation

When a primer maps to multiple genes, the gene name simplification feature (enabled by default with --simplify-gene-name) applies intelligent filtering rules to produce clean, informative primer names. The consolidation follows a two-stage process, controlled by the collapse_families argument (default: True). When collapse_families=False, only Stage 1 (pseudo-like filtering) is applied; otherwise, both stages are applied.

Stage 1: Pseudo-like Gene Filtering

The first stage identifies and removes "pseudo-like" gene symbols—names that are typically uninformative placeholders or annotation artifacts. A gene symbol is considered pseudo-like if it matches any of these patterns:

Database identifiers: Ensembl gene IDs (e.g., ENSG00000173366, ENSMUSG00000026193)
Mouse gene models: Placeholder names like Gm10358, Gm3839
RIKEN cDNA clones: Mouse placeholders ending in Rik (e.g., 1700003D09Rik)
Long non-coding RNA placeholders: Patterns like MMP24OS, FOXP3-AS1, LINC01234, AC123456.1, AL123456.1, RP11-...
Explicit pseudogene markers: Suffixes like -ps, -rs, or RNAxxSPx (e.g., Clca4c-ps, Rn18s-rs5, RNA18SP5)
Pseudogene patterns: Genes ending in digits followed by P (e.g., DEFA10P, CLCA3P, KRT8P1)
- Exception list protects real genes: FOXP1-4, GBP1, ZBP1, GTPBP1, SPP1, DUSP1, AKAP1, TRAP1, RAMP1
Paralog-like variants: Patterns like Klk1b1 where a base ending in a digit is followed by lowercase letters and more digits

Filtering rule: If at least one non-pseudo-like gene symbol exists in the list, all pseudo-like symbols are dropped. If all symbols are pseudo-like, they are all retained to avoid losing all information.

Special cases:

Mitochondrial genes (prefixed with MT- or mt-) are always considered informative and kept
Readthrough genes (format GENE_A-GENE_B) are dropped only if both component genes (GENE_A and GENE_B) also appear separately in the list; otherwise the readthrough symbol is kept as informative

Stage 2: Family Collapsing

After pseudo-like filtering, the second stage collapses gene family members with trivial suffix variants—but only when a clear base gene exists. This prevents overly verbose names while avoiding arbitrary choices:

Paralog families: Patterns like Klk1b1, Klk1b3, Klk1b11, Klk1b14-ps are recognized as family members of base Klk1
- Trivial tails must have form: <base><lowercase_letters><digits> or <base><digits><lowercase_letters>, optionally with -ps suffix
- Base must be at least 4 characters long
Pseudogene families: Patterns like KRT8P1, KRT8P2 are recognized as pseudogene variants of base KRT8
- Pattern: <base_ending_in_digit>P<digits>

Collapsing rule: Family members are collapsed only if the base gene is present in the list. For example:

["Klk1", "Klk1b1", "Klk1b3"] → ["Klk1"] (base present, collapse all variants)
["Klk1b1", "Klk1b3"] → ["Klk1b1", "Klk1b3"] (base absent, keep both variants separately)
["Selenbp1", "Selenbp2"] → ["Selenbp1", "Selenbp2"] (these are distinct paralogs, not trivial variants)

This conservative approach ensures that family members are only merged when there's clear evidence (presence of the base gene) that they represent variants of the same functional unit, avoiding silent loss of information.

Examples

["GAPDH", "ENSG00000111640"] → ["GAPDH"] (Ensembl ID dropped)
["Ppp1r1b", "1700003D09Rik"] → ["Ppp1r1b"] (RIKEN clone dropped)
["DEFA1", "DEFA10P"] → ["DEFA1"] (pseudogene dropped)
["Klk1", "Klk1b1", "Klk1b11", "Klk1b3"] → ["Klk1"] (family collapsed to base)
["KRT8P1", "KRT8P2"] → ["KRT8P1", "KRT8P2"] (no base KRT8, so keep both)
["IFNAR2-IL10RB"] → ["IFNAR2-IL10RB"] (readthrough kept when components absent)
["KLRC4-KLRK1", "KLRK1"] → ["KLRK1"] (readthrough dropped when component present)

This simplification is applied when constructing FASTA record IDs from multi-gene mappings, making primer names more concise and biologically meaningful while preserving essential information.

Understanding Amplicon Counting Output

Foli-seq's amplicon-based design produces count matrices with feature names that differ from traditional bulk RNA-seq or scRNA-seq outputs. Because the protocol uses targeted primers that can amplify multiple amplicons (due to gene family similarity) and multiple primers can target the same transcript (or transcripts of the same gene), the feature naming reflects both gene identity and multi-mapping relationships.

During the pipeline, feature names evolve through several stages. When cutadapt identifies primer adapters in step 2, primer names are embedded into read names (e.g., @READ_ID_ACGTAC_TGCATG_FGR+FGR), creating amplicon-specific identifiers. In step 3, featureCounts assigns reads to gene features and stores results in the XT tag—either single genes (XT:Z:ENSG00000010030) or comma-separated lists for multi-mapping reads (XT:Z:ENSG00000010030,ENSG00000285441). The pipeline then combines these gene assignments with primer information to create composite features in the XF tag, such as ENSG00000010030,FGR+FGR (single gene amplified by FGR primers) or ENSG00000010030,ENSG00000285441,FGR+FGR (multi-mapping).

umi_tools group deduplicates reads by grouping on UMI sequence, cell barcode, and the gene-with-primer assignment from the XF tag. Finally, the read_counts function aggregates these amplicon-level counts to gene-level by stripping primer suffixes from feature names. When multiple primer pairs target the same transcript or transcripts of the same gene, their counts are averaged. Multi-mapping reads remain as pipe-separated features (e.g., GENE_A|GENE_B). If a GTF file is provided, gene IDs are converted to symbols while preserving multi-mapping relationships (e.g., ENSG00000010030 → FGR, or ENSG00000111|ENSG00000222 → Klk1b1|Klk1b3). For multi-gene features, the same gene name simplification logic described in the "Gene Name List Consolidation" section above is applied to remove pseudo-like genes and collapse family variants, ensuring consistent naming between primer pool validation and count matrix generation.

In summary, the resulting count matrix has samples as rows and "enhanced" gene features as columns. Single-mapping genes appear as simple gene symbols (FGR, GAPDH), while multi-mapping gene families are indicated with pipe-separated names (BCL9L|CXCR5). Thus, the values represent UMI-deduplicated read counts where each UMI corresponds to one original mRNA molecule, and multiple amplicons targeting the same gene are averaged. This design handles amplicon multiplicity through count aggregation and preserves biological ambiguity due to sequence similarity, providing gene-level quantification suitable for differential expression analysis.

TODOs

Add QC report code that summarizes all metrics into a table (need refactoring)

Project details

Release history Release notifications | RSS feed

0.7.0

May 3, 2026

0.6.1

Apr 23, 2026

0.6.0

Apr 23, 2026

0.5.0

Apr 22, 2026

0.4.1

Apr 22, 2026

This version

0.4.0

Apr 22, 2026

0.3.2

Apr 21, 2026

0.3.1

Apr 18, 2026

0.3.0

Apr 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

folitools-0.4.0.tar.gz (108.5 kB view details)

Uploaded Apr 22, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

folitools-0.4.0-py3-none-any.whl (118.0 kB view details)

Uploaded Apr 22, 2026 Python 3

File details

Details for the file folitools-0.4.0.tar.gz.

File metadata

Download URL: folitools-0.4.0.tar.gz
Upload date: Apr 22, 2026
Size: 108.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for folitools-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`7b34c9de9cc96e5f342780155e1d75a1c00c30789270cd1ee86efbac459229f9`
MD5	`2c86da31efc888e5820507d655e73cad`
BLAKE2b-256	`bd2e5de3421fa2b36d052c82379ccc0bcdd87680b2c9bb456746fb95ce2f5280`

See more details on using hashes here.

Provenance

The following attestation bundles were made for folitools-0.4.0.tar.gz:

Publisher: publish.yml on whatever60/folitools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: folitools-0.4.0.tar.gz
- Subject digest: 7b34c9de9cc96e5f342780155e1d75a1c00c30789270cd1ee86efbac459229f9
- Sigstore transparency entry: 1356144209
- Sigstore integration time: Apr 22, 2026
Source repository:
- Permalink: whatever60/folitools@4635bb151921f49e4fca402750073e869a8fd50c
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/whatever60
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4635bb151921f49e4fca402750073e869a8fd50c
- Trigger Event: push

File details

Details for the file folitools-0.4.0-py3-none-any.whl.

File metadata

Download URL: folitools-0.4.0-py3-none-any.whl
Upload date: Apr 22, 2026
Size: 118.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for folitools-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52550e7cbb8fc13419019e7dfff01f268fcb4d8b682d6558693a02c28555e7f6`
MD5	`b778c4c7145d1ec487558fa963539bb3`
BLAKE2b-256	`cd7c7f4656f272cbd4ac0c7fec618459f8ee6a70e51fd357885f0a1aea7b09c0`

See more details on using hashes here.

Provenance

The following attestation bundles were made for folitools-0.4.0-py3-none-any.whl:

Publisher: publish.yml on whatever60/folitools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: folitools-0.4.0-py3-none-any.whl
- Subject digest: 52550e7cbb8fc13419019e7dfff01f268fcb4d8b682d6558693a02c28555e7f6
- Sigstore transparency entry: 1356144251
- Sigstore integration time: Apr 22, 2026
Source repository:
- Permalink: whatever60/folitools@4635bb151921f49e4fca402750073e869a8fd50c
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/whatever60
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4635bb151921f49e4fca402750073e869a8fd50c
- Trigger Event: push

folitools 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

folitools

Installation

Cutadapt fork

Usage

Expected Inputs

Step 1. Preprocessing

Step 2. Probe assignment

Step 3. Mapping and Feature Counting

Step 4. UMI-based Gene Counting

Primer Selection Functionality

Workflow

Recover

Dimer Evaluation (through Python)

Testing

In silico amplification for primer pool evaluation

Output Files

Primer Naming Convention

Gene Name List Consolidation

Stage 1: Pseudo-like Gene Filtering

Stage 2: Family Collapsing

Examples

Understanding Amplicon Counting Output

TODOs

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance