Foli-seq data processing and analysis tools
Project description
folitools
folitools is a modular CLI toolkit and python library for Foli-seq data processing.
Foli-seq is a amplicon-based high-throughput sequencing method for profiling host gene expression from feces. Foli-seq is developed to probe amplicons of 320-380bp and sequence on PE150 Illumina sequencers.
Installation
Install from PyPI:
pip install folitools
The dependencies from conda are also required (note: cutadapt is intentionally
not in this list — see below):
conda install -c bioconda -c conda-forge \
fastp samtools bwa-mem2 star seqkit fastqc subread sambamba pigz gcc
Cutadapt fork
Starting with folitools 0.4.0, cutadapt is installed from our fork,
published on PyPI as cutadapt-folitools (source:
whatever60/cutadapt@folitools-perf). The fork installs the same
cutadapt Python module and cutadapt console script as upstream, so
nothing in your own tooling needs to change. It adds performance
improvements that matter for the foli assign-probes workload (hundreds
of i5/i7 primers, paired-end gzipped FASTQ inputs):
SeedMultiAdapterFilter— a seed-and-extend pre-filter for large non-anchored multi-adapter groups. Match objects are bit-identical to upstream cutadapt's per-adapter iteration; only the speed changes. Auto-activates above 8 eligible adapters in a group.--input-compression-threads N— opt-in parallel decompression of gzipped inputs viapigz.NPrefixIndexedAdapters— a dormant class that offers a fast path for^N{k}<body>anchored adapters. Not wired into the auto path; reserved for future opt-in use.
End-to-end on a representative 3.1 M read-pair sample with 542×2 primers
(-j 16), the combined effect of the fork plus the xopen/pigz-backed
writer in add_umi cuts one sample from ~7:30 wall time to ~2:45
(~2.7× faster).
cutadapt-folitools is published as an sdist, so pip install folitools
will compile cutadapt from source. A C compiler is required — the
gcc entry in the conda command above covers it on Linux. On macOS the
Xcode command-line tools suffice.
If you already have the upstream cutadapt wheel installed from PyPI
it will be replaced by cutadapt-folitools when you install folitools,
because the two distributions install the same cutadapt module and
console script. Don't install both side-by-side.
If you would prefer the upstream cutadapt wheel on PyPI, you can skip the fork with:
pip install --no-deps folitools
pip install "cutadapt>=5.1" <other-deps>
but the foli assign-probes stage will be slower.
Usage
Each stage of the pipeline is exposed as a command via foli <subcommand>.
A note on input paths: All commands that accept input paths through the --input parameter allow files locations of absolute paths, relative paths, or glob patterns. Multiple paths/patterns are also allowed.
A note on paired-end files: Foli-seq uses paired-end sequencing. When inputing FASTQ files, only R1 path is given to the command and the corresponding R2 files are automatically derived by replacing the R1 pattern with R2 pattern (e.g., _R1_ → _R2_, _1 → _2). Both R1 and R2 files are required for the pipeline to work.
Expected Inputs
- Raw paired-end FASTQ files (following the naming pattern
*_R1_*.fastq.gzand*_R2_*.fastq.gz) - Adapter FASTA files
- STAR genome index directory
- GTF annotation file
Output directories are automatically created if they don't exist.
Step 1. Preprocessing
foli qc \
--input "/path/to/fastq/*_R1_*.fastq.gz" \
--output-dir "/path/to/trimmed_reads"
This command performs read trimming using fastp and runs quality check on output files with fastqc and seqkit.
What happens under the hood:
fastpperforms quality-based trimming with tail trimming and error correctionfastqcgenerates quality control reports for the trimmed readsseqkit statscompiles summary statistics for all output files
You can specify input FASTQ files from any location using absolute paths, relative paths, or glob patterns. The paired-end R2 files are automatically derived from R1 files by replacing _R1_ with _R2_ (or _1 with _2).
Output files (in the specified output directory):
{sample_name}_1.fq.gzand{sample_name}_2.fq.gz: Trimmed paired-end reads{sample_name}.fastp.htmland{sample_name}.fastp.json: fastp QC reports- FastQC reports in
{output_dir}_fastqc/ {output_dir}.stats: Summary statistics table generated by seqkit
Step 2. Probe assignment
foli assign-probes \
--input "/path/to/trimmed_reads/*_1.fq.gz" \
--output-dir "/path/to/umi_tagged" \
--i5 "/path/to/i5_short.fasta" \
--i7 "/path/to/i7_short.fasta"
This command performs gene probe primer assignment and trimming using cutadapt. This step extracts UMI sequences from adapter matches and embeds them in read names for downstream processing.
What happens under the hood:
cutadaptidentifies i5 and i7 adapters in the reads (minimum 20bp overlap, 10% error rate)- Adapter sequences and primer names are added to read headers
- A custom Python script (
folitools.add_umi) parses the adapter information and embeds UMI sequences into read names in a structured format seqkit statsgenerates summary statistics for the output files
After probe assignment, you can get read statistics:
foli get-read-stats \
--input "/path/to/umi_tagged/*_1.fq.gz" \
--output-dir "/path/to/read_stats" \
--overwrite
This analyzes the UMI-tagged reads to generate detailed statistics about primer assignment and read characteristics.
Output files:
{sample_name}_1.fq.gzand{sample_name}_2.fq.gz: UMI-tagged, adapter-trimmed reads with UMI information embedded in read names{output_dir}.stats: Summary statistics table generated by seqkit- Read statistics:
{sample_name}.parquetfiles in the stats output directory (when usingfoli get-read-stats)
Step 3. Mapping and Feature Counting
foli map \
--input "/path/to/umi_tagged/*_1.fq.gz" \
--output-bam "/path/to/mapped_reads" \
--output-star "/path/to/star" \
--star-index "/path/to/star_index" \
--gtf "/path/to/annotation.gtf"
This command performs streamlined alignment and feature counting using STAR and featureCounts.
What happens under the hood:
- STAR index preloading: The genome index is loaded into shared memory once for all samples to improve processing speed
- Read filtering: Short reads (< 60bp, considered primer dimers) are removed using
cutadapt - Alignment: Filtered reads are aligned to the genome using
STARwith:- Support for up to 1000 multi-mapping locations
- Unmapped reads included in output (
--outSAMunmapped Within) - Chimeric alignments detected and marked (
--chimOutType WithinBAM)
- Feature assignment:
featureCountsassigns aligned reads to genomic features from the GTF file - BAM tagging and sorting:
- UMI sequences are added as BAM tags (
USfor raw UMI,UCfor filtered UMI) - Cell barcodes added as
CBtag - Gene assignments added as
XFtag - BAM files are coordinate-sorted using
sambamba
- UMI sequences are added as BAM tags (
Core allocation: The --cores parameter is automatically split between STAR (70%) and featureCounts (30%) to optimize pipeline throughput.
For fractional counting of reads that overlap multiple features, you can use the --allow-overlap and --allow-multimapping flags:
foli map \
--input "/path/to/umi_tagged/*_1.fq.gz" \
--output-bam "/path/to/mapped_reads" \
--output-star "/path/to/star" \
--star-index "/path/to/star_index" \
--gtf "/path/to/annotation.gtf" \
--allow-overlap \
--allow-multimapping
These flags enable fractional counting where reads overlapping multiple features or mapping to multiple locations contribute fractionally to each feature.
Output files:
- STAR alignment files in
{output_star}/{sample_name}/:Aligned.out.bam: Unsorted alignment fileLog.final.out,Log.out,Log.progress.out: STAR log filesReadsPerGene.out.tab: Gene counts from STAR
- Sorted, tagged BAM files in
{output_bam}/:{sample_name}.sorted.bam: Final sorted BAM with UMI and gene tags{sample_name}.sorted.bam.bai: BAM index file
featureCountsresults in{output_bam}/:{sample_name}.featureCounts: Read-feature assignments{sample_name}.featureCounts.summary: Assignment statistics
Step 4. UMI-based Gene Counting
foli count \
--input "/path/to/mapped_reads/*.sorted.bam" \
--output-dir "/path/to/count_results"
This command processes BAM files using umi_tools group to generate UMI-deduplicated count data.
What happens under the hood:
- UMI grouping:
umi_tools groupidentifies and groups reads with:- The same UMI sequence (extracted from
UCtag) - The same cell barcode (
CBtag) - The same gene assignment (
XFtag)
- The same UMI sequence (extracted from
- Read handling:
- Paired-end reads are processed together
- Unmapped reads, chimeric pairs, and unpaired reads are all retained in the output
- Only primary alignments are used for grouping (secondary/supplementary alignments are ignored)
- Deduplication method: Uses the "unique" method where only reads with identical UMI sequences are grouped together
- Output generation: Produces detailed grouping information showing which reads were collapsed into each UMI group
Input BAM files should contain UMI tags (UC), cell tags (CB), and gene assignment tags (XF) from the previous mapping step.
After counting, generate the final count matrix:
foli get-count-mtx \
--input "/path/to/count_results/*.group.tsv.gz" \
--output "/path/to/final_counts.tsv" \
--gtf "/path/to/annotation.gtf"
What happens under the hood:
- Reads all UMI grouping tables from
umi_tools groupoutput - Counts unique UMI groups per gene per sample
- If GTF is provided, maps gene IDs to gene symbols
- Generates a gene × sample count matrix
You can also use the Python function directly:
from folitools.get_matrix import read_counts
df = read_counts("/path/to/input_files", "/path/to/annotation.gtf")
df.to_csv("/path/to/output.tsv", sep="\t", index_label="gene")
Output files:
{sample_name}.group.tsv.gz: UMI grouping tables with detailed information about which reads were grouped together{sample_name}.group.log: Processing logs showing statistics for each sample- Final count matrix: Gene × sample matrix in TSV format with gene symbols (if GTF provided) or gene IDs
Primer Selection Functionality
The primer selection module provides tools for designing and recovering PCR primer sets for targeted amplicon sequencing experiments.
Workflow
The primer selection workflow provides a complete pipeline for primer design:
# Run complete primer design workflow using built-in reference
foli-primer workflow \
--genes "genes.tsv" \
--species "mouse" \
--output-dir "primer_design/"
# Or use custom transcriptome FASTA for better coverage
foli-primer workflow \
--genes "genes.tsv" \
--txome-fasta "/path/to/transcriptome.fasta" \
--output-dir "primer_design/"
Input: Gene table TSV with columns gene and group
Output: Complete primer design including optimized primer sets, amplicon sequences, and IDT-compatible ordering files
Note: You can use either --species (mouse/human) for built-in references or --txome-fasta for custom transcriptome files. Custom FASTA files often provide better gene coverage than the built-in references.
Recover
The recover functionality helps validate and analyze primer sets from IDT order files:
# Recover primer information using built-in reference
foli-primer recover \
--order-excel "idt_order.xlsx" \
--output-dir "recovered_output/" \
--species "human"
# Or use custom transcriptome FASTA (recommended for better coverage)
foli-primer recover \
--order-excel "idt_order.xlsx" \
--output-dir "recovered_output/" \
--txome-fasta "/path/to/transcriptome.fasta"
This command analyzes primer sequences from an IDT order file, validates them against a reference transcriptome, and generates output files for downstream analysis. The output FASTA files (i5_short.fasta and i7_short.fasta) will contain the same number of unique primer sequences as found in the input Excel file, ensuring all original primers are represented.
Note: Using --txome-fasta with a custom transcriptome file is recommended over the built-in --species references as it typically provides better gene coverage than the packaged Gencode references.
You can also specify the amplicon length range, and whether the primers have linkers:
foli-primer recover \
--order-excel "idt_order.xlsx" \
--output-dir "recovered_output/" \
--txome-fasta "/path/to/transcriptome.fasta" \
--has-linker \
--amplicon-length-range 300 400
Input: IDT order Excel file with primer sequences
Output:
summary_primer_to_order.xlsx: Validated primer summary with amplicon informationprimer_diagnose.pdf: PDF report with analysis plots and statisticsi5_short.fastaandi7_short.fasta: FASTA files for sequencing adapters
Dimer Evaluation (through Python)
A primer set can be evaluated by calculating the thermodynamic properties of potential primer dimers for each pair of primers.
from folitools.primer_selection.eval_dimer import dimer_thermo_property
# Evaluate primer-dimer interactions
result_matrix = dimer_thermo_property(
primer_fwd_fasta="forward_primers.fasta",
primer_rev_fasta="reverse_primers.fasta",
output_dir="dimer_analysis/",
output_suffix="_analysis"
)
Input: FASTA files containing forward and reverse primer sequences
Output:
- Pairwise interaction matrix with thermodynamic properties
- Detailed analysis files in the output directory
Testing
Run all tests:
pytest
Run only shell script tests:
pytest tests/test_shell_scripts.py -v
Run tests with coverage:
pytest --cov=src/folitools --cov-report=html
In silico amplification for primer pool evaluation
Given a list of primer sequences used in foli-seq (with transcript-specific sequences, TruSeq R1/R2 overhang and with or without linker sequences), we validate the primer pool with de novo in silico PCR, without prior knowledge of the target genes, so that the upstream gene selection & panel design is decoupled from the evaluation here.
This functionality helps you understand which genes are actually targeted by your primer pool and identify potential issues like off-target binding or missing amplicons.
The workflow consists of the following steps:
-
Primer parsing and classification: The input IDT order Excel file (first two columns: pool name and primer sequence) is parsed to extract primer sequences. Each primer is classified as forward or reverse based on matching against known universal prefixes:
- Forward:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT(TruSeq R1 adapter) +NNNNNN(UMI) + optionalACATCAlinker - Reverse:
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT(TruSeq R2 adapter) +NNNNNN(UMI) + optionalATAGTTlinker
After classification, the universal prefix and UMI are stripped to extract the gene-specific targeting sequence for each primer.
- Forward:
-
Transcriptome mapping: Using
seqkit locate, all gene-specific primer sequences are mapped against the provided transcriptome FASTA to identify where each primer can bind. This uses exact matching (no mismatches) to find all potential binding sites across all transcripts. -
Amplicon enumeration: Forward and reverse primer hits are systematically paired on each transcript to enumerate all possible amplicons. For a valid amplicon:
- Forward primer must match the transcript sequence (sense strand)
- Reverse primer must match the reverse complement (antisense strand)
- Primers must be on opposite strands with proper orientation
- Forward primer position must be upstream of reverse primer position
-
Amplicon filtering: The enumerated amplicons are filtered based on:
- Length constraints (default 320-380 bp, configurable via
--amplicon-length-range) - Primer type validation (must be complementary R1 + R2 pair)
Amplicons outside the target length range or with incorrect primer type combinations are excluded.
- Length constraints (default 320-380 bp, configurable via
-
Grouping and summarization: Valid amplicons are grouped by unique primer pairs. For each pair, the function:
- Aggregates all transcripts and genes that can be amplified
- Detects multi-mapping primers (primers binding to multiple genes)
- Checks for cross-pool amplicons (primer pair from different pools)
- Creates a semicolon-separated list of all transcript mappings
-
Gene name simplification (optional, enabled by default): Gene symbols are processed to remove redundant information and improve readability. For example, gene families like "Gm12345" or immunoglobulin segments are consolidated to avoid cluttered names.
Output Files
The command generates several output files in the specified output directory, saved in the order they are produced during the workflow:
-
locate_df.csv: Raw output fromseqkit locate. Contains all matches of gene-specific primer sequences in the transcriptome. This is the unprocessed mapping result before any processing. Columns:seqID: Transcript identifier from FASTA header (pipe-separated format)pattern: The gene-specific primer sequence that was searchedstrand: Strand orientation (+or-)start,end: Starting and ending positions of match on transcript (1-based, inclusive)matched: The actual matched sequence from the transcript
-
locate_df_final.csv: Processed version oflocate_dfwith additional metadata joined from the primer info and parsed transcript/gene information. Columns are:primer_seq,primer_seq_full: Gene-specific targeting sequence and complete primer (with universal prefix and UMI)primer_type: Forward (fwd) or reverse (rev) classificationtranscript_id,gene_id,gene_symbol: Transcript/gene identifiers parsed from seqID (indices 0, 1, and 5)start,end: Starting and ending positions of match on transcript (1-based, inclusive)strand: Strand orientation (+or-)pool: Pool assignment from IDT order file
This file shows where each primer can bind across the transcriptome with full biological context.
-
amplicon_all.csv: Complete enumeration of all possible amplicons formed by pairing forward and reverse primers on the same transcript. Each row represents one potential amplicon. Columns are:primer_seq_fwd,primer_seq_rev: Forward and reverse primer gene-specific sequencesprimer_seq_fwd_type,primer_seq_rev_type: Primer type classifications (should befwdandrev)transcript_id,gene_id,gene_symbol: Transcript and gene identifiersstart_up,end_up: Upstream (forward) primer binding site positions on transcript (1-based, inclusive)start_down,end_down: Downstream (reverse) primer binding site positions on transcript (1-based, inclusive)pool_fwd,pool_rev: Pool assignments for forward and reverse primersflipped: Boolean indicating if primer pair orientation was flipped during enumerationamplicon_length: Length of amplicon in base pairs (calculated asend_down - start_up + 1)pass: Boolean indicating whether the amplicon passes filtering criteria (length range and complementary primer types)
This includes all enumerated amplicons. Filter by
pass == Trueto get only valid amplicons that would be expected in the actual experiment. -
grouped.csv: Summary table where amplicons are grouped by unique primer pairs. Each row represents one primer pair with aggregated information across all its amplicons. Columns are:primer_seq_fwd,primer_seq_rev: Forward and reverse primer gene-specific sequencestranscript_id,gene_id,gene_symbol: Representative transcript/gene identifiers (from first amplicon in group sorted by gene_id and transcript_id)start_up,end_up: Forward primer binding site positions on representative transcript (1-based, inclusive)start_down,end_down: Reverse primer binding site positions on representative transcript (1-based, inclusive)pool_fwd,pool_rev: Pool assignments for forward and reverse primersnum_transcripts,num_genes: Number of unique transcripts and genes amplified by this primer pairtranscript_id_all: Semicolon-separated list of all transcript-gene mappings for this pair (format:transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)
This shows the specificity of each primer pair, reveals multi-gene amplification, and provides detailed binding site information for each transcript target.
-
primer_diagnose.pdf: A diagnostic PDF report containing a log-log histogram of amplicon lengths across all enumerated amplicons, with vertical dashed lines indicating the target length range. This visualization helps identify the distribution of amplicon sizes and assess whether the specified range captures the intended products. -
summary_primer_to_order.xlsx: Final Excel summary compatible with the primer design workflow format. Contains columns:Group,geneSymbol,geneID: Default group assignment, gene symbol(s), and gene ID (without version)Chosen Index,amplicon_index: Design index (default 0) and unique identifier (transcript ID without version + chosen index)L_seq,R_seq: Forward/reverse primer display sequences (with linker, excluding universal prefix and UMI)primer_sequence_to_order_forward,primer_sequence_to_order_reverse: Complete primer sequences as they appear in the order file (with universal prefix, UMI, and linker)L_pool,R_pool: Pool assignments for forward and reverse primersmulti_mapping: Semicolon-separated list of all transcript-gene mappings (format:transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length)
-
i5_short.fasta&i7_short.fasta: FASTA files containing gene-specific sequences for forward (i5) and reverse (i7) primers respectively. These sequences include any linker sequences but exclude the universal TruSeq adapters and UMIs. The number of records matches the number of unique primer sequences in the input order file. These files are suitable for use in downstream SADDLE analysis or dimer prediction. -
recover.log: Detailed log file capturing all processing steps, sanity checks, and warnings. The log includes both truncated output (for readability) and full details for comprehensive debugging.
Primer Naming Convention
In the output FASTA files (i5_short.fasta and i7_short.fasta), primer names (record IDs) are constructed based on the genes they amplify:
-
Single Gene Mapping: If a primer sequence maps to only one gene (e.g.,
GAPDH), the record ID is simply the gene symbol:GAPDH. -
Multi-Gene Mapping: If a primer sequence maps to multiple genes, the record ID is a pipe-separated list of all unique gene symbols in sorted order. For example, a primer targeting both
GENE_AandGENE_Bwould have ID:GENE_A|GENE_B. -
Gene Name Simplification: When
--simplify-gene-nameis enabled (default), redundant tokens in gene symbols are removed. For instance, if a primer maps to bothIghv1-1andIghv1-2, these might be collapsed to a family representative likeIghv1to avoid overly long names. -
ID Collision Resolution: If multiple unique primer sequences map to the exact same set of genes (resulting in identical record IDs), a numeric suffix is automatically appended to distinguish them. For example:
GAPDH,GAPDH.1,GAPDH.2. This ensures all primers are represented even when they target the same genes. -
Missing Mappings: Primers from the input excel file that don't appear in any valid amplicon after filtering (e.g., no transcriptome match, or all amplicons fail length/type filters) are still included in the output FASTA files with ID
_NAto ensure completeness. This guarantees that the output FASTA files contain exactly the same number of unique primer sequences as in the input order file.
Gene Name List Consolidation
When a primer maps to multiple genes, the gene name simplification feature (enabled by default with --simplify-gene-name) applies intelligent filtering rules to produce clean, informative primer names. The consolidation follows a two-stage process, controlled by the collapse_families argument (default: True). When collapse_families=False, only Stage 1 (pseudo-like filtering) is applied; otherwise, both stages are applied.
Stage 1: Pseudo-like Gene Filtering
The first stage identifies and removes "pseudo-like" gene symbols—names that are typically uninformative placeholders or annotation artifacts. A gene symbol is considered pseudo-like if it matches any of these patterns:
- Database identifiers: Ensembl gene IDs (e.g.,
ENSG00000173366,ENSMUSG00000026193) - Mouse gene models: Placeholder names like
Gm10358,Gm3839 - RIKEN cDNA clones: Mouse placeholders ending in
Rik(e.g.,1700003D09Rik) - Long non-coding RNA placeholders: Patterns like
MMP24OS,FOXP3-AS1,LINC01234,AC123456.1,AL123456.1,RP11-... - Explicit pseudogene markers: Suffixes like
-ps,-rs, orRNAxxSPx(e.g.,Clca4c-ps,Rn18s-rs5,RNA18SP5) - Pseudogene patterns: Genes ending in digits followed by
P(e.g.,DEFA10P,CLCA3P,KRT8P1)- Exception list protects real genes:
FOXP1-4,GBP1,ZBP1,GTPBP1,SPP1,DUSP1,AKAP1,TRAP1,RAMP1
- Exception list protects real genes:
- Paralog-like variants: Patterns like
Klk1b1where a base ending in a digit is followed by lowercase letters and more digits
Filtering rule: If at least one non-pseudo-like gene symbol exists in the list, all pseudo-like symbols are dropped. If all symbols are pseudo-like, they are all retained to avoid losing all information.
Special cases:
- Mitochondrial genes (prefixed with
MT-ormt-) are always considered informative and kept - Readthrough genes (format
GENE_A-GENE_B) are dropped only if both component genes (GENE_AandGENE_B) also appear separately in the list; otherwise the readthrough symbol is kept as informative
Stage 2: Family Collapsing
After pseudo-like filtering, the second stage collapses gene family members with trivial suffix variants—but only when a clear base gene exists. This prevents overly verbose names while avoiding arbitrary choices:
- Paralog families: Patterns like
Klk1b1,Klk1b3,Klk1b11,Klk1b14-psare recognized as family members of baseKlk1- Trivial tails must have form:
<base><lowercase_letters><digits>or<base><digits><lowercase_letters>, optionally with-pssuffix - Base must be at least 4 characters long
- Trivial tails must have form:
- Pseudogene families: Patterns like
KRT8P1,KRT8P2are recognized as pseudogene variants of baseKRT8- Pattern:
<base_ending_in_digit>P<digits>
- Pattern:
Collapsing rule: Family members are collapsed only if the base gene is present in the list. For example:
["Klk1", "Klk1b1", "Klk1b3"]→["Klk1"](base present, collapse all variants)["Klk1b1", "Klk1b3"]→["Klk1b1", "Klk1b3"](base absent, keep both variants separately)["Selenbp1", "Selenbp2"]→["Selenbp1", "Selenbp2"](these are distinct paralogs, not trivial variants)
This conservative approach ensures that family members are only merged when there's clear evidence (presence of the base gene) that they represent variants of the same functional unit, avoiding silent loss of information.
Examples
["GAPDH", "ENSG00000111640"]→["GAPDH"](Ensembl ID dropped)["Ppp1r1b", "1700003D09Rik"]→["Ppp1r1b"](RIKEN clone dropped)["DEFA1", "DEFA10P"]→["DEFA1"](pseudogene dropped)["Klk1", "Klk1b1", "Klk1b11", "Klk1b3"]→["Klk1"](family collapsed to base)["KRT8P1", "KRT8P2"]→["KRT8P1", "KRT8P2"](no baseKRT8, so keep both)["IFNAR2-IL10RB"]→["IFNAR2-IL10RB"](readthrough kept when components absent)["KLRC4-KLRK1", "KLRK1"]→["KLRK1"](readthrough dropped when component present)
This simplification is applied when constructing FASTA record IDs from multi-gene mappings, making primer names more concise and biologically meaningful while preserving essential information.
Understanding Amplicon Counting Output
Foli-seq's amplicon-based design produces count matrices with feature names that differ from traditional bulk RNA-seq or scRNA-seq outputs. Because the protocol uses targeted primers that can amplify multiple amplicons (due to gene family similarity) and multiple primers can target the same transcript (or transcripts of the same gene), the feature naming reflects both gene identity and multi-mapping relationships.
During the pipeline, feature names evolve through several stages. When cutadapt identifies primer adapters in step 2, primer names are embedded into read names (e.g., @READ_ID_ACGTAC_TGCATG_FGR+FGR), creating amplicon-specific identifiers. In step 3, featureCounts assigns reads to gene features and stores results in the XT tag—either single genes (XT:Z:ENSG00000010030) or comma-separated lists for multi-mapping reads (XT:Z:ENSG00000010030,ENSG00000285441). The pipeline then combines these gene assignments with primer information to create composite features in the XF tag, such as ENSG00000010030,FGR+FGR (single gene amplified by FGR primers) or ENSG00000010030,ENSG00000285441,FGR+FGR (multi-mapping).
umi_tools group deduplicates reads by grouping on UMI sequence, cell barcode, and the gene-with-primer assignment from the XF tag. Finally, the read_counts function aggregates these amplicon-level counts to gene-level by stripping primer suffixes from feature names. When multiple primer pairs target the same transcript or transcripts of the same gene, their counts are averaged. Multi-mapping reads remain as pipe-separated features (e.g., GENE_A|GENE_B). If a GTF file is provided, gene IDs are converted to symbols while preserving multi-mapping relationships (e.g., ENSG00000010030 → FGR, or ENSG00000111|ENSG00000222 → Klk1b1|Klk1b3). For multi-gene features, the same gene name simplification logic described in the "Gene Name List Consolidation" section above is applied to remove pseudo-like genes and collapse family variants, ensuring consistent naming between primer pool validation and count matrix generation.
In summary, the resulting count matrix has samples as rows and "enhanced" gene features as columns. Single-mapping genes appear as simple gene symbols (FGR, GAPDH), while multi-mapping gene families are indicated with pipe-separated names (BCL9L|CXCR5). Thus, the values represent UMI-deduplicated read counts where each UMI corresponds to one original mRNA molecule, and multiple amplicons targeting the same gene are averaged. This design handles amplicon multiplicity through count aggregation and preserves biological ambiguity due to sequence similarity, providing gene-level quantification suitable for differential expression analysis.
TODOs
- Add QC report code that summarizes all metrics into a table (need refactoring)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file folitools-0.4.0.tar.gz.
File metadata
- Download URL: folitools-0.4.0.tar.gz
- Upload date:
- Size: 108.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7b34c9de9cc96e5f342780155e1d75a1c00c30789270cd1ee86efbac459229f9
|
|
| MD5 |
2c86da31efc888e5820507d655e73cad
|
|
| BLAKE2b-256 |
bd2e5de3421fa2b36d052c82379ccc0bcdd87680b2c9bb456746fb95ce2f5280
|
Provenance
The following attestation bundles were made for folitools-0.4.0.tar.gz:
Publisher:
publish.yml on whatever60/folitools
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
folitools-0.4.0.tar.gz -
Subject digest:
7b34c9de9cc96e5f342780155e1d75a1c00c30789270cd1ee86efbac459229f9 - Sigstore transparency entry: 1356144209
- Sigstore integration time:
-
Permalink:
whatever60/folitools@4635bb151921f49e4fca402750073e869a8fd50c -
Branch / Tag:
refs/tags/v0.4.0 - Owner: https://github.com/whatever60
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4635bb151921f49e4fca402750073e869a8fd50c -
Trigger Event:
push
-
Statement type:
File details
Details for the file folitools-0.4.0-py3-none-any.whl.
File metadata
- Download URL: folitools-0.4.0-py3-none-any.whl
- Upload date:
- Size: 118.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
52550e7cbb8fc13419019e7dfff01f268fcb4d8b682d6558693a02c28555e7f6
|
|
| MD5 |
b778c4c7145d1ec487558fa963539bb3
|
|
| BLAKE2b-256 |
cd7c7f4656f272cbd4ac0c7fec618459f8ee6a70e51fd357885f0a1aea7b09c0
|
Provenance
The following attestation bundles were made for folitools-0.4.0-py3-none-any.whl:
Publisher:
publish.yml on whatever60/folitools
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
folitools-0.4.0-py3-none-any.whl -
Subject digest:
52550e7cbb8fc13419019e7dfff01f268fcb4d8b682d6558693a02c28555e7f6 - Sigstore transparency entry: 1356144251
- Sigstore integration time:
-
Permalink:
whatever60/folitools@4635bb151921f49e4fca402750073e869a8fd50c -
Branch / Tag:
refs/tags/v0.4.0 - Owner: https://github.com/whatever60
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@4635bb151921f49e4fca402750073e869a8fd50c -
Trigger Event:
push
-
Statement type: