Skip to main content

A utility for fitting lineage fitnesses within a pooled competition experiment, achieved by iteratively optimizing models of individual and population-average lineage fitness.

Project description

Python 3.9 License: MIT

fitseq

Accurate pooled competition assays requires accounting for the changing population-average fitness in order toe best estimate the fitness of lineages within the pool. This utility does both by iterating between optimizations of per-lineage fitness given the average, and calculating the new average fitness.

The core concept and architecture is written by FangFei Li (@FangFei05). This fork is just tweaking the interface and metrics for my personal use, but if anyone else wants to use it I can offer limited help.

On a recent datasets of five timepoints for ~3.5 million lineages, fitseq was finished within 4.5 hours (wall), using 20 cores and 4GB of RAM.

If you use this software, please reference: F. Li, et al. Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies. Cell Systems, 7: 521-525 (2018)

Installation

With pip

You can install from this git repo directly as:

python3 -m pip install git+https://github.com/darachm/fitseq.git

Install the latest development branch with something like:

python3 -m pip install git+https://github.com/darachm/fitseq.git@dev

Test installation with:

fitseq.py -h

Or don't install, use a container

Docker

This repository has a Dockerfile to build off the main branch of this repo, like so

docker build -t darachm/fitseq ./ 

Or you should be able to pull it off of Dockerhub like:

docker run darachm/fitseq:latest fitseq.py -h

You can then use that, with a Docker installation like so:

docker run \
    --mount type=bind,source=$(pwd)/testing,target=/testing \
    darachm/fitseq \
    fitseq.py \
        -i testing/data/ppiseq_test_counts_1000.csv \
        -p 8 -t 0 1 2 3 4 \
        -m 20 --min-step 0.001 \
        --output-mean-fitness testing/output/test_means.csv \
        -o testing/output/test_out.csv

Note that you need to --mount the directory with the inputs in a folder, and use that in the command. Directories on containers... yeah.

I think Singularity is more intuitive/accessible for most folks...

Singularity

On a multi-user HPC? Want to get an achive-ready monolithic stable container? Singularity is a container system for scientific multi-user HPC computing and archiving. You can build your own container from the Singularity file in this repo using a command like:

sudo singularity build fitseq.sif Singularity.fitseq

( This is just turning the docker image into a Singularity image. Just so you know. )

Then put it where you need it, and run with something like:

singularity exec fitseq.sif \
    fitseq.py \
        -i testing/data/ppiseq_test_counts_1000.csv \
        -p 8 -t 0 1 2 3 4 \
        -m 20 --min-step 0.001 \
        --output-mean-fitness testing/output/test_means.csv \
        -o testing/output/test_out.csv

Usage

The fitseq.py script functions to estimate fitnesses of lineages in a pool. There is also a script evo_simulator.py that can perform simulations of competitive pooled growth of lineages, in order to generate synthetic data for benchmarking.

fitseq.py - estimate fitnesses from counts data

This tool expects a comma-separated table (CSV) of your best estimate of lineage counts of the lineage, with one column per timepoint. Each lineage is a row, and outputs are in the same order as the input.

For an example using data distributed in this repo, try:

python3 fitseq.py \
    --input testing/data/ppiseq_test_counts_1000.csv \
    --processes 8 \
    --t-seq 0 1 2 3 4 \
    --min-iter 10 \
    --max-iter-num 100 \
    --min-step 0.001 \
    --output-mean-fitness test_output_means.csv \
    -o test_output.csv

This reads an input file at testing/data/ppiseq_test_counts_1000.csv, and uses 8 processes. It assumes each sample is 1 "generation" of growth. It does a mandatory 10 iterations of burn-in to stabilize the estimates, then proceeds until the sum of negative log likelihood doesn't improve by at least 0.1% over the previous step, at 100 iterations max. Then it writes the mean fitness values to that CSV, and the rest to test_output.csv.

File formats

Input file format

This tool expects a comma-separated table (CSV) of your best estimate of lineage counts of the lineage, with one column per timepoint. Each lineage is a row, and outputs are in the same order as the input.

Something like:

21,7,2,0,0
35,71,34,38,12
5,9,1,0,0
3,8,4,3,1
12,10,11,1,0

Output file format

There are two outputs generated, the first is the per-lineage (per input row) fit parameters, an estimate of error of the optimization process[^eerror], and the model-projected psuedo-count expectations for each timepoint. For example, but rounded to 3 decimal places for tidy-ness:

Estimated_Fitness,Estimation_Error,Likelihood_Log,Estimated_Read_Number_t0,Estimated_Read_Number_t1,Estimated_Read_Number_t2,Estimated_Read_Number_t3,Estimated_Read_Number_t4
-1.529,0.517,4.998,21.0,6.608,2.148,0.298,0.004
-0.274,0.189,21.556,35.0,38.661,76.501,17.801,5.471
-0.728,0.316,10.277,5.0,3.504,6.152,0.332,0.009
-0.194,0.252,10.379,3.0,3.588,9.333,2.267,0.467
-0.596,0.214,7.849,12.0,9.602,7.805,4.172,0.104
-2.942,1.591,2.233,5.0,0.383,0.007,0.003,0.000

The headers are a bit long, I suppose. But they're informative...?

[^eerror]: The "Estimation_Error" is probably not what you're expecting. It's the second derivative of the change in sum-negative-log-likelihood of this lineage's optimization. So that might not be what you're wanting to use to filter lineages. You can however use the "Estimated" counts to calculate an R^2, have fun.

There is also the optional (and strongly suggested!) output file of the mean fitness per timepoint, given as a CSV format with headers, such as:

Samples,Estimate_Mean_Fitness
0,0.0
1,0.3833658804427269
2,0.9907541206263276
3,1.0394387021430962
4,1.0542176653660411

Options, from fitseq.py -h

input

  • -i INPUT, --input INPUT The path to a header-less CSV file, where each column contains the count of each lineage (each row is a lineage) at that sample/timepoint. REQUIRED
  • --t-seq [T_SEQ [T_SEQ ...]], -t [T_SEQ [T_SEQ ...]] The estimated "generations" of growth elapse at each sampled timepoint. This is useful for scaling the fitness or using unevenly spaced timepoints. REQUIRED

output

  • -o OUTPUT, --output OUTPUT The path (default STDOUT) from which to output the fitnesses and errors and likelihoods and estimated reads. CSV format. (default: STDOUT, so that you can pipe it into other tools)
  • --output-mean-fitness OUTPUT_MEAN_FITNESS, -om OUTPUT_MEAN_FITNESS The path (default None) to which to write the mean fitnessescalculated per sample.

parallelism

  • -p PROCESSES, --processes PROCESSES Number of processes to launch with multiprocessing
  • --max-chunk-size MAX_CHUNK_SIZE The max chunksize for parallelism, automatically set to a roughly even split of lineages per chunk. Tune if you want to.

optimization stopping control

  • --min-iter MIN_ITER Force FitSeq to run at least this many iterations in the optimization (default: 10)
  • --max-iter-num MAX_ITER_NUM, -m MAX_ITER_NUM Maximum number of iterations in the optimization (of optimizing population average fitness) (default: 100)
  • --minimum-step-size MINIMUM_STEP_SIZE, --min-step MINIMUM_STEP_SIZE Set a minimum fracitonal step size for improvement, if below this then the optimization iterations terminate. (default: 0.0001)

tuning details

  • --fitness-type {m,w}, -f {m,w} SORRY but Wrightian fitness does not yet work in this version, so just don't set the --fitness_type, or set to m. Sorry. Maybe we'll re-implement Wrightian fitness (w). Maybe.
  • -k KAPPA, --kappa KAPPA a noise parameter that characterizes the total noise introduced. For estimation, see doi:10.1038/nature14279 (default: 2.5)
  • --gtol GTOL The gradient tolerance parameter for the BFGS opitmization, default (from SciPy) is 1e-5
  • -g REGRESSION_NUM, --regression-num REGRESSION_NUM number of points used in the initial linear-regression-based fitness estimate (default: 2)

Evolution Simulator

evo_simulator.py simulates competitive pooled growth of lineages. This simulation includes sampling noise from growth, cell transfers (bottlenecks), DNA extraction, PCR, and sequencing. For example:

python evo_simulator.py -i input_EvoSimulation.csv \
    -t 0 3 6 9 12 -r 50 50 50 50 50 \
    -o output

python evo_simulator.py -i input_EvoSimulation.csv \
    -t 0 2 4 6 8 -r 75 75 75 75 50 \
    -n DNA_extraction PCR sequencing -d 300 -p 27 -f w \
    -o output

Options

  • --input or -i: a .csv file, with
    • 1st column of .csv: fitness of each genotype, [x1, x2, ...]
    • 2nd column .csv: initial cell number of each genotype at generation 0, [n1, n2, ...]
  • --t_seq or -t: time-points evaluated in number of generations (format: 0 t1 t2 ...)
  • --read_num_average_seq or -r: average number of reads per genotype for each time-point (format: 0 r1 r2 ...)
  • --noise_option or -n: which types of noise to include in the simulation, default is all sources of noise (default: growth bottleneck_transfer DNA_extraction PCR sequencing)
  • --dna_copies or -d: average genome copy number per genotype used as template in PCR (default: 500)
  • --pcr_cycles or -p: number of cycles of PCR (default: 25)
  • --fitness_type or -f: type of fitness: Wrightian fitness (w), or Malthusian fitness (m)' (default: m)
  • --output_filename or -o: prefix of output .csv files (default: output), this tool AUTOMATICALLY generates files named, for a -o option of output:
    • output_filename_EvoSimulation_Read_Number.csv: read number per genotype for each time-point
    • output_filename_EvoSimulation_Mean_Fitness.csv: mean fitness for each time-point
    • output_filename_EvoSimulation_Input_Log.csv: a record of all inputs

See python evo_simulator.py --help for a reminder...

History of version of this software.

  1. PyFitSeq is a Python-based fitness estimation tool for pooled amplicon sequencing studies. The conceptual/math work and first implementation is described in the paper Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies, and this code is available here.
  2. This was rewritten in python, available here and is a python-translated version of the MATLAB tool FitSeq above.
  3. This repo is a fork of that python version to fix some bugs and tweak the speed, flexibility, and interface. Also, changed the name to fitseq to reflect that it's the current development version. Wrightian fitness does not yet work in this version. Sorry.

Run from the base repo directory, so bash testing/test_singularity.sh

After building the image of course

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fitseq-1.5.0.tar.gz (17.9 kB view hashes)

Uploaded Source

Built Distribution

fitseq-1.5.0-py3-none-any.whl (22.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page