A utility for fitting lineage fitnesses within a pooled competition experiment, achieved by iteratively optimizing models of individual and population-average lineage fitness.
Project description
fitseq
Accurate pooled competition assays requires accounting for the changing population-average fitness in order toe best estimate the fitness of lineages within the pool. This utility does both by iterating between optimizations of per-lineage fitness given the average, and calculating the new average fitness.
The core concept and architecture is written by FangFei Li (@FangFei05). This fork is just tweaking the interface and metrics for my personal use, but if anyone else wants to use it I can offer limited help.
On a recent datasets of five timepoints for ~3.5 million lineages,
fitseq
was finished within 4.5 hours (wall), using 20 cores and 4GB of RAM.
If you use this software, please reference: F. Li, et al. Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies. Cell Systems, 7: 521-525 (2018)
Installation
With pip
You can install from this git repo directly as:
python3 -m pip install git+https://github.com/darachm/fitseq.git
Install the latest development branch with something like:
python3 -m pip install git+https://github.com/darachm/fitseq.git@dev
Test installation with:
fitseq.py -h
Or don't install, use a container
Docker
This repository has a Dockerfile
to build off the main branch of this repo,
like so
docker build -t darachm/fitseq ./
Or you should be able to pull it off of Dockerhub like:
docker run darachm/fitseq:latest fitseq.py -h
You can then use that, with a Docker installation like so:
docker run \
--mount type=bind,source=$(pwd)/testing,target=/testing \
darachm/fitseq \
fitseq.py \
-i testing/data/ppiseq_test_counts_1000.csv \
-p 8 -t 0 1 2 3 4 \
-m 20 --min-step 0.001 \
--output-mean-fitness testing/output/test_means.csv \
-o testing/output/test_out.csv
Note that you need to --mount
the directory with the inputs in a folder,
and use that in the command. Directories on containers... yeah.
I think Singularity is more intuitive/accessible for most folks...
Singularity
On a multi-user HPC? Want to get an achive-ready monolithic stable container? Singularity is a container system for scientific multi-user HPC computing and archiving. You can build your own container from the Singularity file in this repo using a command like:
sudo singularity build fitseq.sif Singularity.fitseq
( This is just turning the docker image into a Singularity image. Just so you know. )
Then put it where you need it, and run with something like:
singularity exec fitseq.sif \
fitseq.py \
-i testing/data/ppiseq_test_counts_1000.csv \
-p 8 -t 0 1 2 3 4 \
-m 20 --min-step 0.001 \
--output-mean-fitness testing/output/test_means.csv \
-o testing/output/test_out.csv
Usage
The fitseq.py
script functions to estimate fitnesses of lineages in a pool.
There is also a script evo_simulator.py
that can perform simulations of
competitive pooled growth of lineages, in order to generate synthetic data for
benchmarking.
fitseq.py
- estimate fitnesses from counts data
This tool expects a comma-separated table (CSV) of your best estimate of lineage counts of the lineage, with one column per timepoint. Each lineage is a row, and outputs are in the same order as the input.
For an example using data distributed in this repo, try:
python3 fitseq.py \
--input testing/data/ppiseq_test_counts_1000.csv \
--processes 8 \
--t-seq 0 1 2 3 4 \
--min-iter 10 \
--max-iter-num 100 \
--min-step 0.001 \
--output-mean-fitness test_output_means.csv \
-o test_output.csv
This reads an input file at
testing/data/ppiseq_test_counts_1000.csv
, and uses 8 processes.
It assumes each sample is 1 "generation" of growth.
It does a mandatory 10 iterations of burn-in to stabilize the estimates, then
proceeds until the sum of negative log likelihood doesn't improve by at least
0.1% over the previous step, at 100 iterations max.
Then it writes the mean fitness values to that CSV,
and the rest to test_output.csv
.
File formats
Input file format
This tool expects a comma-separated table (CSV) of your best estimate of lineage counts of the lineage, with one column per timepoint. Each lineage is a row, and outputs are in the same order as the input.
Something like:
21,7,2,0,0
35,71,34,38,12
5,9,1,0,0
3,8,4,3,1
12,10,11,1,0
Output file format
There are two outputs generated, the first is the per-lineage (per input row) fit parameters, an estimate of error of the optimization process[^eerror], and the model-projected psuedo-count expectations for each timepoint. For example, but rounded to 3 decimal places for tidy-ness:
Estimated_Fitness,Estimation_Error,Likelihood_Log,Estimated_Read_Number_t0,Estimated_Read_Number_t1,Estimated_Read_Number_t2,Estimated_Read_Number_t3,Estimated_Read_Number_t4
-1.529,0.517,4.998,21.0,6.608,2.148,0.298,0.004
-0.274,0.189,21.556,35.0,38.661,76.501,17.801,5.471
-0.728,0.316,10.277,5.0,3.504,6.152,0.332,0.009
-0.194,0.252,10.379,3.0,3.588,9.333,2.267,0.467
-0.596,0.214,7.849,12.0,9.602,7.805,4.172,0.104
-2.942,1.591,2.233,5.0,0.383,0.007,0.003,0.000
The headers are a bit long, I suppose. But they're informative...?
[^eerror]: The "Estimation_Error" is probably not what you're expecting. It's the second derivative of the change in sum-negative-log-likelihood of this lineage's optimization. So that might not be what you're wanting to use to filter lineages. You can however use the "Estimated" counts to calculate an R^2, have fun.
There is also the optional (and strongly suggested!) output file of the mean fitness per timepoint, given as a CSV format with headers, such as:
Samples,Estimate_Mean_Fitness
0,0.0
1,0.3833658804427269
2,0.9907541206263276
3,1.0394387021430962
4,1.0542176653660411
Options, from fitseq.py -h
input
-i
INPUT,--input
INPUT The path to a header-less
CSV file, where each column contains the count of each lineage (each row is a lineage) at that sample/timepoint. REQUIRED--t-seq
[T_SEQ [T_SEQ ...]],-t
[T_SEQ [T_SEQ ...]] The estimated "generations" of growth elapse at each sampled timepoint. This is useful for scaling the fitness or using unevenly spaced timepoints. REQUIRED
output
-o
OUTPUT,--output
OUTPUT The path (default STDOUT) from which to output the fitnesses and errors and likelihoods and estimated reads. CSV format. (default: STDOUT, so that you can pipe it into other tools)--output-mean-fitness
OUTPUT_MEAN_FITNESS,-om
OUTPUT_MEAN_FITNESS The path (default None) to which to write the mean fitnessescalculated per sample.
parallelism
-p
PROCESSES,--processes
PROCESSES Number of processes to launch with multiprocessing--max-chunk-size
MAX_CHUNK_SIZE The max chunksize for parallelism, automatically set to a roughly even split of lineages per chunk. Tune if you want to.
optimization stopping control
--min-iter
MIN_ITER Force FitSeq to run at least this many iterations in the optimization (default: 10)--max-iter-num
MAX_ITER_NUM,-m
MAX_ITER_NUM Maximum number of iterations in the optimization (of optimizing population average fitness) (default: 100)--minimum-step-size
MINIMUM_STEP_SIZE,--min-step
MINIMUM_STEP_SIZE Set a minimum fracitonal step size for improvement, if below this then the optimization iterations terminate. (default: 0.0001)
tuning details
--fitness-type
{m,w},-f
{m,w} SORRY but Wrightian fitness does not yet work in this version, so just don't set the--fitness_type
, or set tom
. Sorry. Maybe we'll re-implement Wrightian fitness (w). Maybe.-k
KAPPA,--kappa
KAPPA a noise parameter that characterizes the total noise introduced. For estimation, see doi:10.1038/nature14279 (default: 2.5)--gtol
GTOL The gradient tolerance parameter for the BFGS opitmization, default (from SciPy) is 1e-5-g
REGRESSION_NUM,--regression-num
REGRESSION_NUM number of points used in the initial linear-regression-based
fitness estimate (default: 2)
Evolution Simulator
evo_simulator.py
simulates competitive pooled growth of lineages.
This simulation includes sampling noise from growth,
cell transfers (bottlenecks), DNA extraction, PCR, and sequencing.
For example:
python evo_simulator.py -i input_EvoSimulation.csv \
-t 0 3 6 9 12 -r 50 50 50 50 50 \
-o output
python evo_simulator.py -i input_EvoSimulation.csv \
-t 0 2 4 6 8 -r 75 75 75 75 50 \
-n DNA_extraction PCR sequencing -d 300 -p 27 -f w \
-o output
Options
--input
or-i
: a .csv file, with- 1st column of .csv: fitness of each genotype, [x1, x2, ...]
- 2nd column .csv: initial cell number of each genotype at generation 0, [n1, n2, ...]
--t_seq
or-t
: time-points evaluated in number of generations (format: 0 t1 t2 ...)--read_num_average_seq
or-r
: average number of reads per genotype for each time-point (format: 0 r1 r2 ...)--noise_option
or-n
: which types of noise to include in the simulation, default is all sources of noise (default: growth bottleneck_transfer DNA_extraction PCR sequencing)--dna_copies
or-d
: average genome copy number per genotype used as template in PCR (default: 500)--pcr_cycles
or-p
: number of cycles of PCR (default: 25)--fitness_type
or-f
: type of fitness: Wrightian fitness (w), or Malthusian fitness (m)' (default: m)--output_filename
or-o
: prefix of output .csv files (default: output), this tool AUTOMATICALLY generates files named, for a-o
option ofoutput
:output_filename_EvoSimulation_Read_Number.csv
: read number per genotype for each time-pointoutput_filename_EvoSimulation_Mean_Fitness.csv
: mean fitness for each time-pointoutput_filename_EvoSimulation_Input_Log.csv
: a record of all inputs
See python evo_simulator.py --help
for a reminder...
History of version of this software.
- PyFitSeq is a Python-based fitness estimation tool for pooled amplicon sequencing studies. The conceptual/math work and first implementation is described in the paper Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies, and this code is available here.
- This was rewritten in python, available here and is a python-translated version of the MATLAB tool FitSeq above.
- This repo is a fork of that python version to fix some bugs and tweak the
speed, flexibility, and interface.
Also, changed the name to
fitseq
to reflect that it's the current development version. Wrightian fitness does not yet work in this version. Sorry.
Run from the base repo directory, so bash testing/test_singularity.sh
After building the image of course
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fitseq-1.5.0.tar.gz
.
File metadata
- Download URL: fitseq-1.5.0.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | df7ed12c12af13a2d6ffc32d7f5dda45b5514e25a3df98fc8f1fab0db242e386 |
|
MD5 | a272df5eed39e7d92bdd3642bcc4fd40 |
|
BLAKE2b-256 | 85395a857cfd952ad013cf351a50d94482814b2cfe533ad074ac4a22dc7f4c4f |
File details
Details for the file fitseq-1.5.0-py3-none-any.whl
.
File metadata
- Download URL: fitseq-1.5.0-py3-none-any.whl
- Upload date:
- Size: 22.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0aa25a6c8c71ca56e6a3698bdee1e0ae410adfe8feb6135936cf544aa7e77008 |
|
MD5 | 49a35a85c33655b39d281a3cbd7af744 |
|
BLAKE2b-256 | 333ea9e29e570dbb052112124f9d00a5ea05e5dcc9117f927bee1104cac2999a |