Pasio is a tool for segmentation and denosing DNA coverage profiles coming from highthroughput sequencing data.
Project description
PASIO
Pasio is a tool for denosing DNA coverage profiles coming from highthroughput sequencing data. Example of experiments pasio works well on is ChIPseq, DNAseseq, ATACseq.
It takes a .bed file of counts (integer values, normalization is not supported). And produces tsv file with genome splited into segments which coverage can be treated as equal.
Pasio runs on both Python 2 and 3 (Python 2 interpreter runs a bit faster). The only dependencies are numpy and scipy.
Recommended command line for most practical cases is:
python m pasio
bedgraph <PATH TO INPUT bedGraph FILE> o <PATH TO OUTPUT bedGraph FILE>
alpha 5 beta 1
no_split_constant
algorithm rounds
window_shift 1250 window_size 2500
Underlying math
PASIO is a program to segment chromatin accessibility profile. It accepts a bedgraph file with coverage of position by DNase cuts (e.g. by 5'ends of DNaseseq) and splits each contig/chromosome into segments with different accessibilites in an optimal way.
Method is based on two assumptions:
 cuts are introduced by Poisson process
P(λ)
withλ
depending on segment λ
are distributed asλ ~ Г(α, β) = β^α * λ^{α  1} * exp(βλ) / Г(α)
α
and β
are the only parameters of segmentation.
Then we can derive (logarithmic) marginal likelyhood logML
to be optimized.
logML
for a single segment S
of length L
with coverages (S_1, S_2, ...S_L)
and total coverage C = \sum_i(S_i)
will be:
logML(S,α,β) = α*log(β) − log(Γ(α)) + log(Γ(C + α)) − (C + α) * log(L + β) − \sum_i log (S_i!)
Here α*log(β) − log(Γ(α))
can be treated as a penalty for segment creation (and is approximately proportional to α*log(β/α))
.
Total logML(α,β)
is a sum of logarithmic marginal likelihoods for all segments: logML(α,β) = \sum_S logML(S,α,β)
.
Given a chromosome coverage profile, term \sum_S {\sum_i log (S_i!)}
doesn't depend on segmentation.
Value \sum_S {log(Γ(C + α)) − (C + α) * log(L + β)}
is refered as a self score
of a segment.
We optimize only segmentationdependent part of logML
which is termed just score
.
This score is a sum of self score of a segment and a penalty for segment creation.
Program design
split_bedgraph
loads bedgraph file chromosome by chromosome, splits them into segments and writes into output tsv file.
Coverage counts are stored internally with 1nt resolution.
Splitting is proceeded in two steps: (a) reduce a list of candidate split points (sometimes this step is omitted), (b) choose splits from a list of candidates and calculate score of segmentation. The first step is performed with one of so called reducers. The second step is performed with one of splitters (each splitter also implements a reducer interface but not vice versa).
Splitters and reducers:
 The most basic splitter is
SquareSplitter
which implements dynamic programming algorithm withO(N^2)
complexity whereN
is a number of split candidates. Other splitters/reducers perform some heuristical optimisations on top ofSquareSplitter
SlidingWindowReducer
tries to segment not an entire contig (chromosome) but shorter parts of contig. So they scan a sequence with a sliding window and remove split candidates which are unlikely. Each window is processed using some base splitter (typicallySquareSplitter
). Candidates from different windows are then aggregated.RoundReducer
perform the same procedure and repeat it for several rounds or until list of split candidates converges.NotZeroReducer
discards (all) splits if all points of an interval under consideration are zeros.NotConstantReducer
discards splits between samevalued points.ReducerCombiner
accept a list of reducers to be sequentially applied. The last reducer can also be a splitter. In that case combiner allows for splitting and scoring a segmentation. To transform any reducer into splitter one can combine that reducer withNopSplitter
 so that split candidates obtained by reducer will be treated as final splitting and NopSplitter make it possible to calculate its score.
Splits denote segment boundaries to the left of position. Adjacent splits a
and b
form semiclosed interval [a, b)
E.g. for coverage counts [99,99,99, 1,1,1]
splits should be [0, 3, 6]
.
So that we have two segments: [0, 3)
and [3, 6)
.
Splits and split candidates are stored as numpy arrays and always include both inner split points and segment boundaries, i.e. point just before config start and right after the contig end.
One can also treat splits as positions betweenelements (like in python slices)
counts:  99 99 99  1 1 1 
splits candidates: 0 1 2 3 4 5 6
splits: 0 3 6
Splitters invoke LogMarginalLikelyhoodComputer
which can compute logML
for a splitting (and for each segment).
LogMarginalLikelyhoodComputer
store cumulative sums of coverage counts at split candidates
and also distances between candidates. It allows one to efficiently compute logML
and doesn't need
to recalculate total segment coverages each time.
In order to efficiently compute log(x)
and log(Г(x))
we precompute values for some first million of integer numbers x
.
Computation efficiency restricts us to integer values of α
and β
. Segment lengths are naturally integer,
coverage counts (and total segment counts) are also integer because they represent numbers of cuts.
LogComputer
and LogGammaComputer
store precomputed values and know how to calculate these values efficiently.
Project details
Release history Release notifications
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size pasio1.0.0py2.py3noneany.whl (20.0 kB)  File type Wheel  Python version py2.py3  Upload date  Hashes View hashes 
Filename, size pasio1.0.0.tar.gz (22.1 kB)  File type Source  Python version None  Upload date  Hashes View hashes 
Hashes for pasio1.0.0py2.py3noneany.whl
Algorithm  Hash digest  

SHA256  8850356eca9066deb8882e4071e6d68a0115636be8b7e40f90279a8ff3c167e3 

MD5  6818a588c306c8fa2ad9a1976182e829 

BLAKE2256  f2f603c31dd4f89e2b2aa6697c4c1dfd0fd97012bb504fcb4b278b0c4e93dfac 