BetaNegBinFit

A very brief manual

The user-facing core of BetaNegBinFit is a set of model classes, each of which models a certain distribution and does the heavy lifting of fitting it. At the moment, 2 models are available:

  • ModelMixture -- a model that describes counts at a given slice as a mixture of 2 binomial-like distributions;
  • ModelLine -- this can be thought of as a composition of many ModelMixtures (one per slice) that are linked by constraining the r parameter to be a linear function of the slice.

Both models can use either a negative binomial or a beta negative binomial distribution (see the model argument of their __init__ methods).
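To make the two ideas above concrete, here is a minimal sketch of a two-component negative binomial mixture and of an r parameter that depends linearly on the slice. It is an illustration only: the function names, the weight w, the choice of p and 1 - p for the two components, and the slope/intercept parametrization are assumptions and are not taken from the BetaNegBinFit source.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial pmf: k failures before the r-th success (success prob. p)."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def mixture_pmf(k, r, p, w):
    """Mixture of two NB components with weight w (assumed form, not the package's)."""
    return w * nb_pmf(k, r, p) + (1.0 - w) * nb_pmf(k, r, 1.0 - p)


def r_for_slice(slice_value, slope, intercept):
    """Hypothetical linear link between the slice and r, as in the ModelLine idea."""
    return slope * slice_value + intercept


# Probability of observing 15 counts at a slice where r is tied to the slice value.
r = r_for_slice(10, slope=0.5, intercept=1.0)
prob = mixture_pmf(15, r=r, p=1.0 / 3.0, w=0.5)
```

In this sketch a single r is shared by both components at a given slice, so linking slices only requires two extra numbers (slope and intercept) instead of one r per slice.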

Use example: ModelMixture

Running ModelMixture is as simple as:

from betanegbinfit import ModelMixture
m = ModelMixture(bad=2, left=4)
res = m.fit(some_slice)

Then you can inspect the parameters by examining the res variable, which is a fairly self-explanatory dict.

What is some_slice?

Assume that we want some_slice to be the slice of reference counts with fix_c = 23 and BAD = 3 from our ChIP-seq dataset. We suggest doing it this way:

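import os
import pandas as pd

data_folder = 'Data'
data_file = os.path.join(data_folder, 'chipseq.tsv')

bad = 3
fix_c = 23
dfo = pd.read_csv(data_file, sep='\t')
# Keep only SNPs with the requested BAD value
dfo = dfo[dfo.BAD == bad]
refs = dfo.REF_COUNTS
alts = dfo.ALT_COUNTS
# Reference counts of SNPs whose alternative count equals fix_c
some_slice = refs[alts == fix_c]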
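The notion of a slice can also be shown without any data file. The sketch below groups toy (ref, alt) pairs by alternative count, so each slice is the list of reference counts that share one alternative count. The data and variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (ref_count, alt_count) pairs for a single BAD value.
pairs = [(12, 23), (30, 23), (7, 5), (19, 23), (8, 5)]

# Map each alternative count to the reference counts observed with it.
slices = defaultdict(list)
for ref, alt in pairs:
    slices[alt].append(ref)

# The slice at fix_c = 23, analogous to some_slice above.
some_slice = slices[23]
```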

Use example: ModelLine

ModelLine is run similarly, but this time we pass the whole dataset to the fit method instead of a single slice:

from betanegbinfit import ModelLine
m = ModelLine(bad=2, left=4)
res = m.fit(data)

We advise that data be an n x 2 NumPy array (where the 1st column holds reference allele counts and the 2nd holds alternative allele counts) rather than a pandas DataFrame. If a DataFrame is passed instead, ModelLine will try to guess the ref count, alt count and BAD columns from it.
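One way to build such an n x 2 array is to stack the two count vectors column-wise. This is a minimal sketch with made-up counts; in practice refs and alts would come from the REF_COUNTS and ALT_COUNTS columns after filtering by BAD.

```python
import numpy as np

# Hypothetical toy counts standing in for the REF_COUNTS / ALT_COUNTS columns.
refs = np.array([12, 30, 7, 19])
alts = np.array([23, 23, 5, 23])

# Column 0: reference allele counts, column 1: alternative allele counts.
data = np.column_stack([refs, alts])
```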

Statistics

The stats module has a number of functions that can be of interest to a prospective user:

  1. rmsea - calculate the RMSEA goodness-of-fit statistic;
  2. calc_pvalues - calculate a p-value for each SNP;
  3. calc_eff_sizes - calculate "effect sizes" for each SNP;
  4. calc_adjusted_loglik - calculate the adjusted log-likelihood, i.e. the log-likelihood corrected for the geometry of its parameters. This is done by subtracting the logdet of the Fisher information matrix.
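For orientation, the two less familiar quantities in this list can be sketched in plain Python. The RMSEA below is the common textbook definition, and the adjustment follows the one-line description above. Whether the package uses exactly these formulas or scalings is not stated here, so the function names and signatures are hypothetical and are not the stats module's API.

```python
import math


def rmsea_textbook(chi2, dof, n):
    """Textbook RMSEA: sqrt(max(chi2 - dof, 0) / (dof * (n - 1)))."""
    return math.sqrt(max(chi2 - dof, 0.0) / (dof * (n - 1)))


def adjusted_loglik_2x2(loglik, fim):
    """Log-likelihood minus logdet of a 2 x 2 Fisher information matrix."""
    (a, b), (c, d) = fim
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("Fisher information matrix must have a positive determinant")
    return loglik - math.log(det)


# Toy numbers: chi-square of 30 on 10 degrees of freedom with 101 observations.
fit_quality = rmsea_textbook(30.0, 10, 101)
adjusted = adjusted_loglik_2x2(-100.0, [[2.0, 0.0], [0.0, 3.0]])
```

A sharply peaked likelihood has a large Fisher information determinant, so the subtraction penalizes it more than a flat one.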

Automatic everything & multiprocessing

However, instead of manually creating instances of model classes and working through BetaNegBinFit methods, it may be preferable to run a single ready-to-use function. The package has a utils.run function that is very easy to use and also handles parallelization. See test.py for a real (and very short) example. Most importantly, it produces tabular data that can be easily used in downstream analysis.

It also has plenty of arguments that can be taken advantage of to do some preprocessing, which might be crucial for some datasets.

Please note that all functions have plenty of optional arguments, all of which are documented, so please consider reading through help(function of interest).

A note on performance

In our experience, BetaNegBinFit should run in a manageable amount of time. For instance, when ModelLine with model='BetaNB' is run against the chipseq.tsv dataset, it finishes in 6 minutes on a Ryzen 5600U. With model='NB', it finishes in under 2 minutes.

Source Distribution

betanegbinfit-1.10.2.tar.gz (53.8 kB)