Calculate the relationship between parameters with known distributions.

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Distrel

Distrel (distribution relations) is a package used to approximate the relationship between dependent distributions when only the properties of the distribution are known.

Example use case

Suppose we have 3 parameters with known distributions, such that: $a \sim \mathcal{N}(\mu_a, \sigma_a)$, $b \sim \mathcal{N}(\mu_b, \sigma_b)$ and $c \sim \mathcal{N}(\mu_c, \sigma_c)$ and distributions are related by $c = a \times b$.

Distrel attempts to solve the question, how do we correctly pseudo-randomly sample from $a$ and $b$ such then $c$ follows the correct distribution.

How it works

Distrel is a wrapper around a simple neural network written in pytorch that finds the function $f_{\mu}(a)$ and $f_{\sigma}(a)$ from:

$b \sim \mathcal{N}(f_{\mu}(a), f_{\sigma}(a))$

such that:

$a \times b \sim \mathcal{N}(\mu_c, \sigma_c)$

That is, Distrel assumes that $a$ and $b$ are weakly related and hence samples $b$ from a normal distribution defined by a particular value of $a$.

Below is a diagram of the distrel neural network architecture:

A diagram of the network architectures

Installation

With poetry

poetry add distrel

or with pip

pip install distrel

Usage

To make things interesting, let's consider the relationship $a \times b = c$ but each distribution is a unique non-central t-distribution. The key thing to note is that non-central t-distributions are defined by their degrees of freedom and a non-centrality parameter. We could use any distribution but currently distributions can only be characterised by their mean, variance, skewness and kurtosis - if more properties are required then please do open an issue.

from scipy.stats import nct
import distrel

# Defines the degrees of freedom (df) and non-centrality parameter (nc) for each distribution
a_df, a_nc = 9, 52
b_df, b_nc = 4, 30
c_df, c_nc = 143, 90

The distribution relation network (drn) needs three things: - A generator function to generate the input data. - A calculator function to obtain the 3rd distribution from the output of the generator and the output of the neural network. - The properties of each of the 3 distributions.

# We need to define a generator function that can generate samples of A.
# It must take N (the number of samples as it's only argument.
# It doesn't matter that gen_a doesn't return a torch tensor, this is accounted for internally.
def gen_a(N):
    return nct.rvs(a_df, a_nc, size=N)
    
# We also need to define a function that calculates $c$ from $a$ and $b$.
# That is, the neural network will take input $a$, generate $b$ and then calculate $c$ from $a$ and 
# $b$.
def calc_c(a, b):
    return a * b
    
# Now we define the distribution relation network
drn = distrel.drn(gen=gen_a, calc=calc_c, seed=42)



# We need to define the distribution properties with the following:
# nct.stats returns mean, var, skew, kurt
a_mean, a_var, a_skew, a_kurt = nct.stats(a_df, a_nc, moments='mvsk')
b_mean, b_var, b_skew, b_kurt = nct.stats(b_df, b_nc, moments='mvsk')
c_mean, c_var, c_skew, c_kurt = nct.stats(c_df, c_nc, moments='mvsk')

drn.set_properties(
    {"mean": a_mean, "var": a_var, "skew": a_skew, "kurt": a_kurt}, 
    {"mean": b_mean, "var": b_var, "skew": b_skew, "kurt": b_kurt}, 
    {"mean": c_mean, "var": c_var, "skew": c_skew, "kurt": c_kurt},
)
# We could have alternatively initialised the drn with the following
# drn = distrel.drn(
#    gen=gen_a, 
#    calc=calc_c, 
#    properties=[
#        {"mean": a_mean, "var": a_var},
#        {"mean": b_mean, "var": b_var},
#        {"mean": c_mean, "var": c_var},
#    ],
#)
# This also shows that we don't /have/ to test for all four metrics.
# Currently, only mean, variance, skewness and kurtosis are supported.

Now the drn is ready to train, a tolerance can be provided to allow for early stopping if ALL of the distribution properties are within a percentage tolerance. The tolerance defaults to zero which means no early stopping occurs. Additionally, the maximum number of epochs can be set (this defaults to 100) and then drn will stop training upon reaching this number regardless of performance.

The training starts off with gradient free optimisation before switching to a gradient based optimiser. To disable the gradient free optimiser, set budget to 1.

You can set the number of samples generated by the generator with N.

For reproducible results, you can also set the seed which sets the seed for pytorch, numpy and python's random.

drn.train returns a boolean, True if the training exited via to early stopping due to convergence, and False otherwise.

converge = drn.train(
    max_epochs=1000, 
    tol=1e-2, 
    progress_bar=False,
    lr=1e-3
    budget=1000,
    N=1000,
    seed=42,
)

If you want more fine grain controlled over the optimiser - you can do that!

from torch.optim import LBFGS

converge = drn.train(
    max_epochs=max_epochs,
    tol=tol,
    lr=lr,
    progress_bar=True,
    require_closure=True,
    optim=LBFGS,
    optim_kwargs={'line_search_fn': 'strong_wolfe'},
)

print(f"Converged?:\t{converge}")

The optim_kwargs are unpacked when calling the optimiser. When the optimiser is called internally, set require_closure to True if you'd like to call:

optim.step(closure)

or require_closure to False if you'd like to call:

optim.step()

# To view the network waits simply print the drn
print(drn)

Which returns something like:

        _l2_ mu
in _l1_/
       \_l3_ sigma
out ~ N(mu, sigma)
---

l1:
Linear(in_features=1, out_features=1, bias=True)
Bias: -3.3627309799194336
Weight: 0.027783172205090523

l2:
Linear(in_features=1, out_features=1, bias=True)
Bias: 0.14515426754951477
Weight: 3.0655031204223633

l3:
Linear(in_features=1, out_features=1, bias=True)
Bias: -13.486244201660156
Weight: 2.5820298194885254

# To use the drn as a predictor, call the predict method, a seed is optional but recommend for.
# reproducible results.
new_a = 37
new_b = drn.predict(37, seed=42)
new_c = drn.calc(new_a, new_b) # Wrapps around the calc_c function

Contributing

Please feel free to submit an issue for something you'd like to see fixed or implemented. Alternatively merge requests are always welcome!

Copying

This project is licenced under the GNU GPL version 3 license - to read the license see LICENSE.

Project details

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.2.1

Dec 14, 2023

0.2.0

Dec 14, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

distrel-0.2.1.tar.gz (18.2 kB view hashes)

Uploaded Dec 14, 2023 Source

Built Distribution

distrel-0.2.1-py3-none-any.whl (19.0 kB view hashes)

Uploaded Dec 14, 2023 Python 3

Hashes for distrel-0.2.1.tar.gz

Hashes for distrel-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`8f65ef0502ffd32b16d49abffe2510a01f7eb7ebbaf2d5b12838327d17a4d5a5`
MD5	`6a731e84be0e978098f93fdf9c1b821d`
BLAKE2b-256	`5af8486dd86cd4f5d6fbba63fe2810b6ce802cfb71acab892fd00581d910bbac`

Hashes for distrel-0.2.1-py3-none-any.whl

Hashes for distrel-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7b601cb5012591e9cb642cf050c92168777ff02f890d6f6b54a71cc7cc04427`
MD5	`9214ab6e8dae9c0049b33410f189469f`
BLAKE2b-256	`bafe7f5eeafd82743825ea3bcff35addcc2a9b86ff6f15f9487a319cf4ac3769`