KDE-Likelihood

A python package to compute kernel-density-estimated likelihood.

Installation

To compile and use this package, you need Python 3, pip, and a C++ compiler. On Linux, the compiler should not be a concern; on Windows, you need the Visual Studio 2019 compiler, which you can get by downloading and installing the Build Tools for Visual Studio.

If Python and pip are installed and available on the PATH, navigate to the project folder containing setup.py and execute

pip install .
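
If the build succeeded, the package should be importable (a quick check; kdelikelihood is the import name used in the example below):

python -c "import kdelikelihood"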

Usage example:

import numpy as np
from kdelikelihood import ParallelLikelihoodComputer, get_bandwidth_by_silverman


# Some model for simulating the data
def model(parameters, sampleSize=10000):
    result = np.zeros((sampleSize, 3))

    # Suppose the first column contains some count data
    result[:, 0] = np.random.poisson(parameters[0], size=sampleSize)

    # Suppose the second column contains some positive data
    result[:, 1] = np.random.randn(sampleSize) ** 2 * parameters[1] + result[:, 0]

    # Suppose the third column contains some unconstrained data
    result[:, 2] = np.random.randn(sampleSize) * result[:, 1] * parameters[2]

    return result


# The observed data (each row is an independent sample)
observedData = np.array(
    [
        [1.0, 1.11921911, -0.71177555],
        [6.0, 6.02855478, 8.37566854],
        [3.0, 9.57724975, 6.41826056],
        [5.0, 5.64429256, 6.24338937],
        [8.0, 9.57156202, 7.87944723],
        [4.0, 4.84531121, -0.65146309],
        [7.0, 7.03391557, -8.34715355],
        [3.0, 3.04287902, 0.92252008],
        [7.0, 7.00074299, 0.98825773],
        [5.0, 5.83488089, -6.66077578],
        [6.0, 6.26802035, -6.32007646],
        [4.0, 4.00099543, -5.72002288],
        [2.0, 2.61069588, 1.98710941],
        [7.0, 8.08870818, -11.84123493],
        [4.0, 4.20187965, 2.96009114],
        [2.0, 3.21568321, -0.51097464],
        [3.0, 9.08856868, 10.7647393],
        [5.0, 5.00020691, -4.58271643],
        [5.0, 11.4222031, 5.95202002],
        [5.0, 5.64400458, -4.22623137],
        [6.0, 7.08425759, -3.3731975],
        [4.0, 4.59795136, -4.37783711],
        [4.0, 7.20161146, -3.36315821],
        [3.0, 3.00051945, -2.39977869],
        [1.0, 3.05606476, -1.18190334],
        [2.0, 2.18191751, 2.29985054],
        [5.0, 5.11884191, -0.74355017],
        [5.0, 5.92561278, -8.76376851],
        [2.0, 4.19454749, 13.87107752],
        [5.0, 5.00930303, -0.34471651],
    ]
)

# The type of data we are considering in each column:
# 0: unconstrained real numbers
# 1: non-negative real numbers
# 2: natural numbers
modes = [2, 1, 0]
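# (i.e., column 0 holds count data -> mode 2, column 1 non-negative data -> mode 1,
#  column 2 unconstrained data -> mode 0)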

# The size of the sample we are generating for each parameter set
# to estimate the likelihood
sampleSize = 10000

# Bandwidths. Smaller values mean reduced bias if the sample is large enough.
# Bigger values mean reduced stochasticity of the result.
# As a starting point, we can use Silverman's rule of thumb
bandwidths = get_bandwidth_by_silverman(observedData, sampleSize)
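# (For orientation: for a single column, Silverman's rule of thumb is roughly
#  h = 0.9 * min(std, IQR / 1.34) * n ** (-1 / 5); the helper above presumably
#  applies a comparable column-wise formula, with n being the simulation
#  sample size passed as the second argument.)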

# Tolerance for the result
# This is not a rigorous error bound. Larger values make
# computations faster; smaller values increase the accuracy
atol = 1e-10

# To mitigate the "curse of dimensionality", we can treat parts of the data
# as independent. However, since in our example all columns are interdependent,
# we put them all in the same group. That is, we have one group only.
dataGroups = 1
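# (If some columns were genuinely independent of the others, splitting them
#  into separate groups would lower the dimensionality of each density
#  estimate; see the package documentation for how such groups are specified.)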

# Set up a likelihood computer based on the observed data
with ParallelLikelihoodComputer(observedData, bandwidths, modes, dataGroups, atol) as cmp:
    # Some parameters:
    # (here, these are the true parameters used to generate the data set, but
    #  in practice we would of course not know them)
    parameters = [5, 1, 1]

    # Generate a data set by simulating the model (each row is a "quasi-independent" sample)
    simulatedData = model(parameters, sampleSize)

    # Compute the likelihood
    logLikelihood = cmp.compute_log_likelihood(simulatedData)

    print("Ln(Likelihood):", logLikelihood)

    # To maximize the likelihood, we could use an optimization algorithm that
    # tolerates noisy objectives, such as pybobyqa (https://github.com/numericalalgorithmsgroup/pybobyqa/)
    import pybobyqa

    # Define the objective function
    def fun(parameters):
        # Generate a sample from the model
        sample = model(parameters, sampleSize)

        # Return the negative log-likelihood, as pybobyqa is a
        # minimization algorithm
        return -cmp.compute_log_likelihood(sample)

    # Define some bounds for the parameters
    bounds = [[1e-10, 0, 0], [20, 20, 20]]

    # An initial guess
    x0 = [10, 10, 10]

    result = pybobyqa.solve(
        fun,
        x0,
        objfun_has_noise=True,
        bounds=bounds,
        maxfun=1000,
        seek_global_minimum=True,
        print_progress=True,
    )

    print(result)

    # For comparison, the negative log-likelihood estimated at the true parameters:
    print("Negative log-likelihood of the true parameters:", fun(parameters))

    # The estimated parameters are not fully accurate. One reason may be that
    # the optimizer failed to find parameters as good as the true ones. Another
    # potential reason is that `observedData` is small and may not contain
    # enough information to identify the parameters precisely.
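
    # As an illustrative follow-up (not part of the package API), we could gauge
    # the stochastic noise of the estimator by evaluating the objective several
    # times at the fitted parameters and comparing the spread to the gap between
    # the fitted and the true parameters:
    replicateValues = [fun(result.x) for _ in range(5)]
    print("Std. dev. of repeated evaluations at the optimum:", np.std(replicateValues))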
