Skip to main content

Statistical Methods for Data Science Toolkit

Project description

Statistical Methods for Data Science (statsToolkit)

The project is sponsored by Malmö Universitet developed by Eng. Marco Schivo and Eng. Alberto Biscalchin under the supervision of Associete Professor Yuanji Cheng and is released under the MIT License. It is open source and available for anyone to use and contribute to.

Internal course Code reference: MA660E


Descriptive Statistics

This module contains basic descriptive statistics functions that allow users to perform statistical analysis on numerical datasets. The functions are designed for flexibility and ease of use, and they provide essential statistical metrics such as mean, median, range, variance, standard deviation, and quantiles.

Functions Overview

1. mean(X)

Calculates the arithmetic mean (average) of a list of numbers.

Example Usage:

from statstoolkit.statistics import mean

data = [1, 2, 3, 4, 5]
print(mean(data))  # Output: 3.0

2. median(X)

Calculates the median of a list of numbers. The median is the middle value that separates the higher half from the lower half of the dataset.

Example Usage:

from statstoolkit.statistics import median

data = [1, 2, 3, 4, 5]
print(median(data))  # Output: 3

3. range_(X)

Calculates the range, which is the difference between the maximum and minimum values in the dataset.

Example Usage:

from statstoolkit.statistics import range_

data = [1, 2, 3, 4, 5]
print(range_(data))  # Output: 4

4. var(X, ddof=0)

Calculates the variance of the dataset. Variance measures the spread of the data from the mean. You can calculate both population variance (ddof=0) or sample variance (ddof=1).

Example Usage:

from statstoolkit.statistics import var

data = [1, 2, 3, 4, 5]
print(var(data))  # Output: 2.0  (Population variance)
print(var(data, ddof=1))  # Output: 2.5  (Sample variance)

5. std(X, ddof=0)

Calculates the standard deviation, which is the square root of the variance. It indicates how much the data varies from the mean.

Example Usage:

from statstoolkit.statistics import std

data = [1, 2, 3, 4, 5]
print(std(data))  # Output: 1.4142135623730951  (Population standard deviation)
print(std(data, ddof=1))  # Output: 1.5811388300841898  (Sample standard deviation)

6. quantile(X, Q)

Calculates the quantile, which is the value below which a given percentage of the data falls. For example, the 0.25 quantile is the first quartile (25th percentile).

Example Usage:

from statstoolkit.statistics import quantile

data = [1, 2, 3, 4, 5]
print(quantile(data, 0.25))  # Output: 2.0 (25th percentile)
print(quantile(data, 0.5))  # Output: 3.0 (Median)
print(quantile(data, 0.75))  # Output: 4.0 (75th percentile)

7. corrcoef(x, y, alternative='two-sided', method=None)

Calculates the Pearson correlation coefficient between two datasets and provides the p-value for testing non-correlation.

Example Usage:

from statstoolkit.statistics import corrcoef

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
R, P = corrcoef(x, y)
print("Correlation coefficient:", R)
print("P-value matrix:", P)

8. partialcorr(X, columns=None)

Computes the partial correlation matrix, controlling for the influence of all other variables.

Example Usage:

from statstoolkit.statistics import partialcorr
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [5, 6, 7, 8, 9]
})
partial_corr_matrix = partialcorr(data)
print(partial_corr_matrix)

9. cov(data1, data2=None)

Calculates the covariance matrix between two datasets or within a single dataset.

Example Usage:

from statstoolkit.statistics import cov

data1 = [1, 2, 3, 4]
data2 = [2, 4, 6, 8]
print(cov(data1, data2))

10. fitlm(x, y)

Performs simple linear regression of y on x, returning a dictionary of regression results.

Example Usage:

from statstoolkit.statistics import fitlm

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
result = fitlm(x, y)
print(result)

11. anova(y=None, factors=None, data=None, formula=None, response=None, sum_of_squares='type I')

Performs one-way, two-way, or N-way Analysis of Variance (ANOVA) on data, supporting custom models.

Example Usage:

from statstoolkit.statistics import anova
import pandas as pd

data = pd.DataFrame({
    'y': [23, 25, 20, 21],
    'A': ['High', 'Low', 'High', 'Low'],
    'B': ['Type1', 'Type2', 'Type1', 'Type2']
})
result = anova(y='y', data=data, formula='y ~ A + B + A:B')
print(result)

12. kruskalwallis(x, group=None, displayopt=False)

Performs the Kruskal-Wallis H-test for independent samples, a non-parametric alternative to one-way ANOVA.

Example Usage:

from statstoolkit.statistics import kruskalwallis

x = [1.2, 3.4, 5.6, 1.1, 3.6, 5.5]
group = ['A', 'A', 'A', 'B', 'B', 'B']
p_value, anova_table, stats = kruskalwallis(x, group=group, displayopt=True)
print("P-value:", p_value)
print(anova_table)

Visualization Functions

This module contains several flexible visualization functions built using Matplotlib and Seaborn, allowing users to generate commonly used plots such as bar charts, pie charts, histograms, boxplots, and scatter plots. The visualizations are designed for customization, giving the user control over various parameters such as color, labels, figure size, and more.

Functions Overview

1. bar_chart(x, y, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Creates a bar chart with customizable x-axis labels, y-axis values, title, colors, and more.

Example Usage:

from statstoolkit.visualization import bar_chart

categories = ['Category A', 'Category B', 'Category C']
values = [10, 20, 30]

bar_chart(categories, values, title="Example Bar Chart", xlabel="Category", ylabel="Values", color="blue")

Example Bar Chart


2. pie_chart(sizes, labels=None, title=None, colors=None, explode=None, autopct='%1.1f%%', shadow=False, startangle=90, **kwargs)

Creates a pie chart with options for custom labels, colors, explode effect, and more.

Example Usage:

from statstoolkit.visualization import pie_chart

sizes = [15, 30, 45, 10]
labels = ['Category A', 'Category B', 'Category C', 'Category D']

pie_chart(sizes, labels=labels, title="Example Pie Chart", autopct='%1.1f%%', shadow=True)

Example Pie Chart


3. histogram(x, bins=10, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Creates a histogram for visualizing the frequency distribution of data. The number of bins or bin edges can be adjusted.

Example Usage:

from statstoolkit.visualization import histogram

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

histogram(data, bins=4, title="Example Histogram", xlabel="Value", ylabel="Frequency", color="green")

Example Histogram


4. boxplot(MPG, origin=None, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Creates a boxplot for visualizing the distribution of a dataset. It supports optional grouping variables, such as plotting the distribution of MPG (miles per gallon) by car origin.

Example Usage:

from statstoolkit.visualization import boxplot

MPG = [15, 18, 21, 24, 30]
origin = ['USA', 'Japan', 'Japan', 'USA', 'Europe']

boxplot(MPG, origin=origin, title="Example Boxplot by Car Origin", xlabel="Origin", ylabel="Miles per Gallon")

Example Boxplot


5. scatterplot(x, y, z=None, symbol='o', title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Creates a scatter plot, optionally supporting 3D-like plots where a third variable z can be mapped to point sizes or colors. Marker symbols and other plot properties can be customized.

Example Usage (2D Scatter Plot):

from statstoolkit.visualization import scatterplot

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

scatterplot(x, y, title="Example 2D Scatter Plot", xlabel="X-Axis", ylabel="Y-Axis", color="red")

Example 2D Scatter Plot

Example Usage (3D-like Scatter Plot):

from statstoolkit.visualization import scatterplot_3d

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
z = [50, 100, 200, 300]

scatterplot_3d(x, y, z, symbol="o", title="Example 3D Scatter Plot", xlabel="X-Axis", ylabel="Y-Axis", zlabel="Z-Axis", color="blue")

Example 3D Scatter Plot


Probability

1. binopdf(k, n, p)

Calculates the binomial probability mass function for given trials and success probability.

Example Usage:

from statstoolkit.probability import binopdf

k = 3  # number of successes
n = 10  # number of trials
p = 0.5  # probability of success
print(binopdf(k, n, p))  # Output: probability of exactly 3 successes

2. poisspdf(k, mu)

Calculates the Poisson probability mass function for given mean number of events.

Example Usage:

from statstoolkit.probability import poisspdf

k = 5  # number of events
mu = 3  # mean number of events
print(poisspdf(k, mu))  # Output: probability of exactly 5 events

3. geopdf(k, p)

Calculates the geometric probability mass function for the number of trials needed to get the first success.

Example Usage:

from statstoolkit.probability import geopdf

k = 3  # number of trials
p = 0.5  # probability of success
print(geopdf(k, p))  # Output: probability of success on the 3rd trial

4. nbinpdf(k, r, p)

Calculates the negative binomial probability mass function, where k is the number of failures until r successes occur.

Example Usage:

from statstoolkit.probability import nbinpdf

k = 4  # number of failures
r = 2  # number of successes
p = 0.5  # probability of success
print(nbinpdf(k, r, p))  # Output: probability of exactly 4 failures before 2 successes

5. hygepdf(k, M, n, N)

Calculates the hypergeometric probability mass function for the probability of drawing k successes from a population of M, with n successes in the sample.

Example Usage:

from statstoolkit.probability import hygepdf

k = 3  # successes in sample
M = 50  # population size
n = 20  # sample size
N = 10  # number of successes in population
print(hygepdf(k, M, n, N))  # Output: probability of drawing 3 successes

6. betapdf(x, a, b)

Calculates the beta probability density function.

Example Usage:

from statstoolkit.probability import betapdf

x = 0.5
a = 2  # shape parameter alpha
b = 2  # shape parameter beta
print(betapdf(x, a, b))  # Output: beta distribution density at x=0.5

7. chi2pdf(x, df)

Calculates the chi-squared probability density function.

Example Usage:

from statstoolkit.probability import chi2pdf

x = 5
df = 3  # degrees of freedom
print(chi2pdf(x, df))  # Output: chi-squared density at x=5

8. exppdf(x, scale)

Calculates the exponential probability density function.

Example Usage:

from statstoolkit.probability import exppdf

x = 2
scale = 1  # inverse of rate parameter lambda
print(exppdf(x, scale=scale))  # Output: exponential density at x=2

9. fpdf(x, dfn, dfd)

Calculates the F-distribution probability density function.

Example Usage:

from statstoolkit.probability import fpdf

x = 1.5
dfn = 5  # degrees of freedom numerator
dfd = 2  # degrees of freedom denominator
print(fpdf(x, dfn, dfd))  # Output: F-distribution density at x=1.5

10. normpdf(x, mu, sigma)

Calculates the normal (Gaussian) probability density function.

Example Usage:

from statstoolkit.probability import normpdf

x = 0
mu = 0  # mean
sigma = 1  # standard deviation
print(normpdf(x, mu, sigma))  # Output: normal density at x=0

11. lognpdf(x, s, scale)

Calculates the log-normal probability density function.

Example Usage:

from statstoolkit.probability import lognpdf

x = 1.5
s = 0.5  # shape parameter
scale = 1  # scale parameter
print(lognpdf(x, s, scale=scale))  # Output: log-normal density at x=1.5

12. tpdf(x, df)

Calculates the Student's t probability density function.

Example Usage:

from statstoolkit.probability import tpdf

x = 2
df = 10  # degrees of freedom
print(tpdf(x, df))  # Output: t-distribution density at x=2

13. wblpdf(x, c, scale)

Calculates the Weibull probability density function.

Example Usage:

from statstoolkit.probability import wblpdf

x = 1.2
c = 2  # shape parameter
scale = 1  # scale parameter
print(wblpdf(x, c, scale=scale))  # Output: Weibull density at x=1.2

License

This project is licensed under the MIT License - see the LICENSE file for details.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

statsToolkit-1.0.0.tar.gz (28.4 kB view hashes)

Uploaded Source

Built Distribution

statsToolkit-1.0.0-py3-none-any.whl (28.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page