Statistical Methods for Data Science Toolkit

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Statistical Methods for Data Science (statsToolkit)

The project is sponsored by Malmö Universitet developed by Eng. Marco Schivo and Eng. Alberto Biscalchin under the supervision of Associete Professor Yuanji Cheng and is released under the MIT License. It is open source and available for anyone to use and contribute to.

Internal course Code reference: MA660E

Descriptive Statistics

This module contains basic descriptive statistics functions that allow users to perform statistical analysis on numerical datasets. The functions are designed for flexibility and ease of use, and they provide essential statistical metrics such as mean, median, range, variance, standard deviation, and quantiles.

Functions Overview

1. `mean(X)`

Calculates the arithmetic mean (average) of a list of numbers.

Example Usage:

from statstoolkit.statistics import mean

data = [1, 2, 3, 4, 5]
print(mean(data))  # Output: 3.0

2. `median(X)`

Calculates the median of a list of numbers. The median is the middle value that separates the higher half from the lower half of the dataset.

Example Usage:

from statstoolkit.statistics import median

data = [1, 2, 3, 4, 5]
print(median(data))  # Output: 3

3. `range_(X)`

Calculates the range, which is the difference between the maximum and minimum values in the dataset.

Example Usage:

from statstoolkit.statistics import range_

data = [1, 2, 3, 4, 5]
print(range_(data))  # Output: 4

4. `var(X, ddof=0)`

Calculates the variance of the dataset. Variance measures the spread of the data from the mean. You can calculate both population variance (ddof=0) or sample variance (ddof=1).

Example Usage:

from statstoolkit.statistics import var

data = [1, 2, 3, 4, 5]
print(var(data))  # Output: 2.0  (Population variance)
print(var(data, ddof=1))  # Output: 2.5  (Sample variance)

Here's the updated README section with the comparative table and an adjusted usage example:

5. `std(A, w=0, dim=None, vecdim=None, missingflag=None)`

Calculates the standard deviation similar to MATLAB's std function, with options for dimension selection, weighting, and handling of missing values.

Key Differences Between `std` Implementation and MATLAB's `std`

The Python std function has been customized to closely replicate MATLAB’s behavior, but certain nuances exist:

Multidimensional Array Handling: MATLAB handles 3D and 4D arrays by operating on the first non-singleton dimension by default, while Python’s version requires specifying dimensions to precisely match MATLAB’s output, especially for slice-by-slice or full-array calculations.
NaN Treatment: Both implementations allow for ignoring or including NaN values, but MATLAB’s “omitmissing” and “includemissing” options directly influence the calculations, while Python requires NaN handling in pre-processing to fully mimic MATLAB’s exact treatment.
Weighted Standard Deviation: MATLAB’s std includes weighted standard deviation handling across rows, columns, or other dimensions. Our Python version achieves similar results when weights are provided, but some variation can occur depending on array shape and dimension settings.
Dimension-Specific Calculations: MATLAB’s function allows for flexible calculations over specified dimensions, including non-contiguous dimensions (vecdim). Python handles contiguous dimensions but may yield different shapes unless explicitly set to match MATLAB’s squeeze behavior.

Comparison Table of `std` Results Between MATLAB and Python Implementations

Test	Description	MATLAB Result	Python Result	Match?
1	1D Array (Default)	7.9057	(7.905694150420948, 20.0)	✅
2	1D Array with Weight (w=1)	[212.1320, 212.1320, 212.1320]	[212.13203436, 212.13203436, 212.13203436]	✅
3	3D Array, default (slice-by-slice)	[2.5000, 7.7460, 4.5735] (3 slices)	[5.50151494, 5.4283208]	❌
4	2D Array, Default	6.2361	(6.236095644623236, 18.333333333333332)	✅
5	2D Array with NaN	7.0711	(7.0710678118654755, 60.0)	✅
6	2D Array, Row-wise	10	(7.0710678118654755, 20.0)	❌
7	2D Array, Weighted, w=1	[[1.0000, 2.0817, 3.2146], [1.0000, 1.0000, 1.5275]]	[[2.0000, 2.0000], [4.0000, 3.0000], [4.0000, 3.0550]]	❌
8	3D Array with NaN, omitmissing	0	(0.0, 1000.0)	✅
9	2D Array with NaN, includemissing	2.7386	(2.7386127875258306, 5.0)	✅
10	All Elements, 2D Array	9.9897	(9.87527045694513, 49.54743292509804)	❌

Example Usage:

from statstoolkit.statistics import std

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Default standard deviation
std_result, mean_result = std(data)
print("Standard Deviation:", std_result)
print("Mean:", mean_result)

# With dimension specified
std_dim_result, mean_dim_result = std(data, dim=1)
print("Standard Deviation along dimension 1:", std_dim_result)
print("Mean along dimension 1:", mean_dim_result)

# With NaN handling
data_with_nan = [[1, np.nan, 3], [4, 5, np.nan]]
std_nan_result, mean_nan_result = std(data_with_nan, missingflag="omitmissing")
print("Standard Deviation with NaN omitted:", std_nan_result)
print("Mean with NaN omitted:", mean_nan_result)

6. `quantile(X, Q)`

Calculates the quantile, which is the value below which a given percentage of the data falls. For example, the 0.25 quantile is the first quartile (25th percentile).

Example Usage:

from statstoolkit.statistics import quantile

data = [1, 2, 3, 4, 5]
print(quantile(data, 0.25))  # Output: 2.0 (25th percentile)
print(quantile(data, 0.5))  # Output: 3.0 (Median)
print(quantile(data, 0.75))  # Output: 4.0 (75th percentile)

7. `corrcoef(x, y, alternative='two-sided', method=None)`

Calculates the Pearson correlation coefficient between two datasets and provides the p-value for testing non-correlation.

Example Usage:

from statstoolkit.statistics import corrcoef

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
R, P = corrcoef(x, y)
print("Correlation coefficient:", R)
print("P-value matrix:", P)

8. `partialcorr(X, columns=None)`

Computes the partial correlation matrix, controlling for the influence of all other variables.

Example Usage:

from statstoolkit.statistics import partialcorr
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [5, 6, 7, 8, 9]
})
partial_corr_matrix = partialcorr(data)
print(partial_corr_matrix)

9. `cov(data1, data2=None)`

Calculates the covariance matrix between two datasets or within a single dataset.

Example Usage:

from statstoolkit.statistics import cov

data1 = [1, 2, 3, 4]
data2 = [2, 4, 6, 8]
print(cov(data1, data2))

10. `fitlm(x, y)`

Performs simple linear regression of y on x, returning a dictionary of regression results.

Example Usage:

from statstoolkit.statistics import fitlm

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
result = fitlm(x, y)
print(result)

11. `anova(y=None, factors=None, data=None, formula=None, response=None, sum_of_squares='type I')`

Performs one-way, two-way, or N-way Analysis of Variance (ANOVA) on data, supporting custom models.

Example Usage:

from statstoolkit.statistics import anova
import pandas as pd

data = pd.DataFrame({
    'y': [23, 25, 20, 21],
    'A': ['High', 'Low', 'High', 'Low'],
    'B': ['Type1', 'Type2', 'Type1', 'Type2']
})
result = anova(y='y', data=data, formula='y ~ A + B + A:B')
print(result)

12. `kruskalwallis(x, group=None, displayopt=False)`

Performs the Kruskal-Wallis H-test for independent samples, a non-parametric alternative to one-way ANOVA.

Example Usage:

from statstoolkit.statistics import kruskalwallis

x = [1.2, 3.4, 5.6, 1.1, 3.6, 5.5]
group = ['A', 'A', 'A', 'B', 'B', 'B']
p_value, anova_table, stats = kruskalwallis(x, group=group, displayopt=True)
print("P-value:", p_value)
print(anova_table)

Visualization Functions

This module contains several flexible visualization functions built using Matplotlib and Seaborn, allowing users to generate commonly used plots such as bar charts, pie charts, histograms, boxplots, and scatter plots. The visualizations are designed for customization, giving the user control over various parameters such as color, labels, figure size, and more.

Functions Overview

1. `bar_chart(x, y, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

Creates a bar chart with customizable x-axis labels, y-axis values, title, colors, and more.

Example Usage:

from statstoolkit.visualization import bar_chart

categories = ['Category A', 'Category B', 'Category C']
values = [10, 20, 30]

bar_chart(categories, values, title="Example Bar Chart", xlabel="Category", ylabel="Values", color="blue")

Example Bar Chart

2. `pie_chart(sizes, labels=None, title=None, colors=None, explode=None, autopct='%1.1f%%', shadow=False, startangle=90, **kwargs)`

Creates a pie chart with options for custom labels, colors, explode effect, and more.

Example Usage:

from statstoolkit.visualization import pie_chart

sizes = [15, 30, 45, 10]
labels = ['Category A', 'Category B', 'Category C', 'Category D']

pie_chart(sizes, labels=labels, title="Example Pie Chart", autopct='%1.1f%%', shadow=True)

Example Pie Chart

3. `histogram(x, bins=10, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

Creates a histogram for visualizing the frequency distribution of data. The number of bins or bin edges can be adjusted.

Example Usage:

from statstoolkit.visualization import histogram

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

histogram(data, bins=4, title="Example Histogram", xlabel="Value", ylabel="Frequency", color="green")

Example Histogram

4. `boxplot(MPG, origin=None, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

Creates a boxplot for visualizing the distribution of a dataset. It supports optional grouping variables, such as plotting the distribution of MPG (miles per gallon) by car origin.

Example Usage:

from statstoolkit.visualization import boxplot

MPG = [15, 18, 21, 24, 30]
origin = ['USA', 'Japan', 'Japan', 'USA', 'Europe']

boxplot(MPG, origin=origin, title="Example Boxplot by Car Origin", xlabel="Origin", ylabel="Miles per Gallon")

Example Boxplot

5. `scatterplot(x, y, z=None, symbol='o', title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

Creates a scatter plot, optionally supporting 3D-like plots where a third variable z can be mapped to point sizes or colors. Marker symbols and other plot properties can be customized.

Example Usage (2D Scatter Plot):

from statstoolkit.visualization import scatterplot

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

scatterplot(x, y, title="Example 2D Scatter Plot", xlabel="X-Axis", ylabel="Y-Axis", color="red")

Example 2D Scatter Plot

Example Usage (3D-like Scatter Plot):

from statstoolkit.visualization import scatterplot_3d

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
z = [50, 100, 200, 300]

scatterplot_3d(x, y, z, symbol="o", title="Example 3D Scatter Plot", xlabel="X-Axis", ylabel="Y-Axis", zlabel="Z-Axis", color="blue")

Example 3D Scatter Plot

Probability

1. `binopdf(k, n, p)`

Calculates the binomial probability mass function for given trials and success probability.

Example Usage:

from statstoolkit.probability import binopdf

k = 3  # number of successes
n = 10  # number of trials
p = 0.5  # probability of success
print(binopdf(k, n, p))  # Output: probability of exactly 3 successes

2. `poisspdf(k, mu)`

Calculates the Poisson probability mass function for given mean number of events.

Example Usage:

from statstoolkit.probability import poisspdf

k = 5  # number of events
mu = 3  # mean number of events
print(poisspdf(k, mu))  # Output: probability of exactly 5 events

3. `geopdf(k, p)`

Calculates the geometric probability mass function for the number of trials needed to get the first success.

Example Usage:

from statstoolkit.probability import geopdf

k = 3  # number of trials
p = 0.5  # probability of success
print(geopdf(k, p))  # Output: probability of success on the 3rd trial

4. `nbinpdf(k, r, p)`

Calculates the negative binomial probability mass function, where k is the number of failures until r successes occur.

Example Usage:

from statstoolkit.probability import nbinpdf

k = 4  # number of failures
r = 2  # number of successes
p = 0.5  # probability of success
print(nbinpdf(k, r, p))  # Output: probability of exactly 4 failures before 2 successes

5. `hygepdf(k, M, n, N)`

Calculates the hypergeometric probability mass function for the probability of drawing k successes from a population of M, with n successes in the sample.

Example Usage:

from statstoolkit.probability import hygepdf

k = 3  # successes in sample
M = 50  # population size
n = 20  # sample size
N = 10  # number of successes in population
print(hygepdf(k, M, n, N))  # Output: probability of drawing 3 successes

6. `betapdf(x, a, b)`

Calculates the beta probability density function.

Example Usage:

from statstoolkit.probability import betapdf

x = 0.5
a = 2  # shape parameter alpha
b = 2  # shape parameter beta
print(betapdf(x, a, b))  # Output: beta distribution density at x=0.5

7. `chi2pdf(x, df)`

Calculates the chi-squared probability density function.

Example Usage:

from statstoolkit.probability import chi2pdf

x = 5
df = 3  # degrees of freedom
print(chi2pdf(x, df))  # Output: chi-squared density at x=5

8. `exppdf(x, scale)`

Calculates the exponential probability density function.

Example Usage:

from statstoolkit.probability import exppdf

x = 2
scale = 1  # inverse of rate parameter lambda
print(exppdf(x, scale=scale))  # Output: exponential density at x=2

9. `fpdf(x, dfn, dfd)`

Calculates the F-distribution probability density function.

Example Usage:

from statstoolkit.probability import fpdf

x = 1.5
dfn = 5  # degrees of freedom numerator
dfd = 2  # degrees of freedom denominator
print(fpdf(x, dfn, dfd))  # Output: F-distribution density at x=1.5

10. `normpdf(x, mu, sigma)`

Calculates the normal (Gaussian) probability density function.

Example Usage:

from statstoolkit.probability import normpdf

x = 0
mu = 0  # mean
sigma = 1  # standard deviation
print(normpdf(x, mu, sigma))  # Output: normal density at x=0

11. `lognpdf(x, s, scale)`

Calculates the log-normal probability density function.

Example Usage:

from statstoolkit.probability import lognpdf

x = 1.5
s = 0.5  # shape parameter
scale = 1  # scale parameter
print(lognpdf(x, s, scale=scale))  # Output: log-normal density at x=1.5

12. `tpdf(x, df)`

Calculates the Student's t probability density function.

Example Usage:

from statstoolkit.probability import tpdf

x = 2
df = 10  # degrees of freedom
print(tpdf(x, df))  # Output: t-distribution density at x=2

13. `wblpdf(x, c, scale)`

Calculates the Weibull probability density function.

Example Usage:

from statstoolkit.probability import wblpdf

x = 1.2
c = 2  # shape parameter
scale = 1  # scale parameter
print(wblpdf(x, c, scale=scale))  # Output: Weibull density at x=1.2

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

1.0.2

Oct 28, 2024

1.0.1

Oct 28, 2024

1.0.0

Oct 26, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

statsToolkit-1.0.2.tar.gz (30.6 kB view details)

Uploaded Oct 28, 2024 Source

Built Distribution

statsToolkit-1.0.2-py3-none-any.whl (31.0 kB view details)

Uploaded Oct 28, 2024 Python 3

File details

Details for the file statsToolkit-1.0.2.tar.gz.

File metadata

Download URL: statsToolkit-1.0.2.tar.gz
Upload date: Oct 28, 2024
Size: 30.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for statsToolkit-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`fd58ecf9bceff5202d25d50cd908667d42d2e0c7fa8646a976a8219a5a71ff54`
MD5	`59eb873f8dfaadecb053bc6085eb7285`
BLAKE2b-256	`5c1537dd197a01d2a04b532b889c162b1cd86e774cad7cadfd9946756fe0cc04`

See more details on using hashes here.

File details

Details for the file statsToolkit-1.0.2-py3-none-any.whl.

File metadata

Download URL: statsToolkit-1.0.2-py3-none-any.whl
Upload date: Oct 28, 2024
Size: 31.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for statsToolkit-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b76e9b6a7c24da459c66f486ef2ed1df8ac8fd7a755877b7cb5877601c350eb6`
MD5	`f7c1f0960aeee147506af697869e03e6`
BLAKE2b-256	`e055c3f078acf734eb28a7b748c17fef57683d0e16ed6bf082dc15c657227764`

See more details on using hashes here.

statsToolkit 1.0.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Statistical Methods for Data Science (statsToolkit)

Descriptive Statistics

Functions Overview

1. mean(X)

Example Usage:

2. median(X)

Example Usage:

3. range_(X)

Example Usage:

4. var(X, ddof=0)

Example Usage:

5. std(A, w=0, dim=None, vecdim=None, missingflag=None)

Key Differences Between std Implementation and MATLAB's std

Comparison Table of std Results Between MATLAB and Python Implementations

Example Usage:

6. quantile(X, Q)

Example Usage:

7. corrcoef(x, y, alternative='two-sided', method=None)

Example Usage:

8. partialcorr(X, columns=None)

Example Usage:

9. cov(data1, data2=None)

Example Usage:

10. fitlm(x, y)

Example Usage:

11. anova(y=None, factors=None, data=None, formula=None, response=None, sum_of_squares='type I')

Example Usage:

12. kruskalwallis(x, group=None, displayopt=False)

Example Usage:

Visualization Functions

Functions Overview

1. bar_chart(x, y, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Example Usage:

2. pie_chart(sizes, labels=None, title=None, colors=None, explode=None, autopct='%1.1f%%', shadow=False, startangle=90, **kwargs)

Example Usage:

3. histogram(x, bins=10, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Example Usage:

4. boxplot(MPG, origin=None, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Example Usage:

5. scatterplot(x, y, z=None, symbol='o', title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)

Example Usage (2D Scatter Plot):

Example Usage (3D-like Scatter Plot):

Probability

1. binopdf(k, n, p)

Example Usage:

2. poisspdf(k, mu)

Example Usage:

3. geopdf(k, p)

Example Usage:

4. nbinpdf(k, r, p)

Example Usage:

5. hygepdf(k, M, n, N)

Example Usage:

6. betapdf(x, a, b)

Example Usage:

7. chi2pdf(x, df)

Example Usage:

8. exppdf(x, scale)

Example Usage:

9. fpdf(x, dfn, dfd)

Example Usage:

10. normpdf(x, mu, sigma)

Example Usage:

11. lognpdf(x, s, scale)

Example Usage:

12. tpdf(x, df)

Example Usage:

13. wblpdf(x, c, scale)

Example Usage:

License

Project details

Verified details

1. `mean(X)`

2. `median(X)`

3. `range_(X)`

4. `var(X, ddof=0)`

5. `std(A, w=0, dim=None, vecdim=None, missingflag=None)`

Key Differences Between `std` Implementation and MATLAB's `std`

Comparison Table of `std` Results Between MATLAB and Python Implementations

6. `quantile(X, Q)`

7. `corrcoef(x, y, alternative='two-sided', method=None)`

8. `partialcorr(X, columns=None)`

9. `cov(data1, data2=None)`

10. `fitlm(x, y)`

11. `anova(y=None, factors=None, data=None, formula=None, response=None, sum_of_squares='type I')`

12. `kruskalwallis(x, group=None, displayopt=False)`

1. `bar_chart(x, y, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

2. `pie_chart(sizes, labels=None, title=None, colors=None, explode=None, autopct='%1.1f%%', shadow=False, startangle=90, **kwargs)`

3. `histogram(x, bins=10, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

4. `boxplot(MPG, origin=None, title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

5. `scatterplot(x, y, z=None, symbol='o', title=None, xlabel=None, ylabel=None, color=None, figsize=(10, 6), **kwargs)`

1. `binopdf(k, n, p)`

2. `poisspdf(k, mu)`

3. `geopdf(k, p)`

4. `nbinpdf(k, r, p)`

5. `hygepdf(k, M, n, N)`

6. `betapdf(x, a, b)`

7. `chi2pdf(x, df)`

8. `exppdf(x, scale)`

9. `fpdf(x, dfn, dfd)`

10. `normpdf(x, mu, sigma)`

11. `lognpdf(x, s, scale)`

12. `tpdf(x, df)`

13. `wblpdf(x, c, scale)`