Skip to main content

A collection tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.

Project description

Analysis Tool Box

Description

Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statisitics, analytics, and intelligence analysis.

Getting Started

To install the package, run the following command in the root directory of the project:

pip install analysistoolbox

Visualizations are created using the matplotlib and seaborn libraries. While you can select whichever seaborn style you'd like, the following Seaborn style tends to get the best looking plots:

sns.set(
    style="white",
    font="Arial",
    context="paper"
)

Documentation Standards

To verify that the library remains accessible to students and professionals alike, all functions in the library should include a Teaching Note in their docstring. This note should explain why the function matters, how it aids in "sense-making," or what qualitative insights it reveals, rather than just describing the mechanics of the code.

See analysistoolbox.calculus.PlotFunction for a canonical example.

Table of Contents / Usage

There are many modules in the analysistoolbox package, each with their own functions. The following is a list of the modules:

Calculus

FindDerivative

The FindDerivative function calculates the derivative of a given function. It uses the sympy library, a Python library for symbolic mathematics, to perform the differentiation. The function also has the capability to print the original function and its derivative, return the derivative function, and plot both the original function and its derivative.

# Load the FindDerivative function from the Calculus submodule
from analysistoolbox.calculus import FindDerivative
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function
f_of_x = x**3 + 2*x**2 + 3*x + 4

# Use the FindDerivative function
FindDerivative(
    f_of_x, 
    print_functions=True, 
    return_derivative_function=True, 
    plot_functions=True
)

FindLimitOfFunction

The FindLimitOfFunction function finds the limit of a function at a specific point and optionally plot the function and its tangent line at that point. The script uses the matplotlib and numpy libraries for plotting and numerical operations respectively.

# Import the necessary libraries
from analysistoolbox.calculus import FindLimitOfFunction
import numpy as np
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function
f_of_x = np.sin(x) / x

# Use the FindLimitOfFunction function
FindLimitOfFunction(
    f_of_x, 
    point=0, 
    step=0.01, 
    plot_function=True, 
    x_minimum=-10, 
    x_maximum=10, 
    n=1000, 
    tangent_line_window=1
)

FindMinimumSquareLoss

The FindMinimumSquareLoss function calculates the minimum square loss between observed and predicted values. This function is often used in machine learning and statistics to measure the average squared difference between the actual and predicted outcomes.

# Import the necessary libraries
from analysistoolbox.calculus import FindMinimumSquareLoss

# Define observed and predicted values
observed_values = [1, 2, 3, 4, 5]
predicted_values = [1.1, 1.9, 3.2, 3.7, 5.1]

# Use the FindMinimumSquareLoss function
minimum_square_loss = FindMinimumSquareLoss(
    observed_values, 
    predicted_values, 
    show_plot=True
)

# Print the minimum square loss
print(f"The minimum square loss is: {minimum_square_loss}")

PlotFunction

The PlotFunction function plots a mathematical function of x. It takes a lambda function as input and allows for customization of the plot.

# Import the necessary libraries
from analysistoolbox.calculus import PlotFunction
import sympy

# Set x as a symbolic variable
x = sympy.symbols('x')

# Define the function to plot
f_of_x = lambda x: x**2

# Plot the function with default settings
PlotFunction(f_of_x)

Data Collection

ExtractTextFromPDF

The ExtractTextFromPDF function extracts text from a PDF file, cleans it, then saves it to a text file.

# Import the function
from analysistoolbox.data_collection import ExtractTextFromPDF

# Call the function
ExtractTextFromPDF(
    filepath_to_pdf="/path/to/your/input.pdf", 
    filepath_for_exported_text="/path/to/your/output.txt", 
    start_page=1, 
    end_page=None
)

FetchPDFFromURL

The FetchPDFFromURL function downloads a PDF file from a URL and saves it to a specified location.

# Import the function
from analysistoolbox.data_collection import FetchPDFFromURL

# Call the function to download the PDF
FetchPDFFromURL(
    url="https://example.com/sample.pdf", 
    filename="C:/folder/sample.pdf"
)

FetchUSShapefile

The FetchUSShapefile function fetches a geographical shapefile from the TIGER database of the U.S. Census Bureau.

# Import the function
from analysistoolbox.data_collection import FetchUSShapefile

# Fetch the shapefile for the census tracts in King County, Washington, for the 2021 census year
shapefile = FetchUSShapefile(
    state='PA', 
    county='Allegheny', 
    geography='tract', 
    census_year=2021
)

# Print the first few rows of the shapefile
print(shapefile.head())

FetchWebsiteText

The FetchWebsiteText function fetches the text from a website and saves it to a text file.

# Import the function
from analysistoolbox.data_collection import FetchWebsiteText

# Call the function
text = FetchWebsiteText(
    url="https://www.example.com", 
    browserless_api_key="your_browserless_api_key"
)

# Print the fetched text
print(text)

GetCompanyFilings

The GetCompanyFilings function fetches company filings from the SEC EDGAR database. It returns a list of filings for a given company CIK (Central Index Key) and filing type.

# Import the function
from analysistoolbox.data_collection import GetCompanyFilings

# Call the function to get company filings for 'Online Dating' companies in 2024
results = GetCompanyFilings(
        search_keywords="Online Dating",
        start_date="2024-01-01",
        end_date="2024-12-31",
        filing_type="all",
    )

# Print the results
print(results)

GetGoogleSearchResults

The GetGoogleSearchResults function fetches Google search results for a given query using the Serper API.

# Import the function
from analysistoolbox.data_collection import GetGoogleSearchResults

# Call the function with the query
# Make sure to replace 'your_serper_api_key' with your actual Serper API key
results = GetGoogleSearchResults(
    query="Python programming", 
    serper_api_key='your_serper_api_key', 
    number_of_results=5, 
    apply_autocorrect=True, 
    display_results=True
)

# Print the results
print(results)

GetZipFile

The GetZipFile function downloads a zip file from a url and saves it to a specified folder. It can also unzip the file and print the contents of the zip file.

# Import the function
from analysistoolbox.data_collection import GetZipFile

# Call the function
GetZipFile(
    url="http://example.com/file.zip", 
    path_to_save_folder="/path/to/save/folder"
)

Data Processing

AddDateNumberColumns

The AddDateNumberColumns function adds columns for the year, month, quarter, week, day of the month, and day of the week to a dataframe.

# Import necessary packages
from analysistoolbox.data_processing import AddDateNumberColumns
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {'Date': [datetime(2020, 1, 1), datetime(2020, 2, 1), datetime(2020, 3, 1), datetime(2020, 4, 1)]}
df = pd.DataFrame(data)

# Use the function on the sample dataframe
df = AddDateNumberColumns(
    dataframe=df, 
    date_column_name='Date'
)

# Print the updated dataframe
print(df)

AddLeadingZeros

The AddLeadingZeros function adds leading zeros to a column. If fixed_length is not specified, the longest string in the column is used as the fixed length. If add_as_new_column is set to True, the new column is added to the dataframe. Otherwise, the original column is updated.

# Import necessary packages
from analysistoolbox.data_processing import AddLeadingZeros
import pandas as pd

# Create a sample dataframe
data = {'ID': [1, 23, 456, 7890]}
df = pd.DataFrame(data)

# Use the AddLeadingZeros function
df = AddLeadingZeros(
    dataframe=df, 
    column_name='ID', 
    add_as_new_column=True
)

# Print updated dataframe
print(df)

AddRowCountColumn

The AddRowCountColumn function adds a column to a dataframe that contains the row number for each row, based on a group (or groups) of columns. The function can also sort the dataframe by a column or columns before adding the row count column.

# Import necessary packages
from analysistoolbox.data_processing import AddRowCountColumn
import pandas as pd

# Create a sample dataframe
data = {
    'Payment Method': ['Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card'],
    'Transaction Value': [100, 200, 300, 400, 500, 600, 700, 800],
    'Transaction Order': [1, 2, 3, 4, 5, 6, 7, 8]
}
df = pd.DataFrame(data)

# Call the function
df_updated = AddRowCountColumn(
    dataframe=df, 
    list_of_grouping_variables=['Payment Method'], 
    list_of_order_columns=['Transaction Order'], 
    list_of_ascending_order_args=[True]
)

# Print the updated dataframe
print(df_updated)

AddTPeriodColumn

The AddTPeriodColumn function adds a T-period column to a dataframe. The T-period column is the number of intervals (e.g., days or weeks) since the earliest date in the dataframe.

# Import necessary libraries
from analysistoolbox.data_processing import AddTPeriodColumn
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {
    'date': pd.date_range(start='1/1/2020', end='1/10/2020'),
    'value': range(1, 11)
}
df = pd.DataFrame(data)

# Use the function
df_updated = AddTPeriodColumn(
    dataframe=df, 
    date_column_name='date', 
    t_period_interval='days'
)

# Print the updated dataframe
print(df_updated)

AddTukeyOutlierColumn

The AddTukeyOutlierColumn function adds a column to a dataframe that indicates whether a value is an outlier. The function uses the Tukey method to identify outliers.

# Import necessary libraries
from analysistoolbox.data_processing import AddTukeyOutlierColumn
import pandas as pd

# Create a sample dataframe
data = pd.DataFrame({'values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]})

# Use the function
df_updated = AddTukeyOutlierColumn(
    dataframe=data, 
    value_column_name='values', 
    tukey_boundary_multiplier=1.5, 
    plot_tukey_outliers=True
)

# Print the updated dataframe
print(df_updated)

CleanTextColumns

The CleanTextColumns function cleans string-type columns in a pandas DataFrame by removing all leading and trailing spaces.

# Import necessary libraries
from analysistoolbox.data_processing import CleanTextColumns
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [' hello', 'world ', ' python '],
    'B': [1, 2, 3],
})

# Clean the dataframe
df_clean = CleanTextColumns(df)

ConductAnomalyDetection

The ConductAnomalyDetection function performs anomaly detection on a given dataset using the z-score method.

# Import necessary libraries
from analysistoolbox.data_processing import ConductAnomalyDetection
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 1000],
    'B': [4, 5, 6, 2000],
})

# Conduct anomaly detection
df_anomaly_detected = ConductAnomalyDetection(
    dataframe=df, 
    list_of_columns_to_analyze=['A', 'B']
)

# Print the updated dataframe
print(df_anomaly_detected)

ConductEntityMatching

The ConductEntityMatching function performs entity matching between two dataframes using various fuzzy matching algorithms.

from analysistoolbox.data_processing import ConductEntityMatching
import pandas as pd

# Create two dataframes
dataframe_1 = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

dataframe_2 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Name': ['Jon Doe', 'Jane Smyth', 'Robert Johnson'],
    'City': ['NYC', 'LA', 'Chicago']
})

# Conduct entity matching
matched_entities = ConductEntityMatching(
    dataframe_1=dataframe_1,
    dataframe_1_primary_key='ID',
    dataframe_2=dataframe_2,
    dataframe_2_primary_key='ID',
    levenshtein_distance_filter=3,
    match_score_threshold=80,
    columns_to_compare=['Name', 'City'],
    match_methods=['Partial Token Set Ratio', 'Weighted Ratio']
)

ConvertOddsToProbability

The ConvertOddsToProbability function converts odds to probability in a new column.

# Import necessary packages
from analysistoolbox.data_processing import ConvertOddsToProbability
import pandas as pd

# Create a sample dataframe
data = {
    'Team': ['Team1', 'Team2', 'Team3', 'Team4'],
    'Odds': [2.5, 1.5, 3.0, np.nan]
}
df = pd.DataFrame(data)

# Print the original dataframe
print("Original DataFrame:")
print(df)

# Use the function to convert odds to probability
df = ConvertOddsToProbability(
    dataframe=df, 
    odds_column='Odds'
)

CountMissingDataByGroup

The CountMissingDataByGroup function counts the number of records with missing data in a Pandas dataframe, grouped by specified columns.

# Import necessary packages
from analysistoolbox.data_processing import CountMissingDataByGroup
import pandas as pd
import numpy as np

# Create a sample dataframe with some missing values
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, np.nan, 4, 5, np.nan],
    'Value2': [np.nan, 8, 9, 10, np.nan, 12]
}
df = pd.DataFrame(data)

# Use the function to count missing data by group
CountMissingDataByGroup(
    dataframe=df, 
    list_of_grouping_columns=['Group']
)

CreateBinnedColumn

The CreateBinnedColumn function creates a new column in a Pandas dataframe based on a numeric variable. Binning is a process of transforming continuous numerical variables into discrete categorical 'bins'.

# Import necessary packages
from analysistoolbox.data_processing import CreateBinnedColumn
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, 3, 4, 5, 6],
    'Value2': [7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Use the function to create a binned column
df_binned = CreateBinnedColumn(
    dataframe=df, 
    numeric_column_name='Value1', 
    number_of_bins=3, 
    binning_strategy='uniform'
)

CreateDataOverview

The CreateDataOverview function creates an overview of a Pandas dataframe, including the data type, missing count, missing percentage, and summary statistics for each variable in the DataFrame.

# Import necessary packages
from analysistoolbox.data_processing import CreateDataOverview
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Column1': [1, 2, 3, np.nan, 5, 6],
    'Column2': ['a', 'b', 'c', 'd', np.nan, 'f'],
    'Column3': [7.1, 8.2, 9.3, 10.4, np.nan, 12.5]
}
df = pd.DataFrame(data)

# Use the function to create an overview of the dataframe
CreateDataOverview(
    dataframe=df, 
    plot_missingness=True
)

CreateRandomSampleGroups

The CreateRandomSampleGroups function a takes a pandas DataFrame, shuffle its rows, assign each row to one of n groups, and then return the updated DataFrame with an additional column indicating the group number.

# Import necessary packages
from analysistoolbox.data_processing import CreateRandomSampleGroups 
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 31, 35, 19, 45],
    'Score': [85, 95, 78, 81, 92]
}
df = pd.DataFrame(data)

# Use the function
grouped_df = CreateRandomSampleGroups(
    dataframe=df, 
    number_of_groups=2, 
    random_seed=123
)

CreateRareCategoryColumn

The CreateRareCategoryColumn function creates a new column in a Pandas dataframe that indicates whether a categorical variable value is rare. A rare category is a category that occurs less than a specified percentage of the time.

# Import necessary packages
from analysistoolbox.data_processing import CreateRareCategoryColumn 
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
updated_df = CreateRareCategoryColumn(
    dataframe=df, 
    categorical_column_name='Name', 
    rare_category_label='Rare', 
    rare_category_threshold=0.05,
    new_column_suffix='(rare category)'
)

CreateStratifiedRandomSampleGroups

The CreateStratifiedRandomSampleGroups unction performs stratified random sampling on a pandas DataFrame. Stratified random sampling is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics.

# Import necessary packages
from analysistoolbox.data_processing import CreateStratifiedRandomSampleGroups
import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
stratified_df = CreateStratifiedRandomSampleGroups(
    dataframe=df, 
    number_of_groups=2, 
    list_categorical_column_names=['Name'], 
    random_seed=42
)

ImputeMissingValuesUsingNearestNeighbors

The ImputeMissingValuesUsingNearestNeighbors function imputes missing values in a dataframe using the nearest neighbors method. For each sample with missing values, it finds the n_neighbors nearest neighbors in the training set and imputes the missing values using the mean value of these neighbors.

# Import necessary packages
from analysistoolbox.data_processing import ImputeMissingValuesUsingNearestNeighbors
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, np.nan]
}
df = pd.DataFrame(data)

# Use the function
imputed_df = ImputeMissingValuesUsingNearestNeighbors(
    dataframe=df, 
    list_of_numeric_columns_to_impute=['A', 'B', 'C', 'D'], 
    number_of_neighbors=2, 
    averaging_method='uniform'
)

VerifyGranularity

The VerifyGranularity function checks the granularity of a given dataframe based on a list of key columns. Granularity in this context refers to the level of detail or summarization in a set of data.

# Import necessary packages
from analysistoolbox.data_processing import VerifyGranularity
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
VerifyGranularity(
    dataframe=df, 
    list_of_key_columns=['Name', 'Age'], 
    set_key_as_index=True, 
    print_as_markdown=False
)

Descriptive Analytics

ConductManifoldLearning

The ConductManifoldLearning function performs manifold learning on a given dataframe and returns a new dataframe with the original columns and the new manifold learning components. Manifold learning is a type of unsupervised learning that is used to reduce the dimensionality of the data.

# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductManifoldLearning
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Use the function
new_df = ConductManifoldLearning(
    dataframe=iris_df, 
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 
    number_of_components=2, 
    random_seed=42, 
    show_component_summary_plots=True, 
    sns_color_palette='Set2',
    summary_plot_size=(10, 10)
)

ConductPrincipalComponentAnalysis

The ConductPrincipalComponentAnalysis function performs Principal Component Analysis (PCA) on a given dataframe. PCA is a technique used in machine learning to reduce the dimensionality of data while retaining as much information as possible.

# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductManifoldLearning
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the function
result = ConductPrincipalComponentAnalysis(
    dataframe=iris_df,
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_components=2
)

ConductPropensityScoreMatching

Conducts propensity score matching to create balanced treatment and control groups for causal inference analysis.

from analysistoolbox.descriptive_analytics import ConductPropensityScoreMatching
import pandas as pd

# Create matched groups based on age, education, and experience
matched_df = ConductPropensityScoreMatching(
    dataframe=df,
    subject_id_column_name='employee_id',
    list_of_column_names_to_base_matching=['age', 'education', 'years_experience'],
    grouping_column_name='received_training',
    control_group_name='No',
    max_matches_per_subject=1,
    balance_groups=True,
    propensity_score_column_name="PS_Score",
    matched_id_column_name="Matched_Employee_ID",
    random_seed=412
)

CreateAssociationRules

The CreateAssociationRules function creates association rules from a given dataframe. Association rules are widely used in market basket analysis, where the goal is to find associations and/or correlations among a set of items.

# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateAssociationRules
import pandas as pd

# Assuming you have a dataframe 'df' with 'TransactionID' and 'Item' columns
result = CreateAssociationRules(
    dataframe=df,
    transaction_id_column='TransactionID',
    items_column='Item',
    support_threshold=0.01,
    confidence_threshold=0.2,
    plot_lift=True,
    plot_title='Association Rules',
    plot_size=(10, 7)
)

CreateGaussianMixtureClusters

The CreateGaussianMixtureClusters function creates Gaussian mixture clusters from a given dataframe. Gaussian mixture models are a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe, and also calculates the probability of each data point belonging to each cluster.

# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateGaussianMixtureClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                  columns= iris['feature_names'] + ['target'])

# Call the CreateGaussianMixtureClusters function
df_clustered = CreateGaussianMixtureClusters(
    dataframe=df,
    list_of_numeric_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Gaussian Mixture Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    sns_color_palette='Set2',
    summary_plot_size=(15, 15),
    random_seed=123,
    maximum_iterations=200
)

CreateHierarchicalClusters

The CreateHierarchicalClusters function creates hierarchical clusters from a given dataframe. Hierarchical clustering is a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe.

# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateHierarchicalClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateHierarchicalClusters function
df_clustered = CreateHierarchicalClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Hierarchical Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)

CreateKMeansClusters

The CreateKMeansClusters function performs K-Means clustering on a given dataset and returns the dataset with an additional column indicating the cluster each record belongs to.

# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateKMeansClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateKMeansClusters function
df_clustered = CreateKMeansClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='K-Means Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)

GenerateEDAWithLIDA

The GenerateEDAWithLIDA function uses the LIDA package from Microsoft to generate exploratory data analysis (EDA) goals.

# Import necessary packages
from analysistoolbox.descriptive_analytics import GenerateEDAWithLIDA
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the GenerateEDAWithLIDA function
df_summary = GenerateEDAWithLIDA(
    dataframe=df,
    llm_api_key="your_llm_api_key_here",
    llm_provider="openai",
    llm_model="gpt-3.5-turbo",
    visualization_library="seaborn",
    goal_temperature=0.50,
    code_generation_temperature=0.05,
    data_summary_method="llm",
    number_of_samples_to_show_in_summary=5,
    return_data_fields_summary=True,
    number_of_goals_to_generate=5,
    plot_recommended_visualization=True,
    show_code_for_recommended_visualization=True
)

File Management

ImportDataFromFolder

The ImportDataFromFolder function imports all CSV and Excel files from a specified folder and combines them into a single DataFrame. It ensures that column names match across all files if specified.

# Import necessary packages
from analysistoolbox.file_management import ImportDataFromFolder

# Specify the folder path
folder_path = "path/to/your/folder"

# Call the ImportDataFromFolder function
combined_df = ImportDataFromFolder(
    folder_path=folder_path,
    force_column_names_to_match=True
)

CreateFileTree

The CreateFileTree function recursively walks a directory tree and prints a diagram of all the subdirectories and files.

# Import necessary packages
from analysistoolbox.file_management import CreateFileTree

# Specify the directory path
directory_path = "path/to/your/directory"

# Call the CreateFileTree function
CreateFileTree(
    path=directory_path,
    indent_spaces=2
)

CreateCopyOfPDF

The CreateCopyOfPDF function creates a copy of a PDF file, with options to specify the start and end pages.

# Import necessary packages
from analysistoolbox.file_management import CreateCopyOfPDF

# Specify the input and output file paths
input_pdf = "path/to/input.pdf"
output_pdf = "path/to/output.pdf"

# Call the CreateCopyOfPDF function
CreateCopyOfPDF(
    input_file=input_pdf,
    output_file=output_pdf,
    start_page=1,
    end_page=5
)

ConvertWordDocsToPDF

The ConvertWordDocsToPDF function converts all Word documents in a specified folder to PDF format.

# Import necessary packages
from analysistoolbox.file_management import ConvertWordDocsToPDF

# Specify the folder paths
word_folder = "path/to/word/documents"
pdf_folder = "path/to/save/pdf/documents"

# Call the ConvertWordDocsToPDF function
ConvertWordDocsToPDF(
    word_folder_path=word_folder,
    pdf_folder_path=pdf_folder,
    open_each_doc=False
)

Geospatial Analysis

ConductClusterAnalysis

The ConductClusterAnalysis function groups geospatial data points into clusters based on their proximity using the DBSCAN algorithm. It mimics the functionality of Bellingcat's geoclustering tool and can optionally display a map of the clusters.

# Import the function
from analysistoolbox.geospatial_analysis import ConductClusterAnalysis
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'lat': [40.7128, 40.7130, 34.0522, 34.0525],
    'lon': [-74.0060, -74.0065, -118.2437, -118.2440]
})

# Run cluster analysis
summary = ConductClusterAnalysis(
    dataframe=df,
    longitude_column='lon',
    latitude_column='lat',
    distance_km=0.5,
    min_samples=2,
    map_clusters=True
)

# Print summary
print(summary)

ConductSpatialAutocorrelation

The ConductSpatialAutocorrelation function identifies spatial clusters and outliers using Local Moran's I. Spatial autocorrelation measures how much a variable at a location is correlated with the same variable at neighboring locations, effectively identifying if similar values cluster together (hotspots/coldspots) or if dissimilar values are near each other (spatial outliers).

# Import the function
from analysistoolbox.geospatial_analysis import ConductSpatialAutocorrelation
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'lat': [40.7128, 40.7130, 34.0522, 34.0525, 34.0500],
    'lon': [-74.0060, -74.0065, -118.2437, -118.2440, -118.2500],
    'value': [100, 105, 10, 15, 8]
})

# Run spatial autocorrelation analysis
results = ConductSpatialAutocorrelation(
    dataframe=df,
    longitude_column='lon',
    latitude_column='lat',
    value_column='value',
    k=5,
    plot=True
)

# Print results
print(results[['lat', 'lon', 'value', 'cluster_category']])

Hypothesis Testing

ChiSquareTestOfIndependence

The ChiSquareTestOfIndependence function performs a chi-square test of independence to determine if there is a significant relationship between two categorical variables.

from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependence

# Create sample data
data = {
    'Education': ['High School', 'College', 'High School', 'Graduate', 'College'],
    'Employment': ['Employed', 'Unemployed', 'Employed', 'Employed', 'Unemployed']
}
df = pd.DataFrame(data)

# Conduct chi-square test
ChiSquareTestOfIndependence(
    dataframe=df,
    first_categorical_column='Education',
    second_categorical_column='Employment',
    plot_contingency_table=True
)

ChiSquareTestOfIndependenceFromTable

The ChiSquareTestOfIndependenceFromTable function performs a chi-square test using a pre-computed contingency table.

from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependenceFromTable

# Create contingency table
contingency_table = pd.DataFrame({
    'Online': [100, 150],
    'In-Store': [200, 175]
}, index=['Male', 'Female'])

# Conduct chi-square test
ChiSquareTestOfIndependenceFromTable(
    contingency_table=contingency_table,
    plot_contingency_table=True
)

ConductCoxProportionalHazardRegression

The ConductCoxProportionalHazardRegression function performs survival analysis using Cox Proportional Hazard regression.

from analysistoolbox.hypothesis_testing import ConductCoxProportionalHazardRegression

# Conduct Cox regression
model = ConductCoxProportionalHazardRegression(
    dataframe=df,
    outcome_column='event',
    duration_column='time',
    list_of_predictor_columns=['age', 'sex', 'treatment'],
    plot_survival_curve=True
)

ConductLinearRegressionAnalysis

The ConductLinearRegressionAnalysis function performs linear regression analysis with optional plotting.

from analysistoolbox.hypothesis_testing import ConductLinearRegressionAnalysis

# Conduct linear regression
results = ConductLinearRegressionAnalysis(
    dataframe=df,
    outcome_column='sales',
    list_of_predictor_columns=['advertising', 'price'],
    plot_regression_diagnostic=True
)

ConductLogisticRegressionAnalysis

The ConductLogisticRegressionAnalysis function performs logistic regression for binary outcomes.

from analysistoolbox.hypothesis_testing import ConductLogisticRegressionAnalysis

# Conduct logistic regression
results = ConductLogisticRegressionAnalysis(
    dataframe=df,
    outcome_column='purchased',
    list_of_predictor_columns=['age', 'income'],
    plot_regression_diagnostic=True
)

OneSampleTTest

The OneSampleTTest function performs a one-sample t-test to compare a sample mean to a hypothesized population mean.

from analysistoolbox.hypothesis_testing import OneSampleTTest

# Conduct one-sample t-test
OneSampleTTest(
    dataframe=df,
    outcome_column='score',
    hypothesized_mean=70,
    alternative_hypothesis='two-sided',
    confidence_interval=0.95
)

OneWayANOVA

The OneWayANOVA function performs a one-way analysis of variance to compare means across multiple groups.

from analysistoolbox.hypothesis_testing import OneWayANOVA

# Conduct one-way ANOVA
OneWayANOVA(
    dataframe=df,
    outcome_column='performance',
    grouping_column='treatment_group',
    plot_sample_distributions=True
)

TTestOfMeanFromStats

The TTestOfMeanFromStats function performs a t-test using summary statistics rather than raw data.

from analysistoolbox.hypothesis_testing import TTestOfMeanFromStats

# Conduct t-test from statistics
TTestOfMeanFromStats(
    sample_mean=75,
    sample_size=30,
    sample_standard_deviation=10,
    hypothesized_mean=70,
    alternative_hypothesis='greater'
)

TTestOfProportionFromStats

The TTestOfProportionFromStats function tests a sample proportion against a hypothesized value.

from analysistoolbox.hypothesis_testing import TTestOfProportionFromStats

# Test proportion from statistics
TTestOfProportionFromStats(
    sample_proportion=0.65,  # 65% proportion
    sample_size=200,         # 200 survey responses
    hypothesized_proportion=0.50,
    alternative_hypothesis='two-sided'
)

TTestOfTwoMeansFromStats

The TTestOfTwoMeansFromStats function compares two means using summary statistics.

from analysistoolbox.hypothesis_testing import TTestOfTwoMeansFromStats

# Compare two means from statistics
TTestOfTwoMeansFromStats(
    first_sample_mean=75,
    first_sample_size=30,
    first_sample_standard_deviation=10,
    second_sample_mean=70,
    second_sample_size=30,
    second_sample_standard_deviation=12
)

TwoSampleTTestOfIndependence

The TwoSampleTTestOfIndependence function performs an independent samples t-test to compare means between two groups.

from analysistoolbox.hypothesis_testing import TwoSampleTTestOfIndependence

# Conduct independent samples t-test
TwoSampleTTestOfIndependence(
    dataframe=df,
    outcome_column='score',
    grouping_column='group',
    alternative_hypothesis='two-sided',
    homogeneity_of_variance=True
)

TwoSampleTTestPaired

The TwoSampleTTestPaired function performs a paired samples t-test for before-after comparisons.

from analysistoolbox.hypothesis_testing import TwoSampleTTestPaired

# Conduct paired samples t-test
TwoSampleTTestPaired(
    dataframe=df,
    first_outcome_column='pre_score',
    second_outcome_column='post_score',
    alternative_hypothesis='greater'
)

Linear Algebra

CalculateEigenvalues

The CalculateEigenvalues function calculates and visualizes the eigenvalues and eigenvectors of a matrix.

from analysistoolbox.linear_algebra import CalculateEigenvalues
import numpy as np

# Create a 2x2 matrix
matrix = np.array([
    [4, -2],
    [1, 1]
])

# Calculate eigenvalues and eigenvectors
CalculateEigenvalues(
    matrix=matrix,
    plot_eigenvectors=True,
    plot_transformation=True
)

ConvertMatrixToRowEchelonForm

The ConvertMatrixToRowEchelonForm function converts a matrix to row echelon form using Gaussian elimination.

from analysistoolbox.linear_algebra import ConvertMatrixToRowEchelonForm
import numpy as np

# Create a matrix
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Convert to row echelon form
row_echelon = ConvertMatrixToRowEchelonForm(
    matrix=matrix,
    show_pivot_columns=True
)

ConvertSystemOfEquationsToMatrix

The ConvertSystemOfEquationsToMatrix function converts a system of linear equations to matrix form.

from analysistoolbox.linear_algebra import ConvertSystemOfEquationsToMatrix
import numpy as np

# Define system of equations: 
# 2x + 3y = 8
# 4x - y = 1
coefficients = np.array([
    [2, 3],
    [4, -1]
])
constants = np.array([8, 1])

# Convert to matrix form
matrix = ConvertSystemOfEquationsToMatrix(
    coefficients=coefficients,
    constants=constants,
    show_determinant=True
)

PlotVectors

The PlotVectors function visualizes vectors in 2D or 3D space.

from analysistoolbox.linear_algebra import PlotVectors
import numpy as np

# Define vectors
vectors = [
    [3, 2],    # First vector
    [-1, 4],   # Second vector
    [2, -3]    # Third vector
]

# Plot vectors
PlotVectors(
    list_of_vectors=vectors,
    origin=[0, 0],
    plot_sum=True,
    grid=True
)

SolveSystemOfEquations

The SolveSystemOfEquations function solves a system of linear equations and optionally visualizes the solution.

from analysistoolbox.linear_algebra import SolveSystemOfEquations
import numpy as np

# Define system of equations:
# 2x + y = 5
# x - 3y = -1
coefficients = np.array([
    [2, 1],
    [1, -3]
])
constants = np.array([5, -1])

# Solve the system
solution = SolveSystemOfEquations(
    coefficients=coefficients,
    constants=constants,
    show_plot=True,
    plot_boundary=10
)

VisualizeMatrixAsLinearTransformation

The VisualizeMatrixAsLinearTransformation function visualizes how a matrix transforms space as a linear transformation.

from analysistoolbox.linear_algebra import VisualizeMatrixAsLinearTransformation
import numpy as np

# Define transformation matrix
transformation_matrix = np.array([
    [2, -1],
    [1, 1]
])

# Visualize the transformation
VisualizeMatrixAsLinearTransformation(
    transformation_matrix=transformation_matrix,
    plot_grid=True,
    plot_unit_vectors=True,
    animation_frames=30
)

LLM

SendPromptToAnthropic

The SendPromptToAnthropic function sends a prompt to Anthropic's Claude API using LangChain. It supports template-based prompting and requires an Anthropic API key.

from analysistoolbox.llm import SendPromptToAnthropic

# Define your prompt template with variables in curly braces
prompt_template = "Given the text: {text}\nSummarize the main points in bullet form."

# Create a dictionary with your input variables
user_input = {
    "text": "Your text to analyze goes here..."
}

# Send the prompt to Claude
response = SendPromptToAnthropic(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    anthropic_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="claude-3-opus-20240229",
    maximum_tokens=1000
)

print(response)

SendPromptToChatGPT

The SendPromptToChatGPT function sends a prompt to OpenAI's ChatGPT API using LangChain. It supports template-based prompting and requires an OpenAI API key.

from analysistoolbox.llm import SendPromptToChatGPT

# Define your prompt template with variables in curly braces
prompt_template = "Analyze the following data: {data}\nProvide key insights."

# Create a dictionary with your input variables
user_input = {
    "data": "Your data to analyze goes here..."
}

# Send the prompt to ChatGPT
response = SendPromptToChatGPT(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    openai_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="gpt-4o-mini",
    maximum_tokens=1000
)

print(response)

Predictive Analytics

CreateARIMAModel

Builds an ARIMA (Autoregressive Integrated Moving Average) model for time series forecasting.

from analysistoolbox.predictive_analytics import CreateARIMAModel
import pandas as pd

# Create time series forecast
forecast = CreateARIMAModel(
    dataframe=df,
    time_column='date',
    value_column='sales',
    forecast_periods=12
)

CreateBoostedTreeModel

Creates a gradient boosted tree model for classification or regression tasks, offering high performance and feature importance analysis.

from analysistoolbox.predictive_analytics import CreateBoostedTreeModel

# Train a boosted tree classifier
model = CreateBoostedTreeModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'tenure', 'satisfaction'],
    is_outcome_categorical=True,
    plot_model_test_performance=True
)

CreateDecisionTreeModel

Builds an interpretable decision tree for classification or regression, with visualization options.

from analysistoolbox.predictive_analytics import CreateDecisionTreeModel

# Create a decision tree for predicting house prices
model = CreateDecisionTreeModel(
    dataframe=df,
    outcome_variable='price',
    list_of_predictor_variables=['sqft', 'bedrooms', 'location'],
    is_outcome_categorical=False,
    maximum_depth=5
)

CreateExponentialSmoothingModel

Builds an exponential smoothing model for time series forecasting.

from analysistoolbox.predictive_analytics import CreateExponentialSmoothingModel

# Create sample time series data
dates = pd.date_range(start='2020-01-01', periods=24, freq='M')
np.random.seed(42)
values = 100 + np.cumsum(np.random.randn(24)) + 5 * np.sin(np.arange(24) * 2 * np.pi / 12)
df = pd.DataFrame({'date': dates, 'sales': values})

# Create an exponential smoothing model for time series forecasting
results = CreateExponentialSmoothingModel(
    dataframe=df,
    time_column='date',
    outcome_column='sales'
)

CreateLinearRegressionModel

Fits a linear regression model with optional scaling and comprehensive performance visualization.

from analysistoolbox.predictive_analytics import CreateLinearRegressionModel

# Predict sales based on advertising spend
model = CreateLinearRegressionModel(
    dataframe=df,
    outcome_variable='sales',
    list_of_predictor_variables=['tv_ads', 'radio_ads', 'newspaper_ads'],
    scale_variables=True,
    plot_model_test_performance=True
)

CreateLogisticRegressionModel

Implements logistic regression for binary classification tasks with regularization options.

from analysistoolbox.predictive_analytics import CreateLogisticRegressionModel

# Predict customer churn probability
model = CreateLogisticRegressionModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'complaints', 'satisfaction'],
    scale_predictor_variables=True,
    show_classification_plot=True
)

CreateNeuralNetwork_SingleOutcome

Builds and trains a neural network for single-outcome prediction tasks, with customizable architecture.

from analysistoolbox.predictive_analytics import CreateNeuralNetwork_SingleOutcome

# Create a neural network for image classification
model = CreateNeuralNetwork_SingleOutcome(
    dataframe=df,
    outcome_variable='label',
    list_of_predictor_variables=feature_columns,
    number_of_hidden_layers=3,
    is_outcome_categorical=True,
    plot_loss=True
)

Prescriptive Analytics

The prescriptive analytics module empowers users to determine the best course of action through optimization and personalized recommendation systems, bridging the gap between insight and action.

ConductLinearOptimization

Identifies the optimal values for input variables to maximize or minimize a specific output, subject to logical constraints and data-driven linear relationships. This approach is essential for resource allocation, logistics optimization, and cost minimization.

import pandas as pd
from analysistoolbox.prescriptive_analytics.ConductLinearOptimization import ConductLinearOptimization

# Sample data
data = pd.DataFrame({
    'input1': [1, 2, 3, 4, 5],
    'input2': [2, 4, 6, 8, 10],
    'output': [10, 20, 30, 40, 50]
})

# Define constraints (optional)
constraints = {
    'input1': (0, 10),  # input1 between 0 and 10
    'input2': (None, 15)  # input2 maximum 15, no minimum
}

# Run optimization
results = ConductLinearOptimization(
    dataframe=data,
    output_variable='output',
    list_of_input_variables=['input1', 'input2'],
    optimization_type='maximize',
    input_constraints=constraints
)

CreateContentBasedRecommender

Generates a customized recommendation engine using a "two-tower" neural network architecture to match users with highly relevant items or resources based on shared characteristics. This is particularly useful for personalized content delivery, such as matching patients to clinical trials or identifying relevant intelligence reports.

from analysistoolbox.prescriptive_analytics import CreateContentBasedRecommender
import pandas as pd

# Create a movie recommendation system
recommender = CreateContentBasedRecommender(
    dataframe=movie_ratings_df,
    outcome_variable='rating',
    user_list_of_predictor_variables=['age', 'gender', 'occupation'],
    item_list_of_predictor_variables=['genre', 'year', 'director', 'budget'],
    user_number_of_hidden_layers=2,
    item_number_of_hidden_layers=2,
    number_of_recommendations=5,
    scale_variables=True,
    plot_loss=True
)

Probability

The probability module provides tools for working with probability distributions and statistical models:

ProbabilityOfAtLeastOne

Calculates and visualizes the probability of at least one event occurring in a series of independent trials.

from analysistoolbox.probability import ProbabilityOfAtLeastOne

# Calculate probability of at least one defect in 10 products
# given a 5% defect rate per product
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.05,
    number_of_events=10,
    format_as_percent=True,
    show_plot=True,
    risk_tolerance=0.20  # Highlight 20% risk threshold
)

# Calculate probability of at least one successful sale
# given 30 customer interactions with 15% success rate
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.15,
    number_of_events=30,
    format_as_percent=True,
    show_plot=True,
    title_for_plot="Sales Success Probability",
    subtitle_for_plot="Probability of at least one sale in 30 customer interactions"
)

Simulations

The simulations module provides a comprehensive set of tools for statistical simulations and probability distributions:

CreateMetalogDistribution

Creates a flexible metalog distribution from data, useful for modeling complex probability distributions.

from analysistoolbox.simulations import CreateMetalogDistribution

# Create a metalog distribution from historical data
distribution = CreateMetalogDistribution(
    dataframe=df,
    variable='sales',
    lower_bound=0,
    number_of_samples=10000,
    plot_metalog_distribution=True
)

CreateMetalogDistributionFromPercentiles

Builds a metalog distribution from known percentile values.

from analysistoolbox.simulations import CreateMetalogDistributionFromPercentiles

# Create distribution from percentiles
distribution = CreateMetalogDistributionFromPercentiles(
    list_of_values=[10, 20, 30, 50],
    list_of_percentiles=[0.1, 0.25, 0.75, 0.9],
    lower_bound=0,
    show_distribution_plot=True
)

CreateSIPDataframe

Generates Stochastically Indexed Percentiles (SIP) for uncertainty analysis.

from analysistoolbox.simulations import CreateSIPDataframe

# Create SIP dataframe for risk analysis
sip_df = CreateSIPDataframe(
    number_of_percentiles=10,
    number_of_trials=1000
)

CreateSLURPDistributionFromLinearRegression

Creates a SIP with relationships preserved (SLURP) based on a linear regression model's prediction interval.

from analysistoolbox.simulations import CreateSLURPDistributionFromLinearRegression

# Create a SLURP distribution from a linear regression model
slurp_dist = CreateSLURPDistributionFromLinearRegression(
    linear_regression_model=model,  # statsmodels regression model
    list_of_prediction_values=[x1, x2, ...],  # values for predictors
    number_of_trials=10000,  # number of samples to generate
    prediction_interval=0.95,  # confidence level for prediction interval
    lower_bound=None,  # optional lower bound constraint
    upper_bound=None  # optional upper bound constraint
)

CreateSLURPDistributionFromLogisticRegression

Creates a SLURP distribution for logistic regression models using confidence intervals for predicted probabilities.

from analysistoolbox.simulations import CreateSLURPDistributionFromLogisticRegression

# Create a SLURP distribution from a logistic regression model
slurp_dist = CreateSLURPDistributionFromLogisticRegression(
    logistic_regression_model=model,  # statsmodels logistic regression model
    list_of_prediction_values=[x1, x2, ...],  # values for predictors
    number_of_trials=10000,  # number of samples to generate
    prediction_interval=0.95,  # confidence level for predicted probabilities
    lower_bound=0,  # lower bound for probabilities (default: 0)
    upper_bound=1,  # upper bound for probabilities (default: 1)
    show_distribution_plot=True  # show the probability distribution plot
)

CreateSLURPDistributionFromExponentialSmoothing

Creates a SLURP distribution for exponential smoothing models using prediction intervals for time series forecasts.

from analysistoolbox.simulations import CreateSLURPDistributionFromExponentialSmoothing

# Create a SLURP distribution from an exponential smoothing model
slurp_dist = CreateSLURPDistributionFromExponentialSmoothing(
    exponential_smoothing_model=model,  # statsmodels exponential smoothing model
    forecast_steps=1,  # number of steps ahead to forecast
    number_of_trials=10000,  # number of samples to generate
    prediction_interval=0.95,  # confidence level for prediction interval
    lower_bound=None,  # optional lower bound constraint
    upper_bound=None,  # optional upper bound constraint
    show_distribution_plot=True  # show the forecast distribution plot
)

SimulateCountOfSuccesses

Simulates binomial outcomes (number of successes in fixed trials).

from analysistoolbox.simulations import SimulateCountOfSuccesses

# Simulate customer conversion rates
results = SimulateCountOfSuccesses(
    probability_of_success=0.15,
    sample_size_per_trial=100,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateCountOutcome

Simulates Poisson-distributed count data.

from analysistoolbox.simulations import SimulateCountOutcome

# Simulate daily customer arrivals
arrivals = SimulateCountOutcome(
    expected_count=25,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateCountUntilFirstSuccess

Simulates geometric distributions (trials until first success).

from analysistoolbox.simulations import SimulateCountUntilFirstSuccess

# Simulate number of attempts until success
attempts = SimulateCountUntilFirstSuccess(
    probability_of_success=0.2,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateNormallyDistributedOutcome

Generates normally distributed random variables.

from analysistoolbox.simulations import SimulateNormallyDistributedOutcome

# Simulate product weights
weights = SimulateNormallyDistributedOutcome(
    mean=100,
    standard_deviation=5,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateTDistributedOutcome

Generates Student's t-distributed random variables.

from analysistoolbox.simulations import SimulateTDistributedOutcome

# Simulate with heavy-tailed distribution
values = SimulateTDistributedOutcome(
    degrees_of_freedom=5,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateTimeBetweenEvents

Simulates exponentially distributed inter-arrival times.

from analysistoolbox.simulations import SimulateTimeBetweenEvents

# Simulate time between customer arrivals
times = SimulateTimeBetweenEvents(
    average_time_between_events=30,
    number_of_trials=10000,
    plot_simulation_results=True
)

SimulateTimeUntilNEvents

Simulates Erlang-distributed waiting times.

from analysistoolbox.simulations import SimulateTimeUntilNEvents

# Simulate time until 5 events occur
wait_time = SimulateTimeUntilNEvents(
    average_time_between_events=10,
    number_of_events=5,
    number_of_trials=10000,
    plot_simulation_results=True
)

Statistics

The statistics module provides essential tools for statistical inference and estimation:

CalculateConfidenceIntervalOfMean

Calculates confidence intervals for population means, automatically handling both large (z-distribution) and small (t-distribution) sample sizes.

from analysistoolbox.statistics import CalculateConfidenceIntervalOfMean

# Calculate 95% confidence interval for average customer spending
ci_results = CalculateConfidenceIntervalOfMean(
    sample_mean=45.2,
    sample_standard_deviation=12.5,
    sample_size=100,
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Average Spending ($)"
)

CalculateConfidenceIntervalOfProportion

Calculates confidence intervals for population proportions, with automatic selection of the appropriate distribution based on sample size.

from analysistoolbox.statistics import CalculateConfidenceIntervalOfProportion

# Calculate 95% confidence interval for customer satisfaction rate
ci_results = CalculateConfidenceIntervalOfProportion(
    sample_proportion=0.78,  # 78% satisfaction rate
    sample_size=200,         # 200 survey responses
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Satisfaction Rate"
)

Visualizations

The visualizations module provides a comprehensive set of tools for creating publication-quality statistical plots and charts:

Plot100PercentStackedBarChart

Creates a 100% stacked bar chart for comparing proportional compositions across categories.

from analysistoolbox.visualizations import Plot100PercentStackedBarChart

# Create a stacked bar chart showing customer segments by region
chart = Plot100PercentStackedBarChart(
    dataframe=df,
    categorical_column_name='Region',
    value_column_name='Customers',
    grouping_column_name='Segment'
)

PlotBarChart

Creates a customizable bar chart with options for highlighting specific categories.

from analysistoolbox.visualizations import PlotBarChart

# Create a bar chart of sales by product
chart = PlotBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    top_n_to_highlight=3,
    highlight_color="#b0170c"
)

PlotBoxWhiskerByGroup

Creates box-and-whisker plots for comparing distributions across groups.

from analysistoolbox.visualizations import PlotBoxWhiskerByGroup

# Compare salary distributions across departments
plot = PlotBoxWhiskerByGroup(
    dataframe=df,
    value_column_name='Salary',
    grouping_column_name='Department'
)

PlotBulletChart

Creates bullet charts for comparing actual values against targets with optional range bands.

from analysistoolbox.visualizations import PlotBulletChart

# Create bullet chart comparing actual vs target sales
chart = PlotBulletChart(
    dataframe=df,
    value_column_name='Actual_Sales',
    grouping_column_name='Region',
    target_value_column_name='Target_Sales',
    list_of_limit_columns=['Min_Sales', 'Max_Sales']
)

PlotCard

Creates a simple card-style visualization with a value and an optional value label.

from analysistoolbox.visualizations import PlotCard

# Create a simple KPI card
card = PlotCard(
    value=125000,  # main value to display
    value_label="Monthly Revenue",  # optional label
    value_font_size=30,  # size of the main value
    value_label_font_size=14,  # size of the label
    figure_size=(3, 2)  # dimensions of the card
)

PlotClusteredBarChart

Creates grouped bar charts for comparing multiple categories across groups.

from analysistoolbox.visualizations import PlotClusteredBarChart

# Create clustered bar chart of sales by product and region
chart = PlotClusteredBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    grouping_column_name='Region'
)

PlotContingencyHeatmap

Creates a heatmap visualization of contingency tables.

from analysistoolbox.visualizations import PlotContingencyHeatmap

# Create heatmap of customer segments vs purchase categories
heatmap = PlotContingencyHeatmap(
    dataframe=df,
    categorical_column_name_1='Customer_Segment',
    categorical_column_name_2='Purchase_Category',
    normalize_by="columns"
)

PlotCorrelationMatrix

Creates correlation matrix visualizations with optional scatter plots.

from analysistoolbox.visualizations import PlotCorrelationMatrix

# Create correlation matrix of numeric variables
matrix = PlotCorrelationMatrix(
    dataframe=df,
    list_of_value_column_names=['Age', 'Income', 'Spending'],
    show_as_pairplot=True
)

PlotDensityByGroup

Creates density plots for comparing distributions across groups.

from analysistoolbox.visualizations import PlotDensityByGroup

# Compare age distributions across customer segments
plot = PlotDensityByGroup(
    dataframe=df,
    value_column_name='Age',
    grouping_column_name='Customer_Segment'
)

PlotDotPlot

Creates dot plots with optional connecting lines between groups.

from analysistoolbox.visualizations import PlotDotPlot

# Compare before/after measurements
plot = PlotDotPlot(
    dataframe=df,
    categorical_column_name='Metric',
    value_column_name='Value',
    group_column_name='Time_Period',
    connect_dots=True
)

PlotHeatmap

Creates customizable heatmaps for visualizing two-dimensional data.

from analysistoolbox.visualizations import PlotHeatmap

# Create heatmap of customer activity by hour and day
heatmap = PlotHeatmap(
    dataframe=df,
    x_axis_column_name='Hour',
    y_axis_column_name='Day',
    value_column_name='Activity',
    color_palette="RdYlGn"
)

PlotOverlappingAreaChart

Creates stacked or overlapping area charts for time series data.

from analysistoolbox.visualizations import PlotOverlappingAreaChart

# Show product sales trends over time
chart = PlotOverlappingAreaChart(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    variable_column_name='Product'
)

PlotRiskTolerance

Creates specialized plots for risk analysis and tolerance visualization.

from analysistoolbox.visualizations import PlotRiskTolerance

# Visualize risk tolerance levels
plot = PlotRiskTolerance(
    dataframe=df,
    value_column_name='Risk_Score',
    tolerance_level_column_name='Tolerance'
)

PlotScatterplot

Creates scatter plots with optional trend lines and grouping.

from analysistoolbox.visualizations import PlotScatterplot

# Create scatter plot of age vs income
plot = PlotScatterplot(
    dataframe=df,
    x_axis_column_name='Age',
    y_axis_column_name='Income',
    color_by_column_name='Education'
)

PlotSingleVariableCountPlot

Creates count plots for categorical variables.

from analysistoolbox.visualizations import PlotSingleVariableCountPlot

# Show distribution of customer types
plot = PlotSingleVariableCountPlot(
    dataframe=df,
    categorical_column_name='Customer_Type',
    top_n_to_highlight=2
)

PlotSingleVariableHistogram

Creates histograms for continuous variables.

from analysistoolbox.visualizations import PlotSingleVariableHistogram

# Create histogram of transaction amounts
plot = PlotSingleVariableHistogram(
    dataframe=df,
    value_column_name='Transaction_Amount',
    show_mean=True,
    show_median=True
)

PlotTimeSeries

Creates time series plots with optional grouping and marker sizes.

from analysistoolbox.visualizations import PlotTimeSeries

# Plot monthly sales with grouping
plot = PlotTimeSeries(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    grouping_column_name='Region',  # optional grouping
    marker_size_column_name='Volume',  # optional markers
    line_color='#3269a8',
    figure_size=(8, 5)
)

RenderTableOne

Creates publication-ready summary statistics tables comparing variables across groups.

from analysistoolbox.visualizations import RenderTableOne

# Create summary statistics table comparing age, education by department
table = RenderTableOne(
    dataframe=df,
    value_column_name='Age',  # outcome variable
    grouping_column_name='Department',  # grouping variable
    list_of_row_variables=['Education', 'Experience'],  # variables to compare
    table_format='html',  # output format
    show_p_value=True  # include statistical tests
)

Contributions

Contributions to the analysistoolbox package are welcome! Please submit a pull request with your changes.

License

The analysistoolbox package is licensed under the GNU License. Read more about the GNU License at https://www.gnu.org/licenses/gpl-3.0.html.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

analysistoolbox-3.9.91.tar.gz (309.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

analysistoolbox-3.9.91-py3-none-any.whl (413.2 kB view details)

Uploaded Python 3

File details

Details for the file analysistoolbox-3.9.91.tar.gz.

File metadata

  • Download URL: analysistoolbox-3.9.91.tar.gz
  • Upload date:
  • Size: 309.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.3

File hashes

Hashes for analysistoolbox-3.9.91.tar.gz
Algorithm Hash digest
SHA256 4af49dffc702cf656172c394f3f4e32a6580acfa58f4a1d8fcdd607f7eff6c01
MD5 3baed008032f517a389b322cf9f9070e
BLAKE2b-256 52704232272a94708b9e2577afcc90dfce531f9978ad7d10863260a27ce6a81c

See more details on using hashes here.

File details

Details for the file analysistoolbox-3.9.91-py3-none-any.whl.

File metadata

File hashes

Hashes for analysistoolbox-3.9.91-py3-none-any.whl
Algorithm Hash digest
SHA256 02b7d59dfde9cee1f683fdeebd8abe709d7291ea90659ce51089507e9fd0af7d
MD5 33f2a3dd65e942fa3f89e054b8be0d54
BLAKE2b-256 63eb2d44faf95819c412a66d53ee45a105387061f7bea24e7c52b66e185a2ed5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page