A toolkit for exploratory data analysis in Python
Exploratory Data Analysis (EDA) Python Toolkit
The project is sponsored by Malmö Universitet, developed by Eng. Marco Schivo and Eng. Alberto Biscalchin under the supervision of Associate Professor Yuanji Cheng, and is released under the MIT License. It is open source and available for anyone to use and contribute to.
Internal course code reference: MA661E
Installation Instructions
To install the pyEDAkit package, follow the steps below:
- Prerequisites: Ensure you have Python 3.7 or higher installed. You can download it from python.org.
- Install from PyPI: Use the following command to install pyEDAkit:
  pip install pyEDAkit
- Verify Installation: After installation, verify it by running:
  python -c "import pyEDAkit; print('pyEDAkit installed successfully!')"
- Optional (Upgrade): To upgrade to the latest version:
  pip install --upgrade pyEDAkit
You're now ready to use pyEDAkit.
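Once installed, a quick sanity check is to import the submodules exercised in the examples further down this page. This is only a smoke test; it assumes nothing beyond the submodule names (clustering, linear, standardization, IntrinsicDimensionality) that already appear in those examples.
# Smoke test: import the submodules used in the examples below.
from pyEDAkit import clustering, linear, standardization, IntrinsicDimensionality

for module in (clustering, linear, standardization, IntrinsicDimensionality):
    print("loaded:", module.__name__)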
Warnings
Please note that this project is still under development and is not yet ready for production use. The code is subject to change, and new features will be added over time. We welcome contributions and feedback from the community to improve the toolkit.
Overview
This repository implements MATLAB-style functions in Python for various data analysis, clustering, dimensionality reduction, and graph algorithms. These functions leverage popular Python libraries such as numpy, scipy, matplotlib, networkx, and scikit-learn, while maintaining a familiar MATLAB-like syntax and behavior.
Features
1. Clustering and Dimensionality Reduction
- Hierarchical Clustering (linkage, cluster):
  - MATLAB-style hierarchical clustering using scipy.cluster.hierarchy.
  - Supports single, complete, average, ward, and other linkage methods.
  - MATLAB-like cluster function for cutting hierarchical clusters based on distance or cluster count.
- K-Means Clustering (kmeans):
  - MATLAB-style K-Means implementation with support for:
    - Initialization methods: k-means++, random, or user-specified.
    - Metrics: sqeuclidean distance.
    - Number of replicates and maximum iterations.
  - Outputs cluster assignments, centroids, and within-cluster sums of distances.
- Silhouette Analysis (silhouette):
  - Computes silhouette values to evaluate clustering quality.
  - Supports various distance metrics, including euclidean, manhattan, cosine, and minkowski.
  - Generates detailed silhouette plots for cluster visualization.
- Silhouette Criterion Evaluation (SilhouetteEvaluation class):
  - Evaluates clustering solutions for different cluster counts (k) using the silhouette criterion.
  - Identifies the optimal number of clusters (OptimalK) based on silhouette scores.
  - Handles missing data and supports weighted or unweighted silhouette averages.
  - Visualizes silhouette criterion scores vs. the number of clusters.
- PCA and SVD:
  - PCA: Computes principal components and visualizes scree plots and scatter matrices.
  - SVD: Performs Singular Value Decomposition with visualization of singular values.
- Non-Negative Matrix Factorization (NMF):
  - Reduces data dimensionality while ensuring non-negativity constraints.
- Factor Analysis (FA):
  - Estimates latent factors using sklearn's FactorAnalysis.
- Linear Discriminant Analysis (LDA):
  - Reduces dimensionality while maximizing class separability.
- Random Projection:
  - Performs Gaussian Random Projection to reduce dimensionality.
2. Graph Algorithms
- Minimum Spanning Tree (MST) (minspantree):
  - MATLAB-style wrapper using networkx.
  - Supports both Prim's ('dense') and Kruskal's ('sparse') algorithms.
  - Option to extract a spanning tree for a specific connected component or a spanning forest.
3. Intrinsic Dimensionality Estimation
- Packing Numbers: Computes intrinsic dimensionality using packing arguments.
- Maximum Likelihood Estimation (MLE): Estimates intrinsic dimensionality using a k-nearest-neighbor approach.
- Correlation Dimension: Estimates intrinsic dimensionality via correlation methods.
- Pettis Method: Computes intrinsic dimensionality using Pettis et al.'s algorithm.
4. Normalization
- Standardization (with_std_dev): Standardizes data to zero mean and unit variance.
- Min-Max Normalization (min_max_norm): Rescales data to the range [0, 1].
- Sphering (sphering): Whitens data, decorrelating variables and setting variances to 1.
5. Clustering Quality Metrics
- Cophenetic Correlation (cophenet): Computes the cophenetic correlation coefficient to measure how well a dendrogram preserves the original pairwise distances.
- Silhouette Evaluation and Visualization:
  - silhouette: Computes silhouette values and plots silhouette scores for each cluster.
  - SilhouetteEvaluation: Evaluates and visualizes the silhouette criterion for determining the optimal number of clusters.
Examples
!IMPORTANT: The examples are not finished yet; they are a draft of what we are going to implement. The import statements are placeholders and need to be replaced with the actual module name that will be available through PyPI soon.
Clustering
Linkage Function
This example demonstrates the usage of the linkage function for hierarchical clustering. The linkage function builds a hierarchical cluster tree (also known as a dendrogram) using various linkage methods. We show two use cases: clustering a large dataset into groups and visualizing the hierarchy using a dendrogram.
Python Code:
import numpy as np
import matplotlib.pyplot as plt
from pyEDAkit.clustering import linkage, cluster
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
from mpl_toolkits.mplot3d import Axes3D
def test_linkage():
    # Step 1: Randomly generate sample data with 20,000 observations
    np.random.seed(0)  # For reproducibility
    X = np.random.rand(20000, 3)

    # Step 2: Create a hierarchical cluster tree using the ward linkage method
    Z = linkage(X, method='ward')

    # Step 3: Cluster the data into a maximum of four groups
    max_clusters = 4
    cluster_labels = cluster(Z, 'MaxClust', max_clusters, criterion='maxclust')

    # Step 4: Plot the result in 3D
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels, cmap='viridis', s=10)
    ax.set_xlabel('X-axis')
    ax.set_ylabel('Y-axis')
    ax.set_zlabel('Z-axis')
    plt.title('3D Scatter Plot of Hierarchical Clustering')
    plt.colorbar(scatter, ax=ax, label='Cluster Label')
    plt.show()

    # Step 5: Define the dissimilarity matrix
    X = np.array([
        [0, 1, 2, 3],
        [1, 0, 4, 5],
        [2, 4, 0, 6],
        [3, 5, 6, 0]
    ])

    # Step 6: Convert the dissimilarity matrix to vector form using squareform
    y = squareform(X)

    # Step 7: Create a hierarchical cluster tree using the 'complete' method
    Z = linkage(y, method='complete')

    # Step 8: Print the resulting linkage matrix
    print("Linkage matrix (Z):")
    print(Z)

    # Step 9: Plot the dendrogram
    plt.figure(figsize=(10, 6))
    dendrogram(
        Z,
        labels=[1, 2, 3, 4],  # Use MATLAB-style indices for the leaf nodes
        leaf_font_size=10     # Adjust font size for clarity
    )
    plt.title('Dendrogram')
    plt.xlabel('Leaf Nodes')
    plt.ylabel('Linkage Distance')
    plt.show()


test_linkage()
Visualizations
3D Scatter Plot of Hierarchical Clustering
This plot visualizes the clusters formed by hierarchical clustering on a randomly generated dataset of 20,000 observations. The data points are colored by their cluster labels (maximum of 4 clusters).
Dendrogram
The dendrogram represents the hierarchical clustering of a small dataset, built from a dissimilarity matrix. The complete linkage method is used to compute the hierarchical structure, and the result is visualized as a dendrogram.
Cluster Function
The cluster function is a MATLAB-style wrapper for SciPy's fcluster function, allowing flexible and intuitive hierarchical clustering. This example demonstrates its usage with various clustering criteria, such as distance thresholds, inconsistent measures, and a fixed number of clusters. Additionally, it supports multiple cutoffs to produce a matrix of cluster assignments.
Example:
import numpy as np
from pyEDAkit.clustering import cluster, linkage
def test_cluster():
    # Generate sample data
    X = np.random.rand(10, 3)

    # Compute linkage matrix
    Z = linkage(X, method='ward')

    # 1) Cut off by distance = 0.7
    T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')

    # 2) Cut off by inconsistent measure
    T_inconsist = cluster(Z, 'Cutoff', 1.5)

    # 3) Force a maximum of 3 clusters
    T_maxclust = cluster(Z, 'MaxClust', 3)

    # 4) Multiple cutoffs -> T is an m-by-l matrix
    T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
    print(T_multi.shape)  # (10, 3)


test_cluster()
Output:
This example showcases the flexibility of the cluster function. Below is the output from the final step, where multiple cutoffs are used:
(10, 3)
Key Points:
- Cut off by Distance: Creates clusters by specifying a distance threshold. For example:
  T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')
- Cut off by Inconsistent Measure: Uses the default 'inconsistent' criterion for clustering:
  T_inconsist = cluster(Z, 'Cutoff', 1.5)
- Force a Maximum Number of Clusters: Ensures the data is divided into a fixed number of clusters:
  T_maxclust = cluster(Z, 'MaxClust', 3)
- Multiple Cutoffs: Produces a matrix where each column corresponds to cluster assignments for a specific cutoff:
  T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
The flexibility of cluster makes it an ideal choice for hierarchical clustering tasks requiring MATLAB-like functionality in Python.
K-means Clustering
The kmeans function, imported from the pyEDAkit.clustering module, provides a MATLAB-style implementation of the K-means algorithm, allowing intuitive and flexible clustering with support for optional parameters such as the number of replicates and maximum iterations.
Example:
import numpy as np
import pandas as pd
from pyEDAkit.clustering import kmeans
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.metrics import accuracy_score
def test_kmeans():
    # Load Iris dataset
    iris_path = "../datasets/iris_dataset.csv"
    iris_data = pd.read_csv(iris_path)

    # Use only petal_length and petal_width features (2D data)
    X = iris_data.iloc[:, [2, 3]].values   # Columns for petal_length and petal_width
    y_true = iris_data.iloc[:, -1].values  # True labels (species)

    # Map the target labels to numeric values for true labels
    label_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    y_numeric = np.array([label_mapping[label] for label in y_true])

    # Step 1: Apply k-means clustering
    k = 3  # Number of clusters
    idx, C, sumd, D = kmeans(X, k, 'Distance', 'sqeuclidean', 'Replicates', 5, 'MaxIter', 300)

    # Step 2: Create a 2D grid for the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                         np.arange(y_min, y_max, 0.05))

    # Combine the grid into a list of points
    grid_points = np.c_[xx.ravel(), yy.ravel()]  # Flatten the grid into points for classification

    # Define the function to assign grid points to clusters
    def assign_to_clusters(grid_points, centroids):
        """Assign grid points to the nearest cluster based on centroids."""
        distances = np.linalg.norm(grid_points[:, None] - centroids, axis=2)  # Euclidean distance
        return np.argmin(distances, axis=1)  # Assign to the nearest centroid

    # Step 3: Assign each grid point to a cluster
    Z = assign_to_clusters(grid_points, C)
    Z = Z.reshape(xx.shape)  # Reshape to match the grid dimensions

    # Step 4: Plot the results
    plt.figure(figsize=(10, 6))

    # Plot cluster regions
    cmap = ListedColormap(['#FFCCCC', '#CCFFCC', '#CCCCFF'])  # Colors for regions
    plt.contourf(xx, yy, Z, cmap=cmap, alpha=0.5)

    # Plot data points
    plt.scatter(X[:, 0], X[:, 1], c=idx - 1, cmap='viridis', edgecolor='k', s=50, label='Data')

    # Plot centroids
    plt.scatter(C[:, 0], C[:, 1], c='red', s=200, marker='X', label='Centroids')

    # Add labels and title
    plt.title("K-means Clustering with Correctly Colored Areas")
    plt.xlabel("Petal Length (cm)")
    plt.ylabel("Petal Width (cm)")
    plt.legend(loc='best')
    plt.show()

    # Step 5: Print results
    print("Cluster assignments (idx):")
    print(idx)
    print("\nCentroids (C):")
    print(C)
    print("\nWithin-cluster sum of distances (sumd):")
    print(sumd)


test_kmeans()
Centroids displacement plot
Bash Output:
Cluster assignments (idx):
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
3. 3. 3. 3. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3.
1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
Centroids (C):
[[5.59583333 2.0375 ]
[1.464 0.244 ]
[4.26923077 1.34230769]]
Within-cluster sum of distances (sumd):
[16.29166667 2.0384 13.05769231]
Key Points:
- Basic Clustering:
  - Cluster assignments, centroids, within-cluster sums of distances, and point-to-centroid distances are calculated.
- Iris Dataset Example:
  - Uses the Iris dataset to cluster the data and compares the result with the true labels for accuracy.
- Visualization:
  - Two features (petal length and petal width) are used directly for the 2D visualization.
  - Colored areas represent cluster boundaries, and centroids are marked with red X markers.
- Accuracy:
  - Clustering accuracy is calculated by mapping clusters to the closest true labels (see the sketch below).
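The example above imports accuracy_score but leaves the mapping step implicit. Below is a minimal sketch of one way to do it, assuming a simple majority-vote mapping from cluster labels to true labels; the helper cluster_accuracy and its use of idx and y_numeric are illustrative, not part of pyEDAkit.
import numpy as np
from sklearn.metrics import accuracy_score

def cluster_accuracy(cluster_labels, true_labels):
    """Map each cluster to its most frequent true label, then score accuracy."""
    mapped = np.empty_like(true_labels)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        values, counts = np.unique(true_labels[mask], return_counts=True)
        mapped[mask] = values[np.argmax(counts)]  # majority vote inside cluster c
    return accuracy_score(true_labels, mapped)

# Inside test_kmeans, after kmeans has run:
# print("Accuracy:", cluster_accuracy(np.asarray(idx).astype(int), y_numeric))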
minspantree - Minimum Spanning Tree
The minspantree function computes the Minimum Spanning Tree (MST) of a given graph. It supports both Prim's and Kruskal's algorithms and can return the MST for a specific component (Type='tree') or the entire graph (Type='forest').
Example:
import networkx as nx
import matplotlib.pyplot as plt
from pyEDAkit.clustering import minspantree
def test_minspantree():
    # Create a graph with weighted edges
    G = nx.Graph()
    G.add_weighted_edges_from([
        (1, 2, 2.0),
        (2, 3, 1.5),
        (2, 4, 3.0),
        (1, 5, 4.0),
        (3, 5, 2.5),
        (4, 5, 1.0),
        (5, 6, 2.0)
    ])

    # Visualize the original graph
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 3, 1)
    pos = nx.spring_layout(G, seed=42)  # For consistent layout
    nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
    plt.title("Original Graph")

    # 1) Compute MST using Prim's algorithm (default method)
    T_prim, pred_prim = minspantree(G)  # Assuming minspantree implements Prim's by default

    # Visualize MST with Prim's algorithm
    plt.subplot(1, 3, 2)
    nx.draw(T_prim, pos, with_labels=True, node_color='lightgreen', edge_color='blue', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_prim, 'weight')
    nx.draw_networkx_edge_labels(T_prim, pos, edge_labels=labels)
    plt.title("MST (Prim's Algorithm)")

    # 2) Compute MST using Kruskal's algorithm (Method='sparse')
    T_kruskal, pred_kruskal = minspantree(G, 'Method', 'sparse', 'Root', 2, 'Type', 'forest')

    # Visualize MST with Kruskal's algorithm
    plt.subplot(1, 3, 3)
    nx.draw(T_kruskal, pos, with_labels=True, node_color='lightcoral', edge_color='purple', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_kruskal, 'weight')
    nx.draw_networkx_edge_labels(T_kruskal, pos, edge_labels=labels)
    plt.title("MST (Kruskal's Algorithm)")

    plt.tight_layout()
    plt.show()

    # Print details of the MSTs
    print("\n--- MST using Prim's Algorithm ---")
    print("Edges of T (Prim):", list(T_prim.edges(data=True)))
    print("Predecessors (Prim):", pred_prim)

    print("\n--- MST using Kruskal's Algorithm ---")
    print("Edges of T (Kruskal):", list(T_kruskal.edges(data=True)))
    print("Predecessors (Kruskal):", pred_kruskal)


test_minspantree()
Visualization
- Original Graph: Displays the graph with all its nodes and edges, labeled with weights.
- MST (Prim's Algorithm): Highlights the MST computed using Prim's algorithm, rooted at node 1.
- MST (Kruskal's Algorithm): Displays the MST computed using Kruskal's algorithm, including a spanning forest for all components.
Bash Output
--- MST using Prim's Algorithm ---
Edges of T (Prim): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Prim): {1: 0, 2: 1, 3: 2, 4: 5, 5: 3, 6: 5}
--- MST using Kruskal's Algorithm ---
Edges of T (Kruskal): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Kruskal): {1: 2, 2: 0, 3: 2, 4: 5, 5: 3, 6: 5}
Process finished with exit code 0
Key Points
- Prim's Algorithm:
  - Grows the MST from a specific root node.
  - Produces a single tree for the connected component containing the root.
- Kruskal's Algorithm:
  - Builds the MST by adding edges with the smallest weights.
  - Can generate a spanning forest if the graph is disconnected.
- Visualization:
  - Easily compares the original graph with the MSTs generated by different algorithms.
- Customizability:
  - Supports options like specifying the root node and generating a forest for disconnected graphs.
cophenet - Cophenetic Correlation Coefficient
The cophenet function computes the cophenetic correlation coefficient, a measure of how well the hierarchical clustering structure reflects the pairwise distances among the data points. It also provides the cophenetic distances for the linkage matrix.
Enhanced Example:
import numpy as np
from pyEDAkit.IntrinsicDimensionality import pdist
from pyEDAkit.clustering import linkage, cophenet
import matplotlib.pyplot as plt
def test_cophenet():
    # Step 1: Generate sample data
    np.random.seed(42)
    X = np.vstack([
        np.random.rand(5, 2),      # Cluster 1
        np.random.rand(5, 2) + 5   # Cluster 2 (shifted by +5)
    ])

    # Step 2: Compute pairwise distances
    Y = pdist(X)

    # Step 3: Perform hierarchical clustering with average linkage
    Z = linkage(Y, method='average')

    # Step 4: Compute cophenetic correlation coefficient
    c, d = cophenet(Z, Y)

    # Step 5: Display results
    print("Expected Result: Pairwise distances")
    print("Y (first 10 values):", Y[:10])
    print("\nActual Result: Cophenetic distances")
    print("d (first 10 values):", d[:10])
    print("\nCophenetic correlation coefficient:", c)

    # Step 6: Visualize clustering with dendrogram
    from scipy.cluster.hierarchy import dendrogram
    plt.figure(figsize=(10, 5))
    plt.title("Dendrogram")
    plt.xlabel("Data Points")
    plt.ylabel("Distance")
    dendrogram(Z, leaf_rotation=90., leaf_font_size=10.)
    plt.show()


# Run the example
test_cophenet()
Explanation:
- Input Data:
  - A synthetic dataset X is created with two distinct clusters: one centered around (0, 0) and another around (5, 5).
- Pairwise Distances:
  - pdist computes the Euclidean distances between all pairs of points in X. These are the expected pairwise distances.
- Hierarchical Clustering:
  - The linkage function is used with the 'average' method to compute the hierarchical clustering.
- Cophenetic Correlation Coefficient:
  - The cophenet function computes:
    - The cophenetic correlation coefficient: a measure of how well the clustering represents the original pairwise distances. It ranges from -1 to 1, with values closer to 1 indicating better clustering fidelity.
    - The cophenetic distances: the distances implied by the hierarchical clustering.
- Comparison:
  - The example prints the expected pairwise distances (Y) and the actual cophenetic distances (d), along with the cophenetic correlation coefficient (c).
- Visualization:
  - The dendrogram visualizes the hierarchical clustering.
Expected Output:
Expected Result: Pairwise distances
Y (first 10 values): [0.5017136 0.82421549 0.32755369 0.33198071 6.83944824 6.92460439
6.40512716 6.72486612 6.66463903 0.72642889]
Actual Result: Cophenetic distances
d (first 10 values): [0.5310849 0.7441755 0.32755369 0.5310849 6.90984673 6.90984673
6.90984673 6.90984673 6.90984673 0.7441755 ]
Cophenetic correlation coefficient: 0.9966209146535578
Process finished with exit code 0
Key Takeaways:
- The cophenetic correlation coefficient (c ≈ 0.997) indicates that the hierarchical clustering structure is a very good representation of the original pairwise distances.
- The cophenetic distances (d) closely match the original distances in Y.
- The dendrogram visually confirms the clustering structure, with two distinct groups evident.
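As a sanity check, SciPy's own cophenet can be run on the same data. The sketch below depends only on SciPy and reproduces the coefficient reported above; pyEDAkit's cophenet is assumed here to follow the same (c, d) return convention shown in the example.
import numpy as np
from scipy.cluster.hierarchy import linkage as scipy_linkage, cophenet as scipy_cophenet
from scipy.spatial.distance import pdist

np.random.seed(42)
X = np.vstack([np.random.rand(5, 2), np.random.rand(5, 2) + 5])
Y = pdist(X)
Z = scipy_linkage(Y, method='average')

# SciPy's cophenet also returns (coefficient, cophenetic distances)
c_ref, d_ref = scipy_cophenet(Z, Y)
print("SciPy cophenetic correlation coefficient:", c_ref)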
silhouette - Evaluating Clustering Quality
The silhouette function computes silhouette scores for clustering and provides an optional visualization of the silhouette plot. It measures how similar an object is to its own cluster compared to other clusters, with scores ranging from -1 to 1. Higher scores indicate better-defined clusters.
Example Usage
import numpy as np
from pyEDAkit.clustering import silhouette
def test_silhouette():
    # Step 1: Generate synthetic data
    X = np.random.rand(20, 2)                  # 20 data points with 2 features
    labels = np.random.randint(0, 3, size=20)  # Assign random labels for 3 clusters

    # Step 2: Silhouette analysis using default Euclidean distance with visualization
    s, fig = silhouette(X, labels)  # Default parameters
    print("Silhouette values:\n", s)

    # Step 3: Silhouette analysis using Minkowski distance (p=3) without visualization
    s2, _ = silhouette(X, labels, Distance='minkowski', DistParameter={'p': 3}, do_plot=False)
    print("Silhouette values with Minkowski distance, p=3:\n", s2)


# Run the example
test_silhouette()
Explanation
- Input Data:
  - X: The dataset with shape (n_samples, n_features). In this example, it consists of 20 randomly generated data points in 2D space.
  - labels: Cluster assignments for each data point. Random labels are generated for three clusters.
- Silhouette Analysis:
  - The silhouette score is computed for each data point as:
    $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$
    where:
    - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
    - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).
- Default Case:
  - By default, the function computes the silhouette scores using the Euclidean distance and visualizes the silhouette plot.
- Custom Distance:
  - In the second case, the function uses the Minkowski distance with $p = 3$ (specified using the DistParameter argument). No plot is produced (do_plot=False).
- Output:
  - s: An array of silhouette scores for each data point.
  - fig: A matplotlib figure object showing the silhouette plot (if visualization is enabled).
Expected Output
For the given random data, the output will include:
- Silhouette Scores (Default Distance):
  Silhouette values: [-0.20917609 -0.11501692 -0.09117026 0.07603928 -0.49128607 -0.34661769 -0.20534282 -0.02257866 -0.15532452 -0.36940948 -0.26504272 0.32133183 -0.08229476 0.30387826 -0.09858198 -0.26869494 -0.59434334 0.08777368 0.11872492 0.06386749]
- Silhouette Scores (Minkowski Distance, p=3):
  Silhouette values with Minkowski distance, p=3: [-0.19668877 -0.14506527 -0.06741624 0.07712329 -0.48447927 -0.32749879 -0.1640281 -0.04252481 -0.14681262 -0.39123349 -0.26663026 0.3170098 -0.09092208 0.30229521 -0.06645392 -0.27165836 -0.5954229 0.07757885 0.11677314 0.06568025]
- Visualization:
  - The silhouette plot displays the silhouette scores for each cluster, helping to evaluate the compactness and separation of clusters. The cluster with the largest average silhouette score is better defined.
Silhouette Plot Visualization
The silhouette plot provides:
- Bars: Represent silhouette scores for individual points, grouped by clusters.
- Dashed Line: Represents the average silhouette score for each cluster.
This plot is useful to evaluate the clustering quality visually.
Notes
- The silhouette score for a single point is:
  - Close to 1: Well-clustered.
  - Close to 0: Overlapping clusters.
  - Negative: Misclassified point.
- Custom distance metrics can be specified via the Distance parameter (e.g., Minkowski, Manhattan, etc.).
- For a quick cross-check against a standard implementation, see the sketch below.
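The cross-check mentioned above can be done with scikit-learn's silhouette_samples, which computes the same per-point scores for the Euclidean case; this snippet is independent of pyEDAkit.
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = rng.random((20, 2))
labels = rng.integers(0, 3, size=20)

# Per-point silhouette scores with Euclidean distance,
# comparable to the default output of silhouette(X, labels)
s_ref = silhouette_samples(X, labels, metric='euclidean')
print(s_ref)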
Silhouette Evaluation Example
The Silhouette Evaluation example demonstrates how to use the SilhouetteEvaluation class to determine the optimal number of clusters (k) for a given dataset using the silhouette criterion. The class evaluates various cluster solutions and identifies the best one based on the silhouette measure.
Example Code:
from pyEDAkit.clustering import SilhouetteEvaluation
import numpy as np
from numpy.random import default_rng
def test_eval_silhouette():
    rng = default_rng(42)
    n = 200
    X1 = rng.multivariate_normal(mean=[2, 2], cov=[[0.9, -0.0255], [-0.0255, 0.9]], size=n)
    X2 = rng.multivariate_normal(mean=[5, 5], cov=[[0.5, 0], [0, 0.3]], size=n)
    X3 = rng.multivariate_normal(mean=[-2, -2], cov=[[1, 0], [0, 0.9]], size=n)
    X = np.vstack([X1, X2, X3])

    # Evaluate the silhouette criterion for k = 1..6
    evaluation = SilhouetteEvaluation(X,
                                      clusteringFunction='kmeans',
                                      KList=[1, 2, 3, 4, 5, 6],
                                      Distance='sqEuclidean',
                                      ClusterPriors='empirical')

    print("Inspected K: ", evaluation.InspectedK)
    print("Criterion Values: ", evaluation.CriterionValues)
    print("OptimalK: ", evaluation.OptimalK)

    # Optional: plot the silhouette criterion
    evaluation.plot()

    # The best cluster solution's assignment:
    best_clusters = evaluation.OptimalY
    print("Shape of best cluster assignment:", best_clusters.shape)
    print("Some cluster labels:", np.unique(best_clusters[~np.isnan(best_clusters)]))


test_eval_silhouette()
Output and Plot
- Inspected K: The list of cluster numbers (k) evaluated.
- Criterion Values: Silhouette scores for each k. Higher values indicate better cluster separation.
- Optimal K: The k with the highest silhouette score.
Example Output:
Inspected K: [1 2 3 4 5 6]
Criterion Values: [nan 0.65458239 0.67860193 0.54828688 0.45514697 0.33797165]
OptimalK: 3
Shape of best cluster assignment: (600,)
Some cluster labels: [1. 2. 3.]
Plot:
The silhouette values vs. the number of clusters are visualized in the plot below:
How It Works
- Cluster Evaluation:
  - The SilhouetteEvaluation class evaluates clusters for different k (e.g., 1 to 6) using the silhouette criterion. The silhouette score is computed for each data point as:
    $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$
    where:
    - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
    - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).
- Optimal Number of Clusters:
  - The silhouette score is calculated for each k.
  - The OptimalK property identifies the k with the maximum score.
- Cluster Priors:
  - 'empirical': Weighted by cluster sizes.
  - 'equal': Equal weight for all clusters.
- Visualization:
  - The plot() method shows the silhouette values for each k, aiding in identifying the best clustering solution.
This example demonstrates the importance of silhouette analysis for optimal clustering and provides an easy-to-follow implementation of the process.
Dimensionality Reduction Examples
This section provides examples demonstrating how to use various dimensionality reduction techniques and their visualizations with the pyEDAkit library. The dataset used in all examples is the Iris dataset, which includes four features (sepal_length, sepal_width, petal_length, and petal_width) and classes representing three flower species (Iris-setosa, Iris-versicolor, and Iris-virginica).
Dataset Preparation
from pyEDAkit import linear as eda_lin
import pandas as pd
import numpy as np
# Make sure to import the dataset from your local directory
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
# Extract features and target labels
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
y = df[['class']].to_numpy()[:,0]
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
1. Principal Component Analysis (PCA)
Principal Component Analysis reduces the dataset’s dimensionality by projecting it onto a lower-dimensional subspace while retaining as much variance as possible.
eda_lin.PCA(X, d=3, plot=True)
Output (plot=True)
- Explanation:
- The data is reduced to 3 principal components.
- The plot visualizes the variance explained by each component.
- Useful for understanding which components capture the most variance.
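For reference, the variance figures behind the scree plot can be reproduced independently with scikit-learn on the same feature matrix X from the dataset preparation above; this is a cross-check, not the pyEDAkit call.
from sklearn.decomposition import PCA as SkPCA

# Reference computation with scikit-learn on the same Iris feature matrix X
pca_ref = SkPCA(n_components=3)
scores = pca_ref.fit_transform(X)

print("Explained variance ratio per component:", pca_ref.explained_variance_ratio_)
print("Projected data shape:", scores.shape)  # (n_samples, 3)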
2. Singular Value Decomposition (SVD)
SVD decomposes the matrix without explicitly calculating the covariance matrix. It provides an efficient way to find the principal components.
eda_lin.SVD(X, plot=True)
Output (plot=True)
- Explanation:
- Displays the singular values, which indicate the importance of each dimension.
- Useful for identifying the rank and energy captured by the decomposition.
3. Non-negative Matrix Factorization (NMF)
NMF decomposes a non-negative matrix into two smaller non-negative matrices. It is more efficient than SVD and suitable for datasets where negative values do not make sense.
eda_lin.NMF(X, d=4, plot=True)
Output (plot=True)
- Explanation:
- Decomposes the data into 4 components.
- The plot shows the contribution of each component to the original dataset.
4. Factor Analysis (FA)
Factor Analysis reduces dimensionality by relating each original variable to a smaller set of factors while adding small error values to allow flexibility.
eda_lin.FA(X, d=3, plot=True)
Output (plot=True)
- Explanation:
- Reduces the data to 3 factors.
- Shows how the factors contribute to explaining the dataset's variance.
5. Linear Discriminant Analysis (LDA)
LDA projects the data onto a line that maximizes class separability, making it a supervised dimensionality reduction method.
eda_lin.LDA(X, y, plot=True)
Output (plot=True)
- Explanation:
- Projects the data into a single line for maximum class separation.
- Visualization highlights how well the classes are separated along the discriminant axis.
6. Random Projection
Random Projection projects the data points into a random subspace while preserving the distances between points. It is computationally efficient and effective.
eda_lin.RandProj(X, d=3, plot=True)
Output (plot=True)
- Explanation:
- Projects the data into a random 3-dimensional subspace.
- The plot visualizes the points in the new subspace, demonstrating that the relative distances between points are preserved.
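The distance-preservation claim can be verified directly with scikit-learn's GaussianRandomProjection; the sketch below is an independent illustration on the same X, not the eda_lin.RandProj call.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from scipy.spatial.distance import pdist

# Project the Iris features into a random 3-dimensional subspace
rp = GaussianRandomProjection(n_components=3, random_state=0)
X_proj = rp.fit_transform(X)

# Compare pairwise distances before and after projection
d_orig = pdist(X)
d_proj = pdist(X_proj)
print("Correlation between original and projected distances:",
      np.corrcoef(d_orig, d_proj)[0, 1])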
Notes:
- Visualization: Each method generates a plot when plot=True, helping users visually interpret the results of the dimensionality reduction.
- Flexibility: The d parameter in most functions controls the number of reduced dimensions.
- Interactivity: The methods allow exploration of how different dimensionality reduction techniques work with the same dataset.
Examples: Intrinsic Dimensionality Estimation with Code and Visualizations
This section provides examples of intrinsic dimensionality estimation for various datasets. The examples include:
- 3D Scene Analysis
- 1D Helix
- 3D Helix
Each example includes code snippets, results, and visualizations.
1. 3D Scene Analysis
In this example, we generate a synthetic 3D scene with varying intrinsic dimensions. We estimate the intrinsic dimensionality for each point using the MLE
method and visualize the results.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import pandas as pd
from pyEDAkit.IntrinsicDimensionality import MLE
from generate_data import generate_scene
print('--------- Scene ---------')
X = generate_scene(plot=False)
# Initialize nearest neighbors
nbrs = NearestNeighbors(n_neighbors=100, algorithm='auto').fit(X)
_, indices = nbrs.kneighbors(X)
# Calculate intrinsic dimension for each point
intrinsic_dimensions = []
for idx in range(len(X)):
    neighbors = X[indices[idx]]
    dim = MLE(neighbors)
    dim = np.round(dim)
    intrinsic_dimensions.append(dim)
# Replace values > 3 with 4
intrinsic_dimensions = np.array(intrinsic_dimensions)
intrinsic_dimensions[intrinsic_dimensions > 3] = 4
# Calculate percentage of each intrinsic dimension
unique, counts = np.unique(intrinsic_dimensions, return_counts=True)
percentages = (counts / len(intrinsic_dimensions)) * 100
percentage_table = pd.DataFrame({
'Intrinsic Dimension': unique,
'Count': counts,
'Percentage (%)': percentages
})
# Display the percentage table
print("Intrinsic Dimension Percentage Table:")
print(percentage_table)
# Plot the 3D graph with intrinsic dimensions as colors
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=intrinsic_dimensions, cmap='viridis', s=10)
ax.set_title('3D Scatter Plot Colored by Local Intrinsic Dimension (k=100)')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
colorbar = fig.colorbar(scatter, ax=ax, label='Intrinsic Dimension')
plt.show()
Results:
- Percentage Table:

Intrinsic Dimension | Count | Percentage (%)
---|---|---
1.0 | 3912 | 65.20
2.0 | 1071 | 17.85
3.0 | 1017 | 16.95
Visualization:
2. 1D Helix
This example demonstrates intrinsic dimensionality estimation for a 1D helix dataset using multiple methods.
Code:
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_1D_helix
print('--------- 1D helix ---------')
X = generate_1D_helix(500, plot=True)
idhat = id_pettis(X)
print("Pettis:", idhat)
idhat = corr_dim(X)
print("CorrDim:", idhat)
idhat = MLE(X)
print("MLE:", idhat)
idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
Results:
Method | Estimated Intrinsic Dimension |
---|---|
Pettis | 1.12 |
CorrDim | 1.05 |
MLE | 1.02 |
PackingNumbers | 0.97 |
Visualization:
3. 3D Helix
Finally, we estimate intrinsic dimensionality for a 3D helix dataset.
Code:
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_3D_helix
print('--------- 3D helix ---------')
X, _ = generate_3D_helix(2000, 0.05, plot=True)
idhat = id_pettis(X)
print("Pettis:", idhat)
idhat = corr_dim(X)
print("CorrDim:", idhat)
idhat = MLE(X)
print("MLE:", idhat)
idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
Results:
Method | Estimated Intrinsic Dimension |
---|---|
Pettis | 2.97 |
CorrDim | 1.92 |
MLE | 2.24 |
PackingNumbers | 1.45 |
Visualization:
These examples illustrate the application of various intrinsic dimensionality estimation methods to datasets with different geometries. The visualizations help validate the results by showing dimensionality estimates in their natural geometric contexts.
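For readers who want to see what the MLE estimator is doing, the sketch below implements one common formulation, the Levina-Bickel nearest-neighbor estimator, averaged over all points. It is an illustration under that assumption; pyEDAkit's MLE may differ in its averaging or bias-correction details.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nbrs.kneighbors(X)
    dist = dist[:, 1:]  # drop the zero distance of each point to itself

    # Per-point estimate: inverse mean log-ratio of the k-th distance to the closer ones
    log_ratios = np.log(dist[:, -1][:, None] / dist[:, :-1])
    m_hat = (k - 1) / np.sum(log_ratios, axis=1)
    return float(np.mean(m_hat))

# A noisy circle embedded in 2D should come out close to intrinsic dimension 1
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 1000)
circle = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(1000, 2))
print(mle_intrinsic_dim(circle, k=10))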
Normalization Example
This example demonstrates the usage of the normalization functions implemented in pyEDAkit, comparing the results with standard libraries like scikit-learn. These functions include z-score normalization (with and without mean centering), min-max normalization, and sphering. Visualizations and outputs illustrate their effects on the Iris dataset.
Dataset Overview
We use the Iris dataset, focusing on the sepal_length and petal_length features for visualization. The dataset contains three classes: Iris-setosa, Iris-versicolor, and Iris-virginica.
Original Data
The original data is plotted to show the unnormalized feature values.
Z-Scores with Zero Mean
Z-score normalization scales data to have a standard deviation of 1 and a mean of 0.
- Implementation Comparison: Results are identical to scikit-learn's StandardScaler (with default settings).
Output:
Is z_scores_zero_mean allclose to sklearn: True
std: [1. 1.]
mean: [ 2.38437160e-16 -9.53748639e-17]
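The zero-mean case is, in effect, the classic z-score; a minimal numpy sketch of that equivalence, using StandardScaler only as the reference (illustration, not the pyEDAkit implementation):
import numpy as np
from sklearn.preprocessing import StandardScaler

# Classic z-score: subtract column means, divide by column standard deviations
def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.default_rng(0).random((10, 2))
print(np.allclose(zscore(X), StandardScaler().fit_transform(X)))  # True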
Z-Scores Without Zero Mean
Z-score normalization scales data to have a standard deviation of 1 but does not subtract the mean.
- Implementation Comparison: Matches scikit-learn's StandardScaler (with with_mean=False).
Output:
Is z_scores_not_zero_mean allclose to sklearn: True
std: [1. 1.]
mean: [7.08193195 2.15226003]
Min-Max Normalization
Min-max normalization scales data to fit within the range [0, 1].
- Implementation Comparison: Results are identical to scikit-learn's MinMaxScaler.
Output:
Is min-max norm allclose to sklearn: True
std: [0.22939135 0.29724345]
mean: [0.43008949 0.47025367]
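The underlying arithmetic is a per-column rescaling; a short sketch with MinMaxScaler as the reference (again illustrative, not the pyEDAkit code):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rescale each column to the range [0, 1]
def min_max(X):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.random.default_rng(1).random((10, 2))
print(np.allclose(min_max(X), MinMaxScaler().fit_transform(X)))  # True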
Sphering
Sphering, also known as whitening, removes correlations between features and scales them to have unit variance.
- Implementation Comparison: Similar to scikit-learn's PCA(whiten=True), but results differ by possible rotations or reflections. These differences are expected.
Output:
Is sphering allclose to sklearn: False
std: [0.99663865 0.99663865]
mean: [-5.42444538e-16 3.05497611e-17]
Rotation-tolerant match for sphering vs PCA whiten: True
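Sphering itself can be written in a few lines from the eigendecomposition of the covariance matrix (ZCA-style whitening). The sketch below illustrates the idea under that assumption; it is not necessarily the exact transform pyEDAkit applies, which is why a rotation or reflection may separate it from PCA whitening.
import numpy as np

def sphere(X):
    """Whiten X: zero mean and (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T  # Sigma^(-1/2)
    return Xc @ W

X = np.random.default_rng(2).normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
Z = sphere(X)
print(np.round(np.cov(Z, rowvar=False), 3))  # approximately the identity matrix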
Visualizations:
-
Sphering (pyEDAkit):
-
Sphering (PCA Whiten):
Code
Below is the complete code used in this example:
from pyEDAkit import standardization as eda_std
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Scatter plot function
def scatter_plot(x, y, targets, title='Title', class_names=None):
    plt.figure()
    unique_labels = np.unique(targets)
    for lbl in unique_labels:
        mask = (targets == lbl)
        label_str = class_names[lbl] if class_names else f"Class {lbl}"
        plt.scatter(x[mask], y[mask], s=10, label=label_str)
    plt.axhline(0, color='gray', linestyle='--')
    plt.axvline(0, color='gray', linestyle='--')
    plt.title(title)
    plt.legend()
    plt.draw()
# Load Iris dataset
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
sp_df = df[['sepal_length', 'petal_length']].to_numpy()
y = df['class'].to_numpy()
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
# Normalization techniques and visualization
scatter_plot(df['sepal_length'], df['petal_length'], y, title='Original data', class_names=y_names)
# Z-scores with mean 0
z_scores_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=True)
scatter_plot(z_scores_zero_mean[:, 0], z_scores_zero_mean[:, 1], y, title='z-scores with mean 0', class_names=y_names)
# Z-scores without mean 0
z_scores_not_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=False)
scatter_plot(z_scores_not_zero_mean[:, 0], z_scores_not_zero_mean[:, 1], y, title='z-scores with NOT mean 0', class_names=y_names)
# Min-max normalization
Z_minmax = eda_std.min_max_norm(sp_df)
scatter_plot(Z_minmax[:, 0], Z_minmax[:, 1], y, title='min-max normalization', class_names=y_names)
# Sphering
Z_sphere = eda_std.sphering(sp_df)
scatter_plot(Z_sphere[:, 0], Z_sphere[:, 1], y, title='Sphering (pyEDAkit)', class_names=y_names)
pca = PCA(whiten=True)
pca_data = pca.fit_transform(sp_df)
scatter_plot(pca_data[:, 0], pca_data[:, 1], y, title='Sphering (PCA whiten)', class_names=y_names)
Dependencies
This repository requires the following Python libraries:
numpy>=1.21.0
scipy>=1.7.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0
networkx>=2.5
pandas>=1.3.0
Contributing
Feel free to fork this repository, report issues, or contribute by adding new MATLAB-style functions or improving existing ones.
License
This project is licensed under the MIT License. See LICENSE for more details.