A Python implementation of the KMeans clustering algorithm with support for missing values in the dataset.
KmeansWithNulls (KWN) Clustering for Multivariate Implementation
This repository contains a Python implementation of the KMeans clustering algorithm that supports missing values in multivariate datasets. You can specify the number of clusters, the maximum number of iterations, and the random state for reproducibility.
Features
- KMeans clustering.
- Handles missing data (NaN values) in the dataset.
- Customizable number of clusters and iterations.
- Utilizes NumPy for efficient numerical computations.
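To make the missing-data feature more concrete, here is a minimal sketch of one common way to assign points with NaNs to centroids: compute the squared distance over the observed features only and rescale it by the fraction of features present. This is an assumption for illustration, not necessarily what KmeansWithNulls does internally; the helper names partial_sq_distances and assign_labels are hypothetical.

```python
import numpy as np

def partial_sq_distances(X, centroids):
    """Squared distances that use only the features observed in each row.

    Hypothetical helper (not part of KmeansWithNulls). Each distance is
    rescaled by n_features / n_observed so that rows with gaps remain
    comparable to complete rows.
    """
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    observed = ~np.isnan(X)  # shape (n_samples, n_features)
    # Differences for every (sample, centroid) pair; missing entries contribute 0
    diff = np.where(observed[:, None, :],
                    X[:, None, :] - centroids[None, :, :],
                    0.0)
    sq = (diff ** 2).sum(axis=2)  # shape (n_samples, n_clusters)
    n_obs = observed.sum(axis=1, keepdims=True)
    scale = X.shape[1] / np.maximum(n_obs, 1)  # avoid division by zero
    return sq * scale

def assign_labels(X, centroids):
    """Index of the nearest centroid for each row (hypothetical helper)."""
    return partial_sq_distances(X, centroids).argmin(axis=1)

# Small illustrative dataset with missing entries
data = np.array([[1.0, 2.0], [1.0, np.nan], [np.nan, 4.0], [10.0, 0.0]])
centroids = np.array([[1.0, 2.0], [10.0, 3.0]])
labels = assign_labels(data, centroids)
print(labels)
```

A row with every feature missing gets distance 0 to all centroids in this sketch and therefore falls into cluster 0; a real implementation would likely want to handle that case explicitly.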
Installation
No installation is required. Clone this repository with the following command:
git clone https://github.com/aasedek/KmeansWithNulls.git
Usage
To use the KmeansWithNulls class, import it into your Python script and create an instance of the class. Then call the fit method with your dataset:
from KmeansWithNulls import KmeansWithNulls
import numpy as np
# Example dataset
data = np.array([[1, 2], [1, 4], [1, 0],
                 [10, 2], [10, 4], [10, 0]])
# Initialize the KmeansWithNulls class
kmeans_with_nulls = KmeansWithNulls(n_clusters=2, max_iter=300, random_state=42)
# Fit the model to your data
kmeans_with_nulls.fit(data)
# Predict the clusters
labels = kmeans_with_nulls.predict(data)
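Between assignment steps, a KMeans variant that tolerates NaNs has to average cluster members even when some coordinates are missing. The sketch below shows one plausible update rule, a per-feature mean over the observed values only. It is an illustration under that assumption and is not confirmed to be the method used by fit; the function name update_centroids is hypothetical.

```python
import numpy as np

def update_centroids(X, labels, old_centroids):
    """Recompute centroids as per-feature means over observed values.

    Hypothetical helper (not part of KmeansWithNulls). A centroid
    coordinate keeps its previous value when the cluster is empty or
    when no member has that feature observed.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    new_centroids = np.array(old_centroids, dtype=float)  # work on a copy
    for k in range(new_centroids.shape[0]):
        members = X[labels == k]
        if members.shape[0] == 0:
            continue  # empty cluster: keep the previous centroid
        counts = (~np.isnan(members)).sum(axis=0)  # observed values per feature
        sums = np.nansum(members, axis=0)          # NaNs are treated as 0
        has_data = counts > 0
        new_centroids[k, has_data] = sums[has_data] / counts[has_data]
    return new_centroids

points = np.array([[1.0, 2.0], [1.0, np.nan], [np.nan, 4.0], [10.0, 0.0]])
point_labels = np.array([0, 0, 0, 1])
previous = np.array([[0.0, 0.0], [5.0, 5.0]])
updated = update_centroids(points, point_labels, previous)
print(updated)
```

Dividing np.nansum by an explicit count, rather than calling np.nanmean, avoids the RuntimeWarning that np.nanmean emits when a column contains only NaNs.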
Example:
import numpy as np
import pandas as pd
from KmeansWithNulls import KmeansWithNulls
# Create a synthetic dataset with 100 points and 2 features
np.random.seed(42)
X_demo = np.random.rand(100, 2) * 10 # Scale the features by 10 for better visualization
# Introduce NaN values randomly in the dataset
nan_indices = np.random.choice(np.arange(X_demo.size), replace=False, size=10)
X_demo.ravel()[nan_indices] = np.nan
# Convert the array to a DataFrame for a better view of the NaN values
X_demo_df = pd.DataFrame(X_demo, columns=['Feature1', 'Feature2'])
# Instantiate the KmeansWithNulls class
kmeans_with_nulls = KmeansWithNulls(n_clusters=3)
# Fit the model to the data with NaNs
kmeans_with_nulls.fit(X_demo)
# Now let's predict the cluster labels for the dataset
predictions_with_nulls = kmeans_with_nulls.predict(X_demo)
# Retrieve the fitted centroids
centroids_with_nulls = kmeans_with_nulls.centroids
# Display the centroids and the first few predictions
print("Centroids:\n", centroids_with_nulls)
print("\nFirst 10 Predictions:\n", predictions_with_nulls[:10])
# Print the first few rows of the DataFrame with NaN values
print(X_demo_df.head())
# Plotting results
import matplotlib.pyplot as plt
# Plotting the clusters with nulls
plt.figure(figsize=(8, 6))
# Plot each cluster with available data, ignoring NaNs
for i in range(kmeans_with_nulls.n_clusters):
    # Select data points without NaNs that are assigned to the current cluster
    cluster_data = X_demo[~np.isnan(X_demo).any(axis=1) & (predictions_with_nulls == i)]
    plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {i+1}', s=50)
# Plot the centroids, excluding NaNs for plotting purposes
for idx, centroid in enumerate(centroids_with_nulls):
    if not np.isnan(centroid).any():
        plt.scatter(centroid[0], centroid[1], c='black', s=200, marker='x', label=f'Centroid {idx+1}')
plt.title('K-Means Clustering with Nulls')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
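Because the plot above drops every row that contains a NaN, it can help to check how many points are actually drawn. The following sketch rebuilds a comparable synthetic dataset with NumPy alone and counts complete versus incomplete rows. It uses np.random.default_rng, so the values differ from those produced by np.random.seed(42) in the example; the variable names are illustrative.

```python
import numpy as np

# Rebuild a dataset of the same shape as the example: 100 points, 2 features
rng = np.random.default_rng(42)
X = rng.random((100, 2)) * 10

# Blank out 10 distinct cells, as in the example
nan_cells = rng.choice(X.size, size=10, replace=False)
X.flat[nan_cells] = np.nan

# Rows with no missing value are the ones a scatter plot can show
complete_rows = ~np.isnan(X).any(axis=1)
n_plotted = int(complete_rows.sum())
n_skipped = int((~complete_rows).sum())
print(f"Rows plotted: {n_plotted}, rows skipped because of NaNs: {n_skipped}")
```

With 10 missing cells spread over two features, between 5 and 10 rows are skipped, depending on how often both cells of a row are blanked.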
Contributing
Contributions to improve this implementation are welcome. Before creating a pull request, please ensure your code follows the existing code structure and style.
License
This project is open-sourced under the GNU General Public License v3.0. See the LICENSE file for details.
Contact
For questions and feedback, please reach out to me at arthur.sedek@gmail.com
Acknowledgements
This implementation was inspired by the KMeans algorithm from scikit-learn.
File details
Details for the file kmeanswithnulls-0.1.1-py3-none-any.whl.
File metadata
- Download URL: kmeanswithnulls-0.1.1-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a274f51e1c733e286947aed0d531dc31ea3d90027c5ad710773cb7d355f1dfd7 |
| MD5 | 093371e69a5f468a12efe043475ecbac |
| BLAKE2b-256 | b9afea65db00304bc84bdfe44d1f452e630962f788f586986573fcbee7e32616 |