A Python package for automatic EDA, data cleaning, and visualization.

pydatalens

pydatalens is a Python package that streamlines exploratory data analysis (EDA), data cleaning, and visualization. It helps data scientists and analysts prepare, explore, and summarize datasets with only a few function calls.


Features

1. Smart Summarization

  • Automatically generates a summary of the dataset, including:
    • Data types
    • Missing values
    • Descriptive statistics
    • Unique value counts
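To make the listed summary items concrete, here is a minimal sketch in plain pandas of what such an overview can contain. It is not the pydatalens implementation: `summarize_sketch` and the sample columns are hypothetical names chosen for illustration.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a unique value
    })


# Small made-up dataset with one gap in each column
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Paris", "Oslo", "Oslo", None],
})

overview = summarize_sketch(df)
stats = df.describe()  # descriptive statistics for the numeric columns

print(overview)
print(stats)
```

The real `eda.summarize` may return a different structure; the sketch only shows which pandas calls produce the four listed pieces of information.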

2. Data Cleaning

  • Detects and handles missing values using various strategies (mean, median, mode).
  • Identifies and removes duplicate rows.
  • Basic outlier detection is planned for a future update.
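The mean, median, and mode strategies can be sketched with plain pandas as follows. This is an illustration under assumptions, not the package's code: `fill_missing_sketch` is a hypothetical helper, and leaving non-numeric columns untouched for mean and median is a design choice of this sketch.

```python
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Hypothetical helper: fill missing values column by column."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if strategy == "mean" and numeric:
            fill = out[col].mean()
        elif strategy == "median" and numeric:
            fill = out[col].median()
        elif strategy == "mode":
            modes = out[col].mode()
            if modes.empty:
                continue  # column is entirely missing
            fill = modes.iloc[0]
        else:
            continue  # mean/median are undefined for non-numeric columns
        out[col] = out[col].fillna(fill)
    return out


df = pd.DataFrame({
    "age": [20.0, None, 40.0, 40.0],
    "city": ["Oslo", None, "Oslo", "Oslo"],
})

filled = fill_missing_sketch(df, strategy="median")
deduped = filled.drop_duplicates()  # keeps the first copy of each repeated row
print(deduped)
```

With the median strategy the text column keeps its gap; switching to `strategy="mode"` would fill both columns with their most frequent value.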

3. Correlation Analysis

  • Generates a correlation matrix to identify relationships between features.
  • Provides heatmaps for better visualization.
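A correlation matrix of this kind can be computed directly with pandas. The sketch below uses made-up columns and does not call pydatalens itself.

```python
import pandas as pd

# Made-up data: weight grows linearly with height, label is non-numeric
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
    "label": ["a", "b", "c", "d"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = df.select_dtypes(include="number").corr()
print(corr)
```

A matrix like `corr` is the usual input for a heatmap, for example `seaborn.heatmap(corr, annot=True)`.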

4. Automatic Visualizations

  • Supports generating:
    • Histograms
    • Box plots
    • Correlation heatmaps
  • Scatter plots are planned for a future update.
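For reference, a histogram and a box plot of a single column can be drawn with matplotlib alone, roughly as below. This is a sketch under assumptions rather than the package's plotting code: the `age` data and the output file name `age_plots.png` are made up.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"age": [22, 25, 31, 35, 35, 41, 58]})
values = df["age"].dropna().to_numpy()

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: counts per bin and the bin edges are returned for inspection
counts, bin_edges, _ = ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram of age")

# Box plot: median, quartiles, and points beyond the whiskers
ax_box.boxplot(values)
ax_box.set_title("Box plot of age")

fig.tight_layout()
fig.savefig("age_plots.png")
```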

5. Report Generation

  • Exports EDA results and visualizations into a detailed HTML report for easy sharing.
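One simple way to assemble such a report is to render pandas tables with `DataFrame.to_html` and wrap them in a page. The sketch below illustrates that idea only: `build_report_sketch` and the file name `eda_report.html` are hypothetical, and the package's own report format may differ.

```python
import pandas as pd


def build_report_sketch(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Hypothetical helper: return a standalone HTML page with summary tables."""
    sections = [
        ("Descriptive statistics", df.describe().to_html()),
        ("Missing values", df.isna().sum().to_frame("missing").to_html()),
    ]
    body = "\n".join(f"<h2>{name}</h2>\n{table}" for name, table in sections)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n{body}\n</body></html>"
    )


df = pd.DataFrame({"age": [25, 32, None], "score": [0.5, 0.7, 0.9]})
html = build_report_sketch(df)

with open("eda_report.html", "w", encoding="utf-8") as fh:
    fh.write(html)
```

Saved figures can be added to the same page with ordinary `<img>` tags.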

Installation

Using pip (from source)

  1. Clone the repository:
    git clone https://github.com/gopalakrishnanarjun/pydatalens.git
    cd pydatalens
    
  2. Install the package:
    pip install -e .
    

Dependencies

  • Python >= 3.6
  • pandas >= 1.0
  • numpy >= 1.18
  • matplotlib >= 3.1
  • seaborn >= 0.11

Install dependencies manually:

pip install pandas numpy matplotlib seaborn

Quick Start

1. Import the package

from pydatalens import eda, cleaning, visualizations

2. Load a dataset

import pandas as pd
df = pd.read_csv("your_dataset.csv")

3. Summarize the dataset

print(eda.summarize(df))

4. Handle missing values

df_cleaned = cleaning.handle_missing(df, strategy="mean")

5. Visualize the data

# Replace "age" with the name of a numeric column in your dataset
visualizations.plot_histogram(df_cleaned, column="age")
visualizations.correlation_heatmap(df_cleaned)

Examples

Summarizing the Data

from pydatalens import eda
summary = eda.summarize(df)
print(summary)

Cleaning the Data

from pydatalens import cleaning
# Handle missing values with the median strategy, then remove duplicate rows
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)

Visualizing the Data

from pydatalens import visualizations
# "column_name" is a placeholder for one of your own columns
visualizations.plot_histogram(df, column="column_name")
visualizations.correlation_heatmap(df)

Future Enhancements

  • Advanced anomaly detection.
  • Support for time series analysis.
  • Enhanced visualization options (e.g., scatter plots, pair plots).
  • Integration with machine learning pipelines.

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.


License

pydatalens is licensed under the MIT License. See the LICENSE file for more details.

