A Python package for automatic EDA, data cleaning, and visualization.

pydatalens

pydatalens is a Python package for Exploratory Data Analysis (EDA), data cleaning, and visualization. It helps data scientists and analysts prepare, explore, and summarize a dataset with a few function calls.


Features

1. Smart Summarization

  • Automatically generates a summary of the dataset, including:
    • Data types
    • Missing values
    • Descriptive statistics
    • Unique value counts
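
To make the contents of such a summary concrete, here is a minimal sketch written in plain pandas. It is not the pydatalens implementation, and `summarize_frame` is a hypothetical helper name used only for illustration.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: data type, missing count, unique value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # missing values are not counted as a value
    })


# Small made-up dataset for illustration only.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Lima", "Lima", None],
})

overview = summarize_frame(df)
print(overview)
print(df.describe())  # descriptive statistics for the numeric columns
```

Each row of `overview` describes one column of the input, which makes it easy to spot columns with many gaps or very few distinct values.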

2. Data Cleaning

  • Detects missing values and handles them with a selectable strategy (mean, median, or mode).
  • Identifies and removes duplicate rows.
  • Basic outlier detection is planned for a future update.
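
The three strategies differ in which value replaces a gap. The sketch below shows one way to express them in plain pandas, assuming that "handling" means filling each gap with the column's mean, median, or mode. It is an illustration, not the pydatalens source, and `fill_missing` is a hypothetical name.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of df with missing values filled column by column."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not series.isna().any():
            continue  # nothing to fill in this column
        if strategy == "mode":
            modes = series.mode()  # ignores NaN; sorted, so ties pick the first
            if modes.empty:
                continue  # column is entirely missing
            fill = modes.iloc[0]
        elif pd.api.types.is_numeric_dtype(series):
            fill = series.mean() if strategy == "mean" else series.median()
        else:
            continue  # mean and median are undefined for text columns
        out[col] = series.fillna(fill)
    return out


# Made-up data with one gap per column and one repeated row.
df = pd.DataFrame({
    "score": [1.0, None, 3.0, 8.0, 8.0],
    "tag": ["a", None, "a", "b", "b"],
})

filled = fill_missing(df, strategy="median")
deduped = filled.drop_duplicates().reset_index(drop=True)
print(deduped)
```

In this sketch the mean and median strategies leave text columns untouched, so a mixed dataset may need a second pass with the mode strategy.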

3. Correlation Analysis

  • Generates a correlation matrix to identify relationships between features.
  • Renders the correlation matrix as a heatmap so strong relationships stand out.
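
For readers unfamiliar with correlation matrices, the following sketch computes one with pandas alone. It assumes Pearson correlation over numeric columns. The function name `numeric_correlation` is hypothetical and the code does not come from pydatalens.

```python
import pandas as pd


def numeric_correlation(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Pairwise correlation of the numeric columns; other columns are skipped."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr(method=method)


# Made-up data: "score" rises with "hours", "errors" falls with it.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4],
    "score": [2, 4, 6, 8],
    "errors": [4, 3, 2, 1],
    "group": ["a", "b", "a", "b"],
})

corr = numeric_correlation(df)
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship. The resulting frame can be passed to `seaborn.heatmap(corr, annot=True)` to draw it as a heatmap.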

4. Automatic Visualizations

  • Supports generating:
    • Histograms
    • Box plots
    • Correlation heatmaps
    • Scatter plots (planned, not yet available)
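
As a rough picture of the first two plot types, this sketch draws a histogram and a box plot of one column using matplotlib directly. It is an assumption about what such plots might look like, not the package's own plotting code, and `plot_distribution` is a hypothetical name.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend: no display window is required

import matplotlib.pyplot as plt
import pandas as pd


def plot_distribution(df: pd.DataFrame, column: str):
    """Draw a histogram and a box plot of one numeric column side by side."""
    values = df[column].dropna().to_numpy()  # plotting functions reject NaN poorly
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(values, bins=10)
    ax_hist.set_title(f"Histogram of {column}")
    ax_hist.set_xlabel(column)
    ax_box.boxplot(values)
    ax_box.set_title(f"Box plot of {column}")
    fig.tight_layout()
    return fig


# Made-up ages, including one gap and one high value.
df = pd.DataFrame({"age": [22, 25, 25, 31, 38, 41, None, 67]})
fig = plot_distribution(df, "age")
# fig.savefig("age_distribution.png")  # uncomment to write the figure to disk
```

The histogram shows the overall shape of the distribution, while the box plot makes unusually high or low values easier to see.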

5. Report Generation

  • Exports EDA results and visualizations into a detailed HTML report for easy sharing.
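
This README does not show the report API, so the sketch below only illustrates the general idea of turning summary tables into a shareable HTML file with pandas. The function `build_html_report` and the output file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_html_report(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Assemble a small standalone HTML page from two summary tables."""
    overview = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })
    sections = [
        ("Column overview", overview.to_html()),
        ("Descriptive statistics", df.describe().to_html()),
    ]
    body = "\n".join(f"<h2>{heading}</h2>\n{table}" for heading, table in sections)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{title}</title></head>\n"
        f"<body>\n<h1>{title}</h1>\n{body}\n</body></html>\n"
    )


# Made-up data for illustration only.
df = pd.DataFrame({"age": [25, 32, None, 41], "city": ["Oslo", "Lima", "Lima", None]})
html = build_html_report(df)

# Write to the system temp directory so the example leaves no file in the project.
out_path = Path(tempfile.gettempdir()) / "eda_report_sketch.html"
out_path.write_text(html, encoding="utf-8")
print(f"Report written to {out_path}")
```

A full report would also embed the generated figures. This sketch covers only the tabular part.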

Installation

Using pip (from source)

  1. Clone the repository:
    git clone https://github.com/gopalakrishnanarjun/pydatalens.git
    cd pydatalens
    
  2. Install the package:
    pip install -e .
    

Dependencies

  • Python >= 3.6
  • pandas >= 1.0
  • numpy >= 1.18
  • matplotlib >= 3.1
  • seaborn >= 0.11

Install dependencies manually:

pip install pandas numpy matplotlib seaborn
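
To check which versions are actually installed against the minimums listed above, a short script along these lines may help. It uses the standard-library `importlib.metadata` module, which requires Python 3.8 or later. The helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the dependency list in this README.
report = installed_versions(["pandas", "numpy", "matplotlib", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```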

Quick Start

1. Import the package

from pydatalens import eda, cleaning, visualizations

2. Load a dataset

import pandas as pd
df = pd.read_csv("your_dataset.csv")

3. Summarize the dataset

print(eda.summarize(df))

4. Handle missing values

df_cleaned = cleaning.handle_missing(df, strategy="mean")

5. Visualize the data

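# "age" is a placeholder; pass the name of a numeric column in your dataset
visualizations.plot_histogram(df_cleaned, column="age")
# Heatmap of the correlations between columns
visualizations.correlation_heatmap(df_cleaned)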

Examples

Summarizing the Data

from pydatalens import eda
summary = eda.summarize(df)
print(summary)

Cleaning the Data

from pydatalens import cleaning
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)
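
After a cleaning step it can be useful to confirm what changed. The sketch below compares row, missing-value, and duplicate counts before and after cleaning. It uses plain pandas as a stand-in for the cleaning calls above, and `quality_counts` is a hypothetical helper rather than part of pydatalens.

```python
import pandas as pd


def quality_counts(df: pd.DataFrame) -> dict:
    """Count rows, missing cells, and fully duplicated rows."""
    return {
        "rows": len(df),
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Made-up data with one missing value and one repeated row.
before = pd.DataFrame({
    "a": [1.0, None, 3.0, 3.0],
    "b": ["x", "y", "z", "z"],
})

# Stand-in cleaning: median fill for column "a", then drop repeated rows.
after = before.fillna({"a": before["a"].median()}).drop_duplicates()

print("before:", quality_counts(before))
print("after: ", quality_counts(after))
```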

Visualizing the Data

from pydatalens import visualizations
visualizations.plot_histogram(df, "column_name")
visualizations.correlation_heatmap(df)

Future Enhancements

  • Advanced anomaly detection.
  • Support for time series analysis.
  • Enhanced visualization options (e.g., scatter plots, pair plots).
  • Integration with machine learning pipelines.

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.


License

pydatalens is licensed under the MIT License. See the LICENSE file for more details.

