
A Python package for automatic EDA, data cleaning, and visualization.


pydatalens

pydatalens is a Python package designed to streamline the process of Exploratory Data Analysis (EDA), data cleaning, and visualization. It enables data scientists and analysts to quickly prepare, explore, and gain insights from datasets with minimal effort.


Features

1. Smart Summarization

  • Automatically generates a summary of the dataset, including:
    • Data types
    • Missing values
    • Descriptive statistics
    • Unique value counts
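
The exact output format of the package's summary is not documented here. As a rough illustration of the information listed above, the following pandas-only sketch builds a comparable per-column overview. `summarize_sketch` and the sample data are hypothetical and are not part of pydatalens.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, missing-value count, unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # missing values are not counted as a value
    })


# Made-up sample data, for illustration only
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Pune", "Delhi", "Pune", None],
})

overview = summarize_sketch(df)
print(overview)
print(df.describe())  # descriptive statistics for the numeric columns
```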

2. Data Cleaning

  • Detects and handles missing values using various strategies (mean, median, mode).
  • Identifies and removes duplicate rows.
  • Basic outlier detection is not yet available; it is planned for a future update.
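
To make the mean, median, and mode strategies concrete, here is a minimal pandas sketch of that kind of imputation, followed by duplicate removal. It is an illustration under assumptions, not the package's implementation: `fill_missing_sketch` is a hypothetical helper, and the choice to leave non-numeric columns untouched for mean and median is made only for this example.

```python
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of df with missing values filled column by column."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if strategy == "mode":
            modes = out[col].mode()
            if modes.empty:
                continue  # column is entirely missing
            fill = modes.iloc[0]
        elif not numeric:
            continue  # mean/median are undefined for non-numeric data
        elif strategy == "mean":
            fill = out[col].mean()
        else:
            fill = out[col].median()
        out[col] = out[col].fillna(fill)
    return out


# Made-up sample data, for illustration only
df = pd.DataFrame({
    "age": [20, 30, None, 30],
    "city": ["A", "B", None, "B"],
})

by_median = fill_missing_sketch(df, strategy="median")
by_mode = fill_missing_sketch(df, strategy="mode")
deduplicated = by_mode.drop_duplicates()  # keeps the first of each repeated row
```

With the median strategy only the numeric `age` column is filled. With the mode strategy both columns are filled, which in this sample makes three rows identical, so `drop_duplicates` leaves two rows.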

3. Correlation Analysis

  • Generates a correlation matrix to identify relationships between features.
  • Provides heatmaps for better visualization.
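
For readers who want to see what a correlation matrix contains, the sketch below computes one directly with pandas on made-up data. It does not use pydatalens, and the column names are hypothetical.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exactly 2 * x, so perfectly correlated with x
    "z": [5, 3, 4, 1, 2],   # tends to fall as x rises
    "label": list("abcde"),  # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Each cell holds the Pearson coefficient for one pair of columns, from -1 to 1. Passing `corr` to `seaborn.heatmap(corr, annot=True)` is one common way to draw such a matrix as a heatmap.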

4. Automatic Visualizations

  • Supports generating:
    • Histograms
    • Box plots
    • Correlation heatmaps
    • Scatter plots (not yet available; planned for a future update)
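
As a point of comparison, a histogram and a box plot of a single column can be drawn with plain matplotlib as sketched below. This is not how pydatalens produces its plots (that is not documented here), and the `age` data is made up.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({"age": [22, 25, 25, 31, 38, 41, 47, 52, 90]})
values = df["age"].dropna()

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram of age")
ax_box.boxplot(values)  # the value 90 is drawn as an outlier point
ax_box.set_title("Box plot of age")
fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```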

5. Report Generation

  • Exports EDA results and visualizations into a detailed HTML report for easy sharing.
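
The report function's name and options are not shown in this README, so the sketch below does not call pydatalens. It only illustrates the general idea of turning summary tables into a shareable HTML page using pandas' `DataFrame.to_html`. `build_report_sketch` is a hypothetical helper.

```python
import html

import pandas as pd


def build_report_sketch(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Assemble a small HTML page from two summary tables."""
    sections = [
        ("Descriptive statistics", df.describe().to_html()),
        ("Missing values", df.isna().sum().to_frame("missing").to_html()),
    ]
    body = "".join(f"<h2>{name}</h2>{table}" for name, table in sections)
    safe_title = html.escape(title)
    return (
        f"<html><head><title>{safe_title}</title></head>"
        f"<body><h1>{safe_title}</h1>{body}</body></html>"
    )


# Made-up sample data, for illustration only
df = pd.DataFrame({"age": [25, 32, None, 41], "income": [40, 52, 61, None]})
report = build_report_sketch(df)
print(report[:80])
```

Writing `report` to a file with an `.html` extension gives a page that opens in any browser.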

Installation

Using pip (from source)

  1. Clone the repository:
    git clone https://github.com/gopalakrishnanarjun/pydatalens.git
    cd pydatalens
    
  2. Install the package:
    pip install -e .
    

Dependencies

  • Python >= 3.6
  • pandas >= 1.0
  • numpy >= 1.18
  • matplotlib >= 3.1
  • seaborn >= 0.11

Install dependencies manually:

pip install pandas numpy matplotlib seaborn
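
To check which versions are actually installed against the minimums listed above, a small standard-library sketch such as the following can help. It assumes Python 3.8 or newer, because `importlib.metadata` is not available in older interpreters. `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("pandas", "numpy", "matplotlib", "seaborn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```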

Quick Start

1. Import the package

from pydatalens import eda, cleaning, visualizations

2. Load a dataset

import pandas as pd
df = pd.read_csv("your_dataset.csv")

3. Summarize the dataset

print(eda.summarize(df))

4. Handle missing values

df_cleaned = cleaning.handle_missing(df, strategy="mean")

5. Visualize the data

visualizations.plot_histogram(df_cleaned, column="age")
visualizations.correlation_heatmap(df_cleaned)

Examples

Summarizing the Data

from pydatalens import eda
summary = eda.summarize(df)
print(summary)

Cleaning the Data

from pydatalens import cleaning
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)

Visualizing the Data

from pydatalens import visualizations
visualizations.plot_histogram(df, "column_name")
visualizations.correlation_heatmap(df)

Future Enhancements

  • Advanced anomaly detection.
  • Support for time series analysis.
  • Enhanced visualization options (e.g., scatter plots, pair plots).
  • Integration with machine learning pipelines.

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.


License

pydatalens is licensed under the MIT License. See the LICENSE file for more details.

