
A Python package for automatic EDA, data cleaning, and visualization.

pydatalens

pydatalens is a Python package for Exploratory Data Analysis (EDA), data cleaning, and visualization. It lets data scientists and analysts prepare, explore, and summarize a dataset in a few function calls.


Features

1. Smart Summarization

  • Automatically generates a summary of the dataset, including:
    • Data types
    • Missing values
    • Descriptive statistics
    • Unique value counts
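
To make the items above concrete, here is a minimal sketch of such a summary written with plain pandas. It is not pydatalens's implementation; the function name `summarize_sketch` and the sample data are hypothetical, and the sketch assumes the frame has at least one numeric column.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, missing count, unique count, basic stats."""
    overview = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
    })
    # describe() covers numeric columns only by default; transpose so each
    # row corresponds to one column of the original frame.
    stats = df.describe().T[["mean", "std", "min", "max"]]
    # Left join: non-numeric columns keep NaN in the statistics columns.
    return overview.join(stats)


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Lima", "Lima", None],
})
summary = summarize_sketch(df)
print(summary)
```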

2. Data Cleaning

  • Detects missing values and fills them using a chosen strategy (mean, median, or mode).
  • Identifies and removes duplicate rows.
  • Basic outlier detection is planned for a future update.
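
The imputation strategies named above can be sketched with plain pandas as follows. This is an illustration of the technique, not the package's code: `impute_sketch` is a hypothetical name, and falling back to the mode for non-numeric columns is an assumption made here, not documented pydatalens behavior.

```python
import pandas as pd


def impute_sketch(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of df with missing values filled column by column."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if strategy == "mean" and numeric:
            fill = out[col].mean()
        elif strategy == "median" and numeric:
            fill = out[col].median()
        else:
            # Mode works for any dtype; it is also the fallback (an
            # assumption of this sketch) for non-numeric columns.
            modes = out[col].mode()
            if modes.empty:  # column is entirely missing
                continue
            fill = modes.iloc[0]
        out[col] = out[col].fillna(fill)
    return out


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 40.0],
    "city": ["Lima", "Lima", None, "Lima"],
})
# Fill gaps first, then drop rows that became exact duplicates.
cleaned = impute_sketch(df, strategy="mean").drop_duplicates()
print(cleaned)
```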

3. Correlation Analysis

  • Generates a correlation matrix to identify relationships between features.
  • Renders the correlation matrix as a heatmap.
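
As a rough picture of what a correlation matrix and heatmap involve, the sketch below computes Pearson correlations with pandas and draws them with matplotlib. It does not call pydatalens; the sample data and the output file name are made up for the example, and `seaborn.heatmap` could replace the manual `imshow` drawing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],    # perfectly correlated with x
    "z": [5, 4, 3, 2, 1],     # perfectly anti-correlated with x
    "label": list("abcde"),   # non-numeric, excluded below
})

# Pearson correlation over the numeric columns only.
corr = df.select_dtypes("number").corr()

# Draw the matrix as a colored grid with a fixed -1..1 color scale.
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)
fig.savefig("correlation_heatmap.png")
```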

4. Automatic Visualizations

  • Supports generating:
    • Histograms
    • Box plots
    • Correlation heatmaps
    • Scatter plots (planned for a future update; not yet supported)
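
For reference, a histogram and a box plot of one column can be produced directly with matplotlib, as in the sketch below. This shows the kind of figure involved rather than how pydatalens draws it; the `age` values and the file name are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"age": [22, 25, 25, 31, 38, 40, 41, 67]})
values = df["age"].dropna()  # plotting functions do not handle NaN well

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# hist() returns the per-bin counts and the bin edges it used.
counts, edges, _ = ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram of age")

# The box plot shows the median, quartiles, and points beyond the whiskers.
ax_box.boxplot(values)
ax_box.set_title("Box plot of age")

fig.tight_layout()
fig.savefig("age_distribution.png")
```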

5. Report Generation

  • Exports EDA results and visualizations into a detailed HTML report for easy sharing.
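
One simple way to assemble such a report is to render pandas tables to HTML and wrap them in a page, as sketched below. The function `build_report_sketch`, the section titles, and the output file name are hypothetical; the package's actual report layout and API are not described in this README.

```python
import pandas as pd


def build_report_sketch(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Assemble a small standalone HTML page from pandas tables."""
    sections = {
        "Descriptive statistics": df.describe().to_html(),
        "Missing values": df.isna().sum().to_frame("missing").to_html(),
    }
    body = "\n".join(
        f"<h2>{name}</h2>\n{table}" for name, table in sections.items()
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n{body}\n</body></html>"
    )


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"age": [25, 32, None, 41], "score": [0.5, 0.7, 0.9, None]})
html = build_report_sketch(df)
with open("eda_report.html", "w", encoding="utf-8") as handle:
    handle.write(html)
```

Plots could be added to the same page by saving each figure to an image file and referencing it with an `<img>` tag.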

Installation

Using pip (from source)

  1. Clone the repository:
    git clone https://github.com/gopalakrishnanarjun/pydatalens.git
    cd pydatalens
    
  2. Install the package:
    pip install -e .
    

Dependencies

  • Python >= 3.6
  • pandas >= 1.0
  • numpy >= 1.18
  • matplotlib >= 3.1
  • seaborn >= 0.11

Install dependencies manually:

pip install pandas numpy matplotlib seaborn
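
To check which of these dependencies are already installed, and at what version, a short standard-library script such as the following may help. It is a convenience sketch, not part of pydatalens, and it relies on `importlib.metadata`, which requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```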

Quick Start

1. Import the package

from pydatalens import eda, cleaning, visualizations

2. Load a dataset

import pandas as pd
df = pd.read_csv("your_dataset.csv")

3. Summarize the dataset

print(eda.summarize(df))

4. Handle missing values

df_cleaned = cleaning.handle_missing(df, strategy="mean")

5. Visualize the data

visualizations.plot_histogram(df_cleaned, column="age")
visualizations.correlation_heatmap(df_cleaned)

Examples

Summarizing the Data

from pydatalens import eda
summary = eda.summarize(df)
print(summary)

Cleaning the Data

from pydatalens import cleaning
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)

Visualizing the Data

from pydatalens import visualizations
visualizations.plot_histogram(df, "column_name")
visualizations.correlation_heatmap(df)

Future Enhancements

  • Advanced anomaly detection.
  • Support for time series analysis.
  • Enhanced visualization options (e.g., scatter plots, pair plots).
  • Integration with machine learning pipelines.

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.


License

pydatalens is licensed under the MIT License. See the LICENSE file for more details.

