A Python package for automatic EDA, data cleaning, and visualization.
pydatalens
pydatalens is a Python package that streamlines Exploratory Data Analysis (EDA), data cleaning, and visualization. It helps data scientists and analysts prepare, explore, and draw insights from datasets with only a few function calls.
Features
1. Smart Summarization
- Automatically generates a summary of the dataset, including:
- Data types
- Missing values
- Descriptive statistics
- Unique value counts
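To make the summary items above concrete, here is a minimal pandas-only sketch of how such a table could be assembled. The helper name `summarize_sketch` and the sample data are hypothetical; this is not the pydatalens implementation, and the output of `eda.summarize` may look different.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column summary table (hypothetical helper, not pydatalens API)."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # data type of each column
        "missing": df.isna().sum(),       # number of missing values
        "unique": df.nunique(),           # distinct non-null values
    })
    # describe() covers numeric columns only here; other columns get NaN after the join.
    stats = df.describe().T[["mean", "std", "min", "max"]]
    return summary.join(stats)


# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Pune", "Delhi", "Pune", None],
})
result = summarize_sketch(df)
print(result)
```

The sketch assumes the DataFrame has at least one numeric column; a frame with only text columns would need a different `describe()` selection.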
2. Data Cleaning
- Detects and handles missing values using various strategies (mean, median, mode).
- Identifies and removes duplicate rows.
- Basic outlier detection is planned for a future update.
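The mean, median, and mode strategies and duplicate removal can be illustrated with plain pandas. The sketch below is an assumption about what such a cleaning step might do, not the code behind `cleaning.handle_missing`; the helper `fill_missing_sketch` and its fallback to the mode for non-numeric columns are hypothetical choices.

```python
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Fill missing values column by column (hypothetical helper, not pydatalens API)."""
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if strategy == "mode" or not numeric:
            # Mean and median are undefined for text, so fall back to the mode.
            modes = out[col].mode()
            if modes.empty:
                continue  # column is entirely missing; leave it untouched
            fill = modes.iloc[0]
        elif strategy == "mean":
            fill = out[col].mean()
        else:
            fill = out[col].median()
        out[col] = out[col].fillna(fill)
    return out


# Hypothetical sample: one row with missing values and one duplicated row.
df = pd.DataFrame({
    "age": [25, None, 40, 40],
    "city": ["Pune", None, "Delhi", "Delhi"],
})
cleaned = fill_missing_sketch(df, strategy="mean")
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Filling before dropping duplicates is a design choice in this sketch: rows that become identical after imputation are also removed.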
3. Correlation Analysis
- Generates a correlation matrix to identify relationships between features.
- Provides heatmaps for better visualization.
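As an illustration of the correlation step, the following sketch computes a Pearson correlation matrix with pandas and lists the strongly related column pairs. The column names, the sample values, and the 0.8 threshold are assumptions made for the example; pydatalens may compute or present this differently.

```python
import pandas as pd

# Hypothetical sample data; weight_kg is an exact linear function of height_cm.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 66, 74, 82],
    "shoe_size": [44, 40, 43, 39, 42],
    "name": list("abcde"),
})

# Keep numeric columns only; Pearson correlation is not defined for text.
corr = df.select_dtypes("number").corr()
print(corr.round(2))

# Collect each pair of distinct columns whose absolute correlation passes the threshold.
threshold = 0.8
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
print(pairs)
```

A matrix like `corr` can be passed to `seaborn.heatmap(corr, annot=True)` to draw a heatmap.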
4. Automatic Visualizations
- Supports generating:
- Histograms
- Box plots
- Correlation heatmaps
- Scatter plots (not yet available; planned for a future update).
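For reference, a histogram and a box plot of a single column can be drawn directly with matplotlib as sketched below. This is not the code behind `visualizations.plot_histogram`; the sample data, bin count, and output file name `age_plots.png` are assumptions for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample column with one extreme value (90).
df = pd.DataFrame({"age": [22, 25, 25, 31, 38, 41, 47, 52, 90]})
values = df["age"].dropna()

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of the column across 5 equal-width bins.
counts, edges, _ = ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram of age")

# Box plot: median, quartiles, and points beyond the whiskers.
ax_box.boxplot(values)
ax_box.set_title("Box plot of age")

fig.tight_layout()
fig.savefig("age_plots.png")
```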
5. Report Generation
- Exports EDA results and visualizations into a detailed HTML report for easy sharing.
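One way such an HTML report could be assembled is by converting pandas tables with `DataFrame.to_html` and joining them into a page. The sketch below is only an assumption about the general approach; the helper `build_report_sketch`, the section names, and the file name `eda_report.html` are hypothetical and do not come from pydatalens.

```python
from html import escape

import pandas as pd


def build_report_sketch(df: pd.DataFrame, title: str = "EDA report") -> str:
    """Return a standalone HTML page as a string (hypothetical helper)."""
    sections = {
        "Descriptive statistics": df.describe().to_html(),
        "Missing values": df.isna().sum().to_frame("missing").to_html(),
    }
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{escape(title)}</title></head><body>",
        f"<h1>{escape(title)}</h1>",
    ]
    for heading, table in sections.items():
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.append(table)  # already HTML produced by pandas
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical sample data.
df = pd.DataFrame({"age": [25, None, 40], "income": [30000, 42000, None]})
report = build_report_sketch(df)
with open("eda_report.html", "w", encoding="utf-8") as fh:
    fh.write(report)
```

Plots could be added to a page like this by saving them as image files and referencing them with `<img>` tags.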
Installation
Using pip (from source)
- Clone the repository:
git clone https://github.com/gopalakrishnanarjun/pydatalens.git
cd pydatalens
- Install the package:
pip install -e .
Dependencies
- Python >= 3.6
- pandas >= 1.0
- numpy >= 1.18
- matplotlib >= 3.1
- seaborn >= 0.11
Install dependencies manually:
pip install pandas numpy matplotlib seaborn
Quick Start
1. Import the package
from pydatalens import eda, cleaning, visualizations
2. Load a dataset
import pandas as pd
df = pd.read_csv("your_dataset.csv")
3. Summarize the dataset
print(eda.summarize(df))
4. Handle missing values
df_cleaned = cleaning.handle_missing(df, strategy="mean")
5. Visualize the data
visualizations.plot_histogram(df_cleaned, column="age")
visualizations.correlation_heatmap(df_cleaned)
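To try the steps above without a dataset at hand, a small synthetic CSV can be generated first. The sketch below uses only pandas and numpy; the column names (`age`, `income`, `city`), the value ranges, and the positions of the missing values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the file is reproducible
n = 50

df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "income": rng.normal(50_000, 12_000, size=n).round(2),
    "city": rng.choice(["Pune", "Delhi", "Chennai"], size=n),
})

# Introduce a few missing values for the cleaning step to handle.
df.loc[[3, 17, 29], "age"] = np.nan
df.loc[[5, 17], "income"] = np.nan

# Append a copy of the first row so there is one duplicate to remove.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

df.to_csv("your_dataset.csv", index=False)
```

The resulting file has an `age` column, so it matches the `column="age"` argument used in step 5.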
Examples
Summarizing the Data
from pydatalens import eda
summary = eda.summarize(df)
print(summary)
Cleaning the Data
from pydatalens import cleaning
df = cleaning.handle_missing(df, strategy="median")
df = cleaning.drop_duplicates(df)
Visualizing the Data
from pydatalens import visualizations
visualizations.plot_histogram(df, "column_name")
visualizations.correlation_heatmap(df)
Future Enhancements
- Advanced anomaly detection.
- Support for time series analysis.
- Enhanced visualization options (e.g., scatter plots, pair plots).
- Integration with machine learning pipelines.
Contributing
Contributions are welcome! If you'd like to contribute, please fork the repository and submit a pull request.
License
pydatalens is licensed under the MIT License. See the LICENSE file for more details.
Download files
File details
Details for the file pydatalens-0.0.4.tar.gz.
File metadata
- Download URL: pydatalens-0.0.4.tar.gz
- Upload date:
- Size: 4.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2cdab3bc9268745ca7b332e967a3841495055c25dda1b7697cc72856a2b072ef |
| MD5 | e44f7e082a74c6382f12b4e9198f3191 |
| BLAKE2b-256 | 0f4075176113dd43c865ed5134ef674cd459f069a795f2e28f4ea4ebf4a7c20c |
File details
Details for the file pydatalens-0.0.4-py3-none-any.whl.
File metadata
- Download URL: pydatalens-0.0.4-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b664b856c265c3d1b991d62f237b5d90d7a22ffa1331f90198d7a4320143694e |
| MD5 | 2d48d7b8e4006dc469c47e24b2e43d90 |
| BLAKE2b-256 | a250ccd5b76ad040d7b0080db241dc8bdc8c888278e2a6e2538747b05c61165d |