Comprehensive automated CSV data analysis with statistical insights and visualizations
AutoCSV Profiler
A comprehensive toolkit for automated CSV data analysis providing statistical insights, data quality assessment, and interactive visualizations.
Features
- Comprehensive Statistical Analysis: Descriptive statistics, distributions, and data summaries
- Data Quality Assessment: Missing value analysis, outlier detection, and duplicate identification
- Advanced Visualizations: Box plots, histograms, correlation matrices, and KDE plots
- Interactive Reports: HTML reports with detailed insights and recommendations
- Command-Line Interface: Easy-to-use CLI for immediate analysis
- Python API: Programmatic access for integration into data pipelines
Installation
pip install autocsv-profiler
Quick Start
Command Line Usage
# Basic analysis
autocsv-profiler data.csv
# Specify output directory
autocsv-profiler data.csv --output ./my_analysis
# Custom delimiter
autocsv-profiler data.csv --delimiter ";"
Python API Usage
from autocsv_profiler import auto_csv_profiler
# Run comprehensive analysis
auto_csv_profiler.main("data.csv", "output_directory")
# Or import specific functions
from autocsv_profiler.recognize_delimiter import detect_delimiter
delimiter = detect_delimiter("data.csv")
print(f"Detected delimiter: {delimiter}")
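The internals of `detect_delimiter` are not documented here. As a rough illustration of how delimiter detection can work in general, the sketch below uses the standard library's `csv.Sniffer`. The helper name `sniff_delimiter` and the candidate list are assumptions made for this example, not part of the package.

```python
import csv

def sniff_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter of CSV text with the standard library's csv.Sniffer.

    Hypothetical helper for illustration; not the package's detect_delimiter.
    """
    try:
        return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer raises csv.Error when it cannot decide; fall back to a comma.
        return ","

sample = "name;age;city\nAda;36;London\nLinus;54;Portland\n"
print(sniff_delimiter(sample))
```

In practice the sample would be the first few kilobytes of the file rather than the whole file.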
Analysis Workflow
Generated Outputs
Statistical Reports
- Dataset Overview: Shape, data types, memory usage
- Descriptive Statistics: Mean, median, mode, standard deviation
- Distribution Analysis: Skewness, kurtosis, normality tests
- Categorical Analysis: Frequency tables and unique value counts
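To make the distribution measures concrete, here is a minimal sketch of skewness and excess kurtosis computed from central moments. It uses the population (biased) estimators; the package may use a different estimator, so treat the function name and the choice of estimator as assumptions.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (population)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    # Excess kurtosis subtracts 3 so that a normal distribution scores 0.
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

skew, kurt = skewness_and_excess_kurtosis([1, 1, 2, 2, 3, 15])
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A positive skewness indicates a longer right tail; a symmetric sample gives a value near zero.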
Data Quality Assessment
- Missing Values: Patterns, counts, and visualizations
- Outliers: IQR-based detection with statistical summaries
- Duplicates: Identification and detailed reporting
- Data Consistency: Type validation and integrity checks
Visualizations
- Distribution Plots: Histograms with KDE overlays
- Box Plots: Outlier visualization and quartile analysis
- Correlation Analysis: Heatmaps and relationship matrices
- Missing Data Patterns: Matrix plots and summary charts
Interactive Reports
- HTML Dashboard: Comprehensive overview with navigation
- Data Dictionary: Detailed variable descriptions
- Quality Summary: Actionable insights and recommendations
Output Structure
your_file_analysis/
├── your_file.csv # Copy of original data
├── dataset_info.txt # Basic dataset information
├── summary_statistics_all.txt # Comprehensive statistics
├── categorical_summary.txt # Categorical variable analysis
├── missing_values_report.txt # Missing data analysis
├── outliers_summary.txt # Outlier detection results
├── distinct_values_count_by_dtype.html # Interactive value explorer
└── visualization/ # Generated plots and charts
    ├── box_plots/
    ├── histograms/
    └── correlation_matrices/
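After a run, a pipeline may want to confirm that the reports listed in the tree above were written. The sketch below checks for them by name. The helper `missing_reports` is hypothetical, and the file list is copied from the tree; the tool may produce additional files.

```python
from pathlib import Path

# File names taken from the output tree above (assumed to be stable).
EXPECTED_REPORTS = [
    "dataset_info.txt",
    "summary_statistics_all.txt",
    "categorical_summary.txt",
    "missing_values_report.txt",
    "outliers_summary.txt",
    "distinct_values_count_by_dtype.html",
]

def missing_reports(output_dir):
    """Return the expected report files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_REPORTS if not (root / name).is_file()]

# Example: list whatever is absent from a finished analysis directory.
print(missing_reports("your_file_analysis"))
```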
Advanced Features
Missing Value Analysis
- Automatic detection of missing value patterns
- Visualization of missing data distribution
- Imputation suggestions and options
- Missing value correlation analysis
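As an illustration of the kind of per-column count a missing value report is built on, the sketch below tallies missing cells with the standard library. The set of tokens treated as missing and the helper name `missing_counts` are assumptions; the package's own rules may differ.

```python
import csv
import io

# Assumed markers for a missing cell (compared case-insensitively).
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}

def missing_counts(csv_text, delimiter=","):
    """Count missing cells per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    counts = {name: 0 for name in reader.fieldnames or []}
    for row in reader:
        for name in counts:
            value = row.get(name)  # None when the row is shorter than the header
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return counts

text = "id,age,city\n1,34,Paris\n2,,NA\n3,29,\n"
print(missing_counts(text))
```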
Outlier Detection
- IQR-based outlier identification
- Statistical summaries for outliers
- Visual outlier highlighting in plots
- Outlier impact assessment
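The IQR rule mentioned above flags values that fall more than a multiple of the interquartile range outside the quartiles. A minimal sketch follows, assuming the conventional multiplier of 1.5 and linearly interpolated quartiles; the package's exact parameters are not stated here, and `iqr_outliers` is a hypothetical name.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower, upper)) using the IQR rule with multiplier k."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper], (lower, upper)

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers, bounds = iqr_outliers(data)
print(outliers, bounds)  # [102] (9.375, 16.375)
```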
Statistical Testing
- Normality tests (Shapiro-Wilk)
- Correlation analysis (Pearson, Spearman)
- Chi-square tests for categorical variables
- Variance inflation factor (VIF) analysis
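To show how the two correlation measures differ, the sketch below computes Pearson's coefficient directly and Spearman's as Pearson on average ranks. It is a plain standard-library illustration, not the package's implementation; all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear in x
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

Because `y` increases monotonically with `x`, the Spearman coefficient is 1 while the Pearson coefficient stays below 1.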
Relationship Analysis
- Variable correlation matrices
- Target variable analysis (if specified)
- Feature importance insights
- Interaction effect detection
Examples
Basic CSV Analysis
from autocsv_profiler import auto_csv_profiler
# Analyze sales data and write the results to ./sales_analysis
auto_csv_profiler.main("sales_data.csv", "sales_analysis")
Custom Analysis Pipeline
from autocsv_profiler import auto_csv_profiler
from autocsv_profiler.recognize_delimiter import detect_delimiter
import pandas as pd
# Detect the delimiter and load the data for your own pre-checks
delimiter = detect_delimiter("customer_data.csv")
df = pd.read_csv("customer_data.csv", delimiter=delimiter)
print(df.shape)
# Run the comprehensive analysis on the same file
auto_csv_profiler.main("customer_data.csv", "customer_analysis")
Batch Processing
import os
from autocsv_profiler import auto_csv_profiler
# Analyze all CSV files in a directory
for filename in os.listdir("data/"):
    if filename.endswith(".csv"):
        input_file = f"data/{filename}"
        output_dir = f"analysis/{filename[:-4]}_results"
        auto_csv_profiler.main(input_file, output_dir)
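For larger batches it can help to separate planning from execution, so the list of jobs can be inspected or logged before anything runs. The sketch below builds the same input and output pairs as the loop above using `pathlib`. The helper `plan_batch` is hypothetical and does not call the profiler itself.

```python
from pathlib import Path

def plan_batch(data_dir, analysis_dir):
    """Return (input_file, output_dir) pairs for every .csv file in data_dir."""
    jobs = []
    # sorted() gives a stable order regardless of the file system's listing order.
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        out_dir = Path(analysis_dir) / f"{csv_path.stem}_results"
        jobs.append((str(csv_path), str(out_dir)))
    return jobs

for input_file, output_dir in plan_batch("data", "analysis"):
    print(input_file, "->", output_dir)
```

Each pair can then be passed to `auto_csv_profiler.main(input_file, output_dir)` as in the loop above.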
Requirements
- Python 3.9 or higher
- pandas >= 1.5.0
- numpy >= 1.24.0
- matplotlib >= 3.6.0
- seaborn >= 0.12.0
- scipy >= 1.10.0
- scikit-learn >= 1.2.0
- statsmodels >= 0.13.0
All dependencies are automatically installed with pip.
Performance Tips
- Large Files: For files > 100MB, consider sampling first
- Memory Usage: Monitor memory for datasets with many categorical variables
- Output Management: Clean old analysis directories to save disk space
- Parallel Processing: Use batch scripts for multiple files
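One way to sample a large file without loading it into memory is reservoir sampling, which keeps a fixed-size uniform sample while streaming through the rows once. This is a sketch under the assumption that a uniform random sample is acceptable for your analysis; `sample_rows` is a hypothetical helper, not part of the package.

```python
import csv
import random

def sample_rows(lines, k, seed=0):
    """Reservoir-sample k data rows from an iterable of CSV lines, keeping the header."""
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)  # fixed seed for reproducible samples
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over all rows seen so far
            if j < k:
                reservoir[j] = row      # replace with probability k / (i + 1)
    return header, reservoir

lines = ["id,value\n"] + [f"{i},{i * 2}\n" for i in range(1000)]
header, rows = sample_rows(lines, k=5)
print(header, len(rows))
```

An open file object can be passed directly as `lines`, for example `open("big.csv", newline="")`, and the sampled rows written to a smaller CSV before profiling.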
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: GitHub Docs
- Changelog: CHANGELOG.md
Version
Current version: 1.1.0
See CHANGELOG.md for version history and updates.
File details
Details for the file autocsv_profiler-1.1.0.tar.gz.
File metadata
- Download URL: autocsv_profiler-1.1.0.tar.gz
- Upload date:
- Size: 60.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a7e0c3002c7c3e8fe58dd2425c226e5168ffbe5ce86aba6c0775bc2b6b4244c1 |
| MD5 | b6a66e6e60b39a3151c68023fd8c47df |
| BLAKE2b-256 | 537fedbc5be51567e380519ee7f1bb0308e5e4a7d0ab888d27c92969e1f0edd0 |
File details
Details for the file autocsv_profiler-1.1.0-py3-none-any.whl.
File metadata
- Download URL: autocsv_profiler-1.1.0-py3-none-any.whl
- Upload date:
- Size: 41.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 355bbcceb5edc15a9841aa6b8c427867b6981c36dec4815865a48ace26777558 |
| MD5 | 2c29c47669673837ff0cd443addda983 |
| BLAKE2b-256 | 3f8febe85d7261f1f8c32c141575b5a359edbcbad98cbc97e6b6c132f02c4c73 |