Comprehensive automated CSV data analysis with statistical insights and visualizations

AutoCSV Profiler

A comprehensive toolkit for automated CSV data analysis providing statistical insights, data quality assessment, and interactive visualizations.

Features

  • Comprehensive Statistical Analysis: Descriptive statistics, distributions, and data summaries
  • Data Quality Assessment: Missing value analysis, outlier detection, and duplicate identification
  • Advanced Visualizations: Box plots, histograms, correlation matrices, and KDE plots
  • Interactive Reports: HTML reports with detailed insights and recommendations
  • Command-Line Interface: Easy-to-use CLI for immediate analysis
  • Python API: Programmatic access for integration into data pipelines

Installation

pip install autocsv-profiler

Quick Start

Command Line Usage

# Basic analysis
autocsv-profiler data.csv

# Specify output directory
autocsv-profiler data.csv --output ./my_analysis

# Custom delimiter
autocsv-profiler data.csv --delimiter ";"

Python API Usage

from autocsv_profiler import auto_csv_profiler

# Run comprehensive analysis
auto_csv_profiler.main("data.csv", "output_directory")

# Or import specific functions
from autocsv_profiler.recognize_delimiter import detect_delimiter

delimiter = detect_delimiter("data.csv")
print(f"Detected delimiter: {delimiter}")
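
For readers who want to see what delimiter detection can look like, here is a minimal sketch built on the standard library's csv.Sniffer. It is not the implementation behind detect_delimiter; the function name sniff_delimiter, the candidate set, and the comma fallback are assumptions made for illustration.

```python
import csv


def sniff_delimiter(path, candidates=",;\t|", sample_chars=64 * 1024):
    """Guess a CSV file's delimiter from the start of the file.

    Hypothetical helper: a stand-in for illustration, not the package's
    detect_delimiter.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        sample = handle.read(sample_chars)
    try:
        # Restrict the sniffer to a small set of plausible delimiters.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not decide; fall back to a comma (assumed default).
        return ","
```

Reading only the first part of the file keeps the check cheap on large inputs, at the cost of missing irregularities that appear later.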

Analysis Workflow

[Figure: Analysis Workflow]

Generated Outputs

Statistical Reports

  • Dataset Overview: Shape, data types, memory usage
  • Descriptive Statistics: Mean, median, mode, standard deviation
  • Distribution Analysis: Skewness, kurtosis, normality tests
  • Categorical Analysis: Frequency tables and unique value counts

Data Quality Assessment

  • Missing Values: Patterns, counts, and visualizations
  • Outliers: IQR-based detection with statistical summaries
  • Duplicates: Identification and detailed reporting
  • Data Consistency: Type validation and integrity checks

Visualizations

  • Distribution Plots: Histograms with KDE overlays
  • Box Plots: Outlier visualization and quartile analysis
  • Correlation Analysis: Heatmaps and relationship matrices
  • Missing Data Patterns: Matrix plots and summary charts

Interactive Reports

  • HTML Dashboard: Comprehensive overview with navigation
  • Data Dictionary: Detailed variable descriptions
  • Quality Summary: Actionable insights and recommendations

Output Structure

your_file_analysis/
├── your_file.csv                     # Copy of original data
├── dataset_info.txt                  # Basic dataset information
├── summary_statistics_all.txt        # Comprehensive statistics
├── categorical_summary.txt           # Categorical variable analysis
├── missing_values_report.txt         # Missing data analysis
├── outliers_summary.txt              # Outlier detection results
├── distinct_values_count_by_dtype.html # Interactive value explorer
└── visualization/                    # Generated plots and charts
    ├── box_plots/
    ├── histograms/
    └── correlation_matrices/

Advanced Features

Missing Value Analysis

  • Automatic detection of missing value patterns
  • Visualization of missing data distribution
  • Imputation suggestions and options
  • Missing value correlation analysis
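
As a rough illustration of the per-column counts such a report is built from, the sketch below tallies missing cells using only the standard library. The helper name missing_counts and the set of tokens treated as missing are assumptions; the package may use different markers and logic.

```python
import csv
from collections import Counter

# Assumed markers for a missing cell; adjust to match your data.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def missing_counts(path, delimiter=","):
    """Return (row_count, {column: missing_cell_count}) for a CSV file."""
    counts = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row in reader:
            rows += 1
            for column in reader.fieldnames:
                value = row.get(column)
                # Short rows yield None; blank or marker strings also count.
                if value is None or value.strip().lower() in MISSING_TOKENS:
                    counts[column] += 1
    return rows, dict(counts)
```

Columns with no missing cells are left out of the returned dictionary, so an empty dictionary means the file is complete under these assumptions.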

Outlier Detection

  • IQR-based outlier identification
  • Statistical summaries for outliers
  • Visual outlier highlighting in plots
  • Outlier impact assessment
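
The IQR rule itself is short enough to sketch. The version below flags values more than k times the interquartile range beyond the first or third quartile. The multiplier k = 1.5 and the "inclusive" quartile method are conventional choices assumed here, not settings confirmed for this package, and iqr_outliers is a hypothetical name.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the IQR rule."""
    # "inclusive" interpolates between data points (an assumed convention).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


outliers, fences = iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95])
print(outliers)  # [95]
print(fences)    # (8.5, 14.5)
```

Different quartile conventions shift the fences slightly, so borderline values may be flagged by one tool and not by another.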

Statistical Testing

  • Normality tests (Shapiro-Wilk)
  • Correlation analysis (Pearson, Spearman)
  • Chi-square tests for categorical variables
  • Variance inflation factor (VIF) analysis
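
For reference, the variance inflation factor of predictor j is conventionally defined from the coefficient of determination obtained by regressing that predictor on all the others. This is the standard textbook definition, given here as background rather than as a description of the package's internals.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Here R_j^2 is the R-squared of the regression of predictor j on the remaining predictors. A value of 1 means no linear dependence on the others; the value grows without bound as R_j^2 approaches 1.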

Relationship Analysis

  • Variable correlation matrices
  • Target variable analysis (if specified)
  • Feature importance insights
  • Interaction effect detection

Examples

Basic CSV Analysis

from autocsv_profiler import auto_csv_profiler

# Analyze sales data
auto_csv_profiler.main("sales_data.csv", "sales_analysis")

Custom Analysis Pipeline

from autocsv_profiler import auto_csv_profiler
from autocsv_profiler.recognize_delimiter import detect_delimiter
import pandas as pd

# Detect the delimiter, then load the data for a quick sanity check
delimiter = detect_delimiter("customer_data.csv")
df = pd.read_csv("customer_data.csv", sep=delimiter)
print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")

# Run comprehensive analysis
auto_csv_profiler.main("customer_data.csv", "customer_analysis")

Batch Processing

from pathlib import Path
from autocsv_profiler import auto_csv_profiler

# Analyze all CSV files in a directory
for csv_path in sorted(Path("data").glob("*.csv")):
    # Name each output directory after the input file, e.g. analysis/sales_results
    output_dir = Path("analysis") / f"{csv_path.stem}_results"
    auto_csv_profiler.main(str(csv_path), str(output_dir))
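
The loop above processes files one at a time. When there are many files, one option is to launch several command-line runs concurrently. The sketch below assumes the autocsv-profiler command and its --output flag behave as shown under Command Line Usage; build_command, run_batch, and the default of four workers are hypothetical choices for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(csv_path, out_root):
    """Build the CLI invocation for one file (flags as in Command Line Usage)."""
    out_dir = Path(out_root) / f"{Path(csv_path).stem}_results"
    return ["autocsv-profiler", str(csv_path), "--output", str(out_dir)]


def run_batch(data_dir, out_root, max_workers=4):
    """Run the CLI on every *.csv in data_dir; return {input_path: exit_code}."""
    commands = [
        build_command(path, out_root)
        for path in sorted(Path(data_dir).glob("*.csv"))
    ]
    # Threads only wait on child processes, so each analysis runs in its own
    # interpreter and the work proceeds in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=False), commands))
    return {cmd[1]: result.returncode for cmd, result in zip(commands, results)}
```

A non-zero exit code in the returned mapping marks a file whose analysis failed, which makes it easy to retry only those inputs. Keep max_workers modest, since each run loads its whole file into memory.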

Requirements

  • Python 3.9 or higher
  • pandas >= 1.5.0
  • numpy >= 1.24.0
  • matplotlib >= 3.6.0
  • seaborn >= 0.12.0
  • scipy >= 1.10.0
  • scikit-learn >= 1.2.0
  • statsmodels >= 0.13.0

All dependencies are automatically installed with pip.

Performance Tips

  • Large Files: For files larger than 100 MB, consider profiling a random sample first
  • Memory Usage: Monitor memory for datasets with many categorical variables
  • Output Management: Clean old analysis directories to save disk space
  • Parallel Processing: Use batch scripts for multiple files
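
One way to sample a large file without loading it into memory is reservoir sampling, sketched below with the standard library only. The helper name sample_csv and its parameters are assumptions for illustration; the sketch also assumes the first line of the file is a header row.

```python
import csv
import random


def sample_csv(src, dst, n_rows, seed=0, delimiter=","):
    """Write a uniform random sample of up to n_rows data rows to dst.

    Uses reservoir sampling (Algorithm R), so the file is read once and only
    n_rows rows are held in memory. Returns the number of rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)  # assumes a header row is present
        for index, row in enumerate(reader):
            if index < n_rows:
                reservoir.append(row)
            else:
                # Keep the new row with probability n_rows / (index + 1).
                slot = rng.randint(0, index)
                if slot < n_rows:
                    reservoir[slot] = row
    with open(dst, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

The sampled file can then be passed to the tool as usual, for example autocsv-profiler sample.csv. Row order is not preserved in the sample, and statistics such as duplicate counts will differ from those of the full file.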

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version

Current version: 1.1.0

See CHANGELOG.md for version history and updates.
