A tool for analyzing and describing CSV files

Project description

DescribeCSV

License: MIT | Python 3.10+

A Python tool for analyzing and describing CSV files. It reports file structure, data types, missing values, and statistical summaries, and is intended for initial data exploration and quality assessment of large CSV files.

Features

  • Automatic encoding detection and handling
  • Memory-efficient processing of large files through chunking
  • Comprehensive column analysis including:
    • Data types and structure
    • Missing value detection and statistics
    • Unique value counts and distributions
    • Statistical summaries for numeric columns
    • Most frequent values for categorical columns
  • Smart detection of numeric data stored as strings
  • Duplicate row detection and counting
  • Detailed file metadata information

Installation

You can install describecsv using pip:

pip install describecsv

Or using uv for faster installation:

uv pip install describecsv

Usage

From the command line:

describecsv path/to/your/file.csv

This will create a JSON file named your_file_details.json in the same directory as your CSV file.

Output Example

The tool generates a detailed JSON report. Here's a sample of what you'll get:

{
  "basic_info": {
    "file_info": {
      "file_name": "example.csv",
      "size_mb": 125.4,
      "created_date": "2024-02-21T10:30:00",
      "encoding": "utf-8"
    },
    "num_rows": 100000,
    "num_columns": 15,
    "missing_cells": 1234,
    "missing_percentage": 0.08,
    "duplicate_rows": 42,
    "duplicate_percentage": 0.042
  },
  "column_analysis": {
    "age": {
      "data_type": "int64",
      "unique_value_count": 75,
      "missing_value_count": 12,
      "mean_value": 34.5,
      "std_dev": 12.8,
      "min_value": 18.0,
      "max_value": 99.0
    },
    "category": {
      "data_type": "object",
      "unique_value_count": 5,
      "missing_value_count": 0,
      "top_3_values": {
        "A": 45000,
        "B": 30000,
        "C": 25000
      },
      "optimization_suggestion": "Consider using category dtype"
    }
  }
}
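
Because the report is plain JSON, it can be post-processed with the standard library. The sketch below shows one possible way to pull out the columns that have missing values and any optimization suggestions. `SAMPLE_REPORT` and `summarize_report` are hypothetical names introduced for illustration, and the embedded data is a trimmed copy of the sample above, not real tool output.

```python
import json

# Trimmed copy of the sample report shown above. A real report would be
# read from the generated JSON file instead of an inline string.
SAMPLE_REPORT = """
{
  "basic_info": {"num_rows": 100000, "num_columns": 15, "duplicate_rows": 42},
  "column_analysis": {
    "age": {"data_type": "int64", "missing_value_count": 12},
    "category": {
      "data_type": "object",
      "missing_value_count": 0,
      "optimization_suggestion": "Consider using category dtype"
    }
  }
}
"""


def summarize_report(report):
    """Collect columns with missing values and columns with suggestions."""
    columns = report["column_analysis"]
    return {
        "columns_with_missing": {
            name: info["missing_value_count"]
            for name, info in columns.items()
            if info.get("missing_value_count", 0) > 0
        },
        "suggestions": {
            name: info["optimization_suggestion"]
            for name, info in columns.items()
            if "optimization_suggestion" in info
        },
    }


summary = summarize_report(json.loads(SAMPLE_REPORT))
print(summary["columns_with_missing"])  # {'age': 12}
print(summary["suggestions"])  # {'category': 'Consider using category dtype'}
```

The key names (`column_analysis`, `missing_value_count`, `optimization_suggestion`) are taken from the sample report; whether every column carries every key is an assumption, which is why the sketch uses `.get` and a membership test.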

Features in Detail

Encoding Detection

  • Automatically detects file encoding
  • Handles common encodings (UTF-8, Latin-1, etc.)
  • Provides fallback options for difficult files
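
How describecsv detects encodings internally is not described here. As a rough illustration of the detect-then-fall-back idea, the sketch below tries a short list of encodings in order on a byte sample. `guess_encoding` and the candidate list are assumptions made for this example, not the tool's API.

```python
import codecs

# Candidate encodings, tried in order. Latin-1 goes last because it maps
# every possible byte to a character and therefore never fails.
CANDIDATE_ENCODINGS = ("utf-8", "latin-1")


def guess_encoding(sample, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes the byte sample."""
    for encoding in candidates:
        # An incremental decoder with final=False tolerates a multi-byte
        # character that was cut off at the end of the sample.
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("none of the candidate encodings could decode the sample")


print(guess_encoding("café,1\n".encode("utf-8")))  # utf-8
print(guess_encoding(b"caf\xe9,1\n"))  # latin-1
```

In practice the sample would come from the first block of the file, for example `open(path, "rb").read(100_000)`. A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the one the file was written in.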

Memory Efficiency

  • Processes files in chunks
  • Optimizes data types automatically
  • Suitable for large CSV files
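
Chunked processing means that only a bounded number of rows is held in memory at a time while running totals are accumulated. The following is a minimal standard-library sketch of that idea, not the tool's actual implementation; `iter_chunks`, `count_rows_and_missing`, and the default chunk size are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size=10_000):
    """Yield (header, rows) pairs with at most chunk_size rows per chunk."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            return
        yield header, rows


def count_rows_and_missing(lines, chunk_size=10_000):
    """Accumulate row and empty-cell counts one chunk at a time."""
    total_rows = 0
    missing_cells = 0
    for _header, rows in iter_chunks(lines, chunk_size):
        total_rows += len(rows)
        missing_cells += sum(
            1 for row in rows for cell in row if cell.strip() == ""
        )
    return total_rows, missing_cells


# Three data rows, two of them with one empty cell, read two rows at a time.
sample = io.StringIO("a,b\n1,\n,2\n3,4\n")
print(count_rows_and_missing(sample, chunk_size=2))  # (3, 2)
```

For a file on disk, an open handle can be passed instead of the `StringIO` object, for example `open(path, newline="", encoding="utf-8")`. Treating only empty strings as missing is an assumption of this sketch.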

Data Quality Checks

  • Identifies potential data type mismatches
  • Suggests optimizations for categorical columns
  • Reports duplicate rows and missing values
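
The checks above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the tool's code: the function names, the 0.5 uniqueness ratio, and the rule that every repeat after the first counts as a duplicate are all hypothetical. Only the suggestion text is taken from the sample report.

```python
from collections import Counter


def looks_numeric(values):
    """Return True if every non-empty string in values parses as a float."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    for text in non_empty:
        try:
            float(text)
        except ValueError:
            return False
    return True


def suggest_category(values, max_unique_ratio=0.5):
    """Suggest a category dtype when few distinct values repeat often."""
    if not values:
        return None
    unique_ratio = len(set(values)) / len(values)
    if unique_ratio < max_unique_ratio:
        return "Consider using category dtype"
    return None


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row (the first copy is not counted)."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())


print(looks_numeric(["1", "2.5", "", "-3"]))  # True
print(looks_numeric(["1", "abc"]))  # False
print(suggest_category(["A", "B", "A", "A", "B", "A"]))  # Consider using category dtype
print(count_duplicate_rows([["1", "a"], ["1", "a"], ["2", "b"], ["1", "a"]]))  # 2
```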

Statistical Analysis

  • Comprehensive numeric column statistics
  • Frequency analysis for categorical data
  • Missing value patterns
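
As a rough sketch of what these per-column statistics amount to, the example below computes a numeric summary and a top-values table with the standard library. The output keys mirror the sample report (`mean_value`, `std_dev`, `min_value`, `max_value`, `top_3_values`); the function names and the choice of the sample standard deviation are assumptions.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric column; needs at least two values for std_dev."""
    return {
        "mean_value": statistics.fmean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min_value": min(values),
        "max_value": max(values),
    }


def top_values(values, n=3):
    """Return the n most frequent values with their counts."""
    return dict(Counter(values).most_common(n))


print(numeric_summary([1.0, 2.0, 3.0]))
# {'mean_value': 2.0, 'std_dev': 1.0, 'min_value': 1.0, 'max_value': 3.0}
print(top_values(["A", "A", "B", "C", "A", "B"]))
# {'A': 3, 'B': 2, 'C': 1}
```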

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

describecsv-0.1.4.tar.gz (7.2 kB view details)

Uploaded Source

Built Distribution

describecsv-0.1.4-py3-none-any.whl (7.9 kB view details)

Uploaded Python 3

File details

Details for the file describecsv-0.1.4.tar.gz.

File metadata

  • Download URL: describecsv-0.1.4.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for describecsv-0.1.4.tar.gz

Algorithm    Hash digest
SHA256       c32da582fffbaaaeb76740556c0b2c601e5967cbd8ffabd0409733e195fbfc80
MD5          c74e644bb8d12cc09abfd74cd7d946f6
BLAKE2b-256  9f4b786fb4be1b59e1ac015867a90a5ec8667e1951f90845207452d88da6940a

File details

Details for the file describecsv-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: describecsv-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 7.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for describecsv-0.1.4-py3-none-any.whl

Algorithm    Hash digest
SHA256       4e533d720c55454cbfe79969ba0e6a7ed39b4205e6506c245a1d5d1787341216
MD5          1e9b2c8a200fc6bfbce839e479b79df7
BLAKE2b-256  4845e8b1af8c72e6bbb4d270e31b6c5e3acaa7a8cec3b0aa6b4ebec6f311233a
