A tool for analyzing and describing CSV files

DescribeCSV

License: MIT | Python 3.10+

A Python tool for analyzing and describing CSV files. It reports file structure, data types, missing values, and statistical summaries. Output is a markdown description by default, with JSON available as an alternative. It is aimed at initial data exploration and quality assessment of large CSV files.

Features

  • Automatic encoding detection and handling
  • Memory-efficient processing of large files through chunking
  • Comprehensive column analysis including:
    • Data types and structure
    • Missing value detection and statistics
    • Unique value counts and distributions
    • Statistical summaries for numeric columns
    • Most frequent values for categorical columns
  • Smart detection of numeric data stored as strings
  • Duplicate row detection and counting
  • Detailed file metadata information
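
As an illustration of the "numeric data stored as strings" feature, here is a minimal sketch of one way such a check could work. It is not describecsv's actual code: the function name `looks_numeric`, the 95% threshold, and the stripping of thousands separators are all assumptions made for this example.

```python
def looks_numeric(values, threshold=0.95):
    """Return True if at least `threshold` of the non-empty strings parse as floats.

    Hypothetical helper: the name and the threshold are illustrative assumptions.
    """
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    if not non_empty:
        return False
    parsed = 0
    for text in non_empty:
        try:
            # Assumption: commas are thousands separators, e.g. "1,234".
            float(text.replace(",", ""))
            parsed += 1
        except ValueError:
            pass
    return parsed / len(non_empty) >= threshold


print(looks_numeric(["1", "2.5", " 3 ", ""]))
print(looks_numeric(["a", "b", "1"]))
```

A threshold below 100% tolerates a few stray entries such as "N/A" in an otherwise numeric column; a stricter or looser cutoff may suit other data.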

Installation

You can install describecsv using pip:

pip install describecsv

Or using uv for faster installation:

uv tool install describecsv

Usage

By default, describecsv will output a markdown file with a description of the CSV file.

describecsv path/to/your/your_file.csv

This will create a markdown file named your_file_details.md in the same directory as your CSV file.
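
The naming rule described above (input stem plus `_details.md`, same directory) can be sketched with `pathlib`. The helper name `default_report_path` is hypothetical, and the sketch covers only the markdown case, since the text does not state the name used for JSON output.

```python
from pathlib import Path


def default_report_path(csv_path: str) -> Path:
    """Return <stem>_details.md next to the input CSV (hypothetical helper)."""
    source = Path(csv_path)
    return source.with_name(f"{source.stem}_details.md")


print(default_report_path("path/to/your/your_file.csv").name)
```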

You can also specify the output format:

describecsv path/to/your/your_file.csv --format json
describecsv path/to/your/your_file.csv --format markdown

Output Example

The tool generates a detailed markdown report. Here's a sample of what you'll get:

# CSV File Analysis

## File: your_file.csv

- **Directory:** /path/to/your
- **Size:** 125.4 MB
- **Encoding:** utf-8
- **Created Date:** 2024-02-21T10:30:00
- **Modified Date:** 2024-02-21T10:30:00

## Basic Statistics

- **Number of Rows:** 100000
- **Number of Columns:** 15
- **Total Cells:** 1500000
- **Missing Cells:** 1234 (0.08%)
- **Duplicate Rows:** 42 (0.042%)

## Column Analysis

### Column: age

- **Data Type:** int64
- **Unique Values:** 75
- **Missing Values:** 12 (0.012%)
- **Mean:** 34.5
- **Standard Deviation:** 12.8
- **Minimum Value:** 18.0
- **Maximum Value:** 99.0
- **Median:** 32

### Column: category

- **Data Type:** object
- **Unique Values:** 5
- **Missing Values:** 0 (0.0%)
- **Top 3 Values:**
  - A: 45000
  - B: 30000
  - C: 20000
- **Mode:** A
- **Top 3 Values Percentage of Total:** 95.0%
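
To show how figures like those reported for a categorical column can be derived, here is a small sketch using `collections.Counter`. It is an assumption-laden illustration rather than the tool's implementation: the function name `top_values_summary` is invented, and the share is computed against all values passed in (whether describecsv includes missing values in the denominator is not stated).

```python
from collections import Counter


def top_values_summary(values, n=3):
    """Summarize a categorical column (hypothetical helper for illustration)."""
    counts = Counter(values)
    top = counts.most_common(n)
    total = len(values)
    share = 100 * sum(count for _, count in top) / total if total else 0.0
    return {
        "unique": len(counts),
        "top": top,
        "mode": top[0][0] if top else None,
        "top_share_pct": round(share, 1),
    }


sample = ["A"] * 9 + ["B"] * 6 + ["C"] * 4 + ["D"]
print(top_values_summary(sample))
```

In this toy column the three most frequent values cover 19 of 20 entries, so the reported share is 95.0%.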

The tool can also generate a detailed JSON report with --format json.

Features in Detail

Encoding Detection

  • Automatically detects file encoding
  • Handles common encodings (UTF-8, Latin-1, etc.)
  • Provides fallback options for difficult files

Memory Efficiency

  • Processes files in chunks
  • Optimizes data types automatically
  • Suitable for large CSV files
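
Chunked processing means reading a bounded number of rows at a time and keeping only running totals in memory. The following standard-library sketch counts rows and missing values per column this way; the names `iter_chunks` and `count_missing` and the default chunk size are illustrative assumptions, not the tool's internals.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any row iterator."""
    rows = iter(rows)
    while chunk := list(islice(rows, chunk_size)):
        yield chunk


def count_missing(file_obj, chunk_size=10_000):
    """Return (row_count, missing_per_column), holding one chunk at a time."""
    reader = csv.DictReader(file_obj)
    total_rows = 0
    missing = {name: 0 for name in reader.fieldnames or []}
    for chunk in iter_chunks(reader, chunk_size):
        total_rows += len(chunk)
        for row in chunk:
            for name in missing:
                # Empty or whitespace-only cells count as missing here.
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    return total_rows, missing


data = io.StringIO("age,category\n34,A\n,B\n29,\n")
print(count_missing(data, chunk_size=2))
```

Peak memory depends on the chunk size rather than the file size, which is the property that makes this approach suitable for large files.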

Data Quality Checks

  • Identifies potential data type mismatches
  • Reports duplicate rows and missing values
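
A duplicate-row count and its percentage can be sketched as follows. This assumes that every repeat after the first occurrence counts as a duplicate and that the percentage is taken over all data rows; the source does not say which convention describecsv follows, and `duplicate_row_stats` is a hypothetical name.

```python
import csv
import io


def duplicate_row_stats(file_obj):
    """Return (total_rows, duplicate_rows, duplicate_pct) for a CSV stream."""
    reader = csv.reader(file_obj)
    next(reader, None)  # skip the header row
    seen = set()
    total = duplicates = 0
    for row in reader:
        total += 1
        key = tuple(row)
        if key in seen:
            duplicates += 1  # a repeat of a row already seen
        else:
            seen.add(key)
    pct = round(100 * duplicates / total, 3) if total else 0.0
    return total, duplicates, pct


data = io.StringIO("id,label\n1,x\n2,y\n1,x\n1,x\n")
print(duplicate_row_stats(data))
```

This version keeps every distinct row in a set, so for very large files storing a hash of each row instead would reduce memory use.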

Statistical Analysis

  • Comprehensive numeric column statistics
  • Frequency analysis for categorical data
  • Missing value patterns
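
The numeric statistics listed in the sample report (mean, standard deviation, minimum, maximum, median) can be reproduced with the standard `statistics` module. This is a sketch under assumptions: `numeric_summary` is an invented name, empty cells are treated as missing, and the sample standard deviation (n - 1 denominator) is used, which may or may not match the tool.

```python
import statistics


def numeric_summary(raw_values):
    """Summarize a numeric column given as strings (hypothetical helper).

    Needs at least two non-missing values, since the sample standard
    deviation is undefined for fewer.
    """
    numbers = [float(v) for v in raw_values if v is not None and v.strip()]
    return {
        "missing": len(raw_values) - len(numbers),
        "mean": statistics.fmean(numbers),
        "std": statistics.stdev(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "median": statistics.median(numbers),
    }


print(numeric_summary(["18", "32", "", "40", "30"]))
```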

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
