A tool for analyzing and describing CSV files
DescribeCSV
A Python tool for analyzing and describing CSV files. It provides detailed information about file structure, data types, missing values, and statistical summaries. It produces a Markdown description by default and can also produce JSON. It is well suited to initial data exploration and quality assessment of large CSV files.
Features
- Automatic encoding detection and handling
- Memory-efficient processing of large files through chunking
- Comprehensive column analysis including:
  - Data types and structure
  - Missing value detection and statistics
  - Unique value counts and distributions
  - Statistical summaries for numeric columns
  - Most frequent values for categorical columns
- Smart detection of numeric data stored as strings
- Duplicate row detection and counting
- Detailed file metadata information
Installation
You can install describecsv using pip:
pip install describecsv
Or using uv for faster installation:
uv tool install describecsv
Usage
By default, describecsv will output a markdown file with a description of the CSV file.
describecsv path/to/your/your_file.csv
This will create a markdown file named your_file_details.md in the same directory as your CSV file.
You can also specify the output format:
describecsv path/to/your/your_file.csv --format json
describecsv path/to/your/your_file.csv --format markdown
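If you want to run the tool from a Python script, for example over several files, the commands above can be wrapped with `subprocess`. This is a minimal sketch: `build_command` and `run_describecsv` are hypothetical helper names, not part of describecsv, and the sketch assumes the `describecsv` executable is on your `PATH`. Only the `--format` values shown above are used.

```python
import subprocess

# The two output formats documented above.
SUPPORTED_FORMATS = ("markdown", "json")


def build_command(csv_path, output_format="markdown"):
    """Build the argument list for one describecsv invocation."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    return ["describecsv", str(csv_path), "--format", output_format]


def run_describecsv(csv_path, output_format="markdown"):
    """Run describecsv and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(csv_path, output_format), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.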
Output Example
The tool generates a detailed markdown report. Here's a sample of what you'll get:
# CSV File Analysis
## File: your_file.csv
- **Directory:** /path/to/your
- **Size:** 125.4 MB
- **Encoding:** utf-8
- **Created Date:** 2024-02-21T10:30:00
- **Modified Date:** 2024-02-21T10:30:00
## Basic Statistics
- **Number of Rows:** 100000
- **Number of Columns:** 15
- **Total Cells:** 1500000
- **Missing Cells:** 1234 (0.082%)
- **Duplicate Rows:** 42 (0.042%)
## Column Analysis
### Column: age
- **Data Type:** int64
- **Unique Values:** 75
- **Missing Values:** 12 (0.012%)
- **Mean:** 34.5
- **Standard Deviation:** 12.8
- **Minimum Value:** 18.0
- **Maximum Value:** 99.0
- **Median:** 32
### Column: category
- **Data Type:** object
- **Unique Values:** 5
- **Missing Values:** 0 (0.0%)
- **Top 3 Values:**
  - A: 45000
  - B: 30000
  - C: 20000
- **Mode:** A
- **Top 3 Values Percentage of Total:** 95.0%
The tool can also generate a detailed JSON report.
Features in Detail
Encoding Detection
- Automatically detects file encoding
- Handles common encodings (UTF-8, Latin-1, etc.)
- Provides fallback options for difficult files
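One common way to implement detection with fallbacks is to try a short list of encodings in order, from strictest to most permissive. The sketch below illustrates that idea only; it is not describecsv's actual implementation, and the candidate list and function names are assumptions.

```python
import codecs

# Hypothetical candidate order, tried from strictest to most permissive.
# latin-1 maps every byte to a character, so it serves as the last resort.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def guess_encoding(sample: bytes, truncated: bool = False) -> str:
    """Return the first candidate encoding that decodes the sample."""
    for name in CANDIDATE_ENCODINGS:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            # final=False tolerates a multi-byte character cut off at the
            # end of a truncated sample.
            decoder.decode(sample, final=not truncated)
        except UnicodeDecodeError:
            continue
        return name
    return "latin-1"


def guess_file_encoding(path, sample_size: int = 64 * 1024) -> str:
    """Guess the encoding of a file from its first sample_size bytes."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
        truncated = handle.read(1) != b""
    return guess_encoding(sample, truncated)
```

Reading only a sample keeps the check cheap on large files, at the cost of missing invalid bytes that appear later in the file.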
Memory Efficiency
- Processes files in chunks
- Optimizes data types automatically
- Suitable for large CSV files
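Chunked processing means that only a bounded number of rows is held in memory at a time while running totals are updated. The following sketch shows the pattern with the standard library's `csv` module; describecsv's own implementation is not shown in this document and may differ, and the function names and default chunk size are illustrative.

```python
import csv
from collections import Counter
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_missing(lines, chunk_size=10_000):
    """Count rows and empty cells per column, one chunk at a time.

    `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.DictReader(lines)
    total_rows = 0
    missing = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        total_rows += len(chunk)
        for row in chunk:
            for column, value in row.items():
                if column is None:
                    continue  # extra fields beyond the header row
                if value is None or value.strip() == "":
                    missing[column] += 1
    return total_rows, dict(missing)
```

Only the counters persist between chunks, so memory use depends on the chunk size and the number of columns rather than on the file size.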
Data Quality Checks
- Identifies potential data type mismatches
- Reports duplicate rows and missing values
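Both checks can be sketched in a few lines. The example below flags a string column as numeric when most of its non-empty values parse as numbers, and counts every repeat of an earlier row as a duplicate. The 95% threshold, the handling of thousands separators, and the function names are assumptions made for illustration, not describecsv's documented behavior.

```python
def looks_numeric(values, threshold=0.95):
    """Return True if most non-empty strings in values parse as numbers."""
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    if not non_empty:
        return False

    def parses(text):
        try:
            float(text.replace(",", ""))  # tolerate "1,200"-style separators
        except ValueError:
            return False
        return True

    hits = sum(parses(v) for v in non_empty)
    return hits / len(non_empty) >= threshold


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```

Note that `count_duplicate_rows` keeps every distinct row in memory, so on very large files a hash of each row would be a lighter key.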
Statistical Analysis
- Comprehensive numeric column statistics
- Frequency analysis for categorical data
- Missing value patterns
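The fields in the sample report (mean, standard deviation, minimum, maximum, median, top values, mode) can be reproduced with the standard library. This is a sketch of the calculations only: the function names and dictionary keys are illustrative, and it uses the sample standard deviation, which may not match the tool's choice.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric column; needs at least two values for stdev."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


def top_values(values, n=3):
    """Return the n most frequent values, the mode, and their share of rows."""
    counts = Counter(values)
    top = counts.most_common(n)
    share = 100 * sum(count for _, count in top) / len(values)
    return {"top": top, "mode": top[0][0], "top_share_percent": share}
```

For example, `top_values(["A"] * 5 + ["B"] * 3 + ["C"] * 2, n=2)` reports `A` as the mode and a top-2 share of 80.0 percent.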
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.