A tool for analyzing and describing CSV files
DescribeCSV
A Python tool for analyzing and describing CSV files. It reports file structure, data types, missing values, and statistical summaries, and is intended for initial data exploration and quality assessment of large CSV files.
Features
- Automatic encoding detection and handling
- Memory-efficient processing of large files through chunking
- Comprehensive column analysis, including:
  - Data types and structure
  - Missing value detection and statistics
  - Unique value counts and distributions
  - Statistical summaries for numeric columns
  - Most frequent values for categorical columns
- Smart detection of numeric data stored as strings
- Duplicate row detection and counting
- Detailed file metadata information
Installation
You can install describecsv using pip:
pip install describecsv
Or, to install it as a standalone command-line tool in its own isolated environment using uv:
uv tool install describecsv
Usage
From the command line:
describecsv path/to/your/your_file.csv
This will create a JSON file named your_file.json in the same directory as your CSV file.
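If a script needs to pick up the report afterwards, the naming rule above is enough to locate it. The following is a minimal sketch that assumes only that rule (same directory, same base name, `.json` extension); `report_path_for` is a hypothetical helper and not part of describecsv.

```python
from pathlib import Path


def report_path_for(csv_path: str) -> Path:
    """Return the expected location of the JSON report for a CSV file.

    Hypothetical helper: it only applies the naming rule described above
    (same directory, same base name, .json extension).
    """
    return Path(csv_path).with_suffix(".json")


print(report_path_for("path/to/your/your_file.csv"))
```

The helper does not check that the report exists; call `.exists()` on the returned path after running the tool.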
Output Example
The tool generates a detailed JSON report. Here's a sample of what you'll get:
{
  "basic_info": {
    "file_info": {
      "file_name": "your_file.csv",
      "size_mb": 125.4,
      "created_date": "2024-02-21T10:30:00",
      "encoding": "utf-8"
    },
    "num_rows": 100000,
    "num_columns": 15,
    "missing_cells": 1234,
    "missing_percentage": 0.82,
    "duplicate_rows": 42,
    "duplicate_percentage": 0.042
  },
  "column_analysis": {
    "age": {
      "data_type": "int64",
      "unique_value_count": 75,
      "missing_value_count": 12,
      "mean_value": 34.5,
      "std_dev": 12.8,
      "min_value": 18.0,
      "max_value": 99.0
    },
    "category": {
      "data_type": "object",
      "unique_value_count": 5,
      "missing_value_count": 0,
      "top_3_values": {
        "A": 45000,
        "B": 30000,
        "C": 25000
      },
      "optimization_suggestion": "Consider using category dtype"
    }
  }
}
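Since the report is plain JSON, it can be post-processed with the standard library. The sketch below assumes the field names shown in the sample report (`column_analysis`, `missing_value_count`, `optimization_suggestion`); `columns_needing_attention` is a hypothetical helper written for illustration, and a real report may contain fields not handled here.

```python
import json


def columns_needing_attention(report: dict) -> dict:
    """Map each column name to a list of notes, using the sample report's fields."""
    notes = {}
    for name, info in report.get("column_analysis", {}).items():
        column_notes = []
        missing = info.get("missing_value_count", 0)
        if missing:
            column_notes.append(f"{missing} missing values")
        suggestion = info.get("optimization_suggestion")
        if suggestion:
            column_notes.append(suggestion)
        if column_notes:
            notes[name] = column_notes
    return notes


# A trimmed copy of the sample report, parsed from JSON text.
sample = json.loads("""
{
  "column_analysis": {
    "age": {"data_type": "int64", "missing_value_count": 12},
    "category": {
      "data_type": "object",
      "missing_value_count": 0,
      "optimization_suggestion": "Consider using category dtype"
    }
  }
}
""")

for column, column_notes in columns_needing_attention(sample).items():
    print(column, "->", "; ".join(column_notes))
```

For a report on disk, replace the inline sample with `json.load` on the opened report file.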
Features in Detail
Encoding Detection
- Automatically detects file encoding
- Handles common encodings (UTF-8, Latin-1, etc.)
- Provides fallback options for difficult files
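One common way to implement detection with fallbacks is to try a list of candidate encodings in order and keep the first one that decodes a sample of the file. The sketch below illustrates that idea only; it is not describecsv's actual implementation, and the function name, candidate list, and sample size are all assumptions.

```python
import codecs
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1"), sample_size=1_000_000):
    """Return the first candidate encoding that decodes a sample of the file.

    Returns None if no candidate works. Note that latin-1 accepts any byte
    sequence, so it acts as a last-resort fallback in the default list.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
        try:
            # final=False tolerates a multi-byte character cut off at the sample boundary.
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Demo: a small Latin-1 encoded file is not valid UTF-8, so a fallback is chosen.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "latin.csv")
    with open(sample_path, "wb") as handle:
        handle.write("name,city\nJosé,Málaga\n".encode("latin-1"))
    print(guess_encoding(sample_path))
```

A successful decode does not prove the encoding is correct, only that it is plausible, which is why the order of candidates matters.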
Memory Efficiency
- Processes files in chunks
- Optimizes data types automatically
- Suitable for large CSV files
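Chunked processing means reading a bounded number of rows at a time and folding each chunk into running totals, so memory use does not grow with file size. The following standard-library sketch shows the pattern for counting missing cells; it is an illustration rather than the tool's own code, and `count_missing_in_chunks` and its `chunk_size` default are assumptions. With pandas, the same pattern is usually written with the `chunksize` parameter of `read_csv`.

```python
import csv
import io
from collections import Counter
from itertools import islice


def count_missing_in_chunks(lines, chunk_size=10_000):
    """Count empty cells per column while holding at most chunk_size rows in memory."""
    reader = csv.DictReader(lines)
    missing = Counter()
    total_rows = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # read the next batch of rows
        if not chunk:
            break
        total_rows += len(chunk)
        for row in chunk:
            for column, value in row.items():
                if column is None:
                    continue  # extra fields beyond the header are ignored
                if value is None or value.strip() == "":
                    missing[column] += 1
    return total_rows, dict(missing)


# Demo on an in-memory CSV; a real file would be passed as an open text handle.
data = io.StringIO("id,age,city\n1,34,Paris\n2,,Lyon\n3,51,\n")
rows, missing = count_missing_in_chunks(data, chunk_size=2)
print(rows, missing)
```

Only the running counters survive between chunks, which is what keeps the memory footprint flat.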
Data Quality Checks
- Identifies potential data type mismatches
- Suggests optimizations for categorical columns
- Reports duplicate rows and missing values
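The first two checks can be approximated with simple heuristics: a string column is probably numeric if nearly all of its non-empty values parse as numbers, and it is a candidate for a categorical type if it has few distinct values relative to its length. The sketch below is one possible formulation; the function names and both thresholds are illustrative assumptions, not values taken from describecsv.

```python
def looks_numeric(values, threshold=0.95):
    """True if at least `threshold` of the non-empty strings parse as floats."""
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    if not non_empty:
        return False
    parsed = 0
    for text in non_empty:
        try:
            float(text)
        except ValueError:
            continue
        parsed += 1
    return parsed / len(non_empty) >= threshold


def suggest_category(values, max_unique_ratio=0.5):
    """True if the share of distinct non-empty values is below max_unique_ratio."""
    non_empty = [v for v in values if v is not None and v.strip()]
    if not non_empty:
        return False
    return len(set(non_empty)) / len(non_empty) < max_unique_ratio


print(looks_numeric(["1", "2.5", " 3 ", ""]))
print(suggest_category(["A", "B", "A", "A", "B", "A"]))
```

Note that `float` also accepts strings such as "nan" and "inf", so a stricter check may be needed for columns where those appear as text.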
Statistical Analysis
- Comprehensive numeric column statistics
- Frequency analysis for categorical data
- Missing value patterns
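To make the report fields concrete, the sketch below computes the same kinds of values for a single column using only the standard library. The key names mirror the sample report (`mean_value`, `std_dev`, `min_value`, `max_value`, `top_3_values`), but the functions are hypothetical, and whether the tool reports a sample or population standard deviation is not stated; the sample form is assumed here.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summary statistics for a non-empty list of numbers."""
    return {
        "mean_value": statistics.mean(values),
        # Sample standard deviation (assumption); undefined for a single value.
        "std_dev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min_value": min(values),
        "max_value": max(values),
    }


def top_values(values, n=3):
    """The n most frequent values and their counts, most frequent first."""
    return dict(Counter(values).most_common(n))


print(numeric_summary([18.0, 30.0, 42.0]))
print(top_values(["A", "B", "A", "C", "B", "A", "D"]))
```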
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Download files
File details
Details for the file describecsv-0.2.0.tar.gz.
File metadata
- Download URL: describecsv-0.2.0.tar.gz
- Upload date:
- Size: 7.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf66e56847bc262574bc898a543eb226373519dabb3a776f147d244b06228f1a |
| MD5 | ef8c941d744836be6a98bcb2aabd7449 |
| BLAKE2b-256 | a66f824118a53c82303722b232dd21ab8a23754288b22cf8e63d7a62308a44f1 |
File details
Details for the file describecsv-0.2.0-py3-none-any.whl.
File metadata
- Download URL: describecsv-0.2.0-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf140344c20ebe913a4db06a91fdb7776f6f9f692f09a32bd6d99fffd9703c4b |
| MD5 | 6022c910ab07054f70944466fdd27577 |
| BLAKE2b-256 | 55af022a4cb89e4d97f88899c130d0044da6766a36169e0422848f99019fc6cd |