Fast & simple summary for large CSV files
csvinsight
Free software: MIT license
Documentation: https://csvinsight.readthedocs.io.
Features
Calculates basic stats for each column: max, min, mean length; number of non-empty values
Calculates exact number of unique values and the top 20 most frequent values
Supports non-orthogonal data (list fields, where a single field holds several values, e.g. red;yellow)
Works with very large files: does not load the entire CSV into memory
Fast splitting of CSVs into columns, one file per column
Multiprocessing-enabled
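The column-splitting idea above can be sketched in a few lines of Python. This is a minimal illustration, not csvinsight's actual implementation: the function name `split_columns`, the `N.txt` output naming, and the default `|` delimiter are all assumptions made for the example. It reads one row at a time, so memory use does not grow with the size of the input.

```python
import csv
import io
import os
import tempfile


def split_columns(csv_file, out_dir, delimiter="|"):
    """Stream csv_file row by row, writing each column to its own file.

    Hypothetical helper for illustration; not part of the csvinsight API.
    Returns the list of column headers (empty list for empty input).
    Assumes field values contain no embedded newlines.
    """
    reader = csv.reader(csv_file, delimiter=delimiter)
    header = next(reader, None)
    if header is None:  # empty input: nothing to split
        return []
    # One output file per column, named by column index (assumed naming).
    paths = [os.path.join(out_dir, "%d.txt" % i) for i in range(len(header))]
    outputs = [open(path, "w", newline="") for path in paths]
    try:
        for row in reader:
            # Only the current row is held in memory.
            for out, value in zip(outputs, row):
                out.write(value + "\n")
    finally:
        for out in outputs:
            out.close()
    return header


sample = io.StringIO(
    "name|age|fave_color\nAlexey|33|red;yellow\nBoris|31|blue\nValentina|0|\n"
)
with tempfile.TemporaryDirectory() as tmp:
    print(split_columns(sample, tmp))
    # prints: ['name', 'age', 'fave_color']
```

Once each column lives in its own file, the columns can be summarized independently, which is what makes processing them in parallel straightforward.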
Example Usage
Given a CSV file:
    bash-3.2$ cat tests/sampledata.csv
    name|age|fave_color
    Alexey|33|red;yellow
    Boris|31|blue
    Valentina|0|
you can obtain a CsvInsight report with:
    bash-3.2$ csvi tests/sampledata.csv --list-fields fave_color
    CSV Insight Report
    Total # Rows: 3
    Column counts:
        3 columns -> 3 rows

    Report Format:
    Column Number. Column Header -> Uniques: # ; Fills: # ; Fill Rate:
        Field Length: min #, max #, average:
        Top n field values -> Dupe Counts

    1. name -> Uniques: 3 ; Fills: 3 ; Fill Rate: 100.0%
        Field Length: min 5, max 9, avg 6.67
            Counts   Percent   Field Value
            1        33.33 %   Valentina
            1        33.33 %   Boris
            1        33.33 %   Alexey

    2. age -> Uniques: 3 ; Fills: 3 ; Fill Rate: 100.0%
        Field Length: min 1, max 2, avg 1.67
            Counts   Percent   Field Value
            1        33.33 %   33
            1        33.33 %   31
            1        33.33 %   0

    3. fave_color -> Uniques: 4 ; Fills: 3 ; Fill Rate: 75.0%
        Field Length: min 0, max 6, avg 3.25
            Counts   Percent   Field Value
            1        25.00 %   yellow
            1        25.00 %   red
            1        25.00 %   blue
            1        25.00 %   NULL
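The fave_color figures in the report are consistent with each item of a list field being counted as its own value: red;yellow contributes two values, and the empty field contributes one empty value, giving 4 values of which 3 are filled. The sketch below reproduces those numbers. It is an illustration only: `summarize_column` is a hypothetical helper, not part of the csvinsight API, and the `;` separator is inferred from the sample data.

```python
from collections import Counter


def summarize_column(values, list_field=False, separator=";", top_n=20):
    """Compute report-style statistics for one column.

    Hypothetical helper for illustration; not part of the csvinsight API.
    Assumes values is a non-empty list of strings.
    """
    if list_field:
        # Each item of a list field counts separately: "red;yellow" -> 2 values.
        values = [item for value in values for item in value.split(separator)]
    counts = Counter(values)
    lengths = [len(value) for value in values]
    fills = sum(1 for value in values if value)  # non-empty values
    return {
        "uniques": len(counts),
        "fills": fills,
        "fill_rate": 100.0 * fills / len(values),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": round(sum(lengths) / len(lengths), 2),
        "top": counts.most_common(top_n),
    }


stats = summarize_column(["red;yellow", "blue", ""], list_field=True)
print(stats["uniques"], stats["fills"], stats["fill_rate"], stats["avg_len"])
# prints: 4 3 75.0 3.25
```

This version keeps every value in memory through `Counter`, which is fine for small inputs; handling very large files would need a different counting strategy.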
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.2.3 (2017-12-09)
Fix bug: Unicode column names now work under Py2
0.2.2 (2017-12-04)
Fix bug: Unicode characters no longer break CsvInsight on Py2
0.2.1 (2017-11-27)
Fix bug: opening gzipped files with Py3 now works
0.2.0 (2017-11-25)
Split files using gsplit and process them in parallel for faster processing
No longer works with streams; works exclusively with files
Get rid of csvi_summarize and csvi_split entry points
Integrated plumbum for cleaner pipelines
Fixed issue #11: added support for more CSV parameters via the --dialect option
Fixed issue #10: reading from empty files no longer raises StopIteration
Fixed issue #8: use the correct link to the GitHub project in the documentation
Fixed issue #2: implemented in-memory mode for smaller files
0.1.0 (2017-10-29)
First release on PyPI.