Python command line application to summarize a CSV or TSV dataset.

dfsummarizer

License: MIT

This is an application to summarize the variables in a data frame. It will accept a CSV, TSV or XLS file and produce a table summarizing all columns individually.

This was motivated by the fact that the summary function for a pandas data frame (DataFrame.describe) ignores non-numeric columns by default, and omits several statistics that analysts commonly need: the number of unique values, the number of missing values, the minimum and maximum dates, and the minimum, mean and maximum string lengths.

Output can be generated as either LaTeX or Markdown.

Released and distributed via setuptools/PyPI/pip for Python 3.

Additional detail is available in the companion blog post.

Notes

The initial implementation can handle larger files by reading the data in chunks and building the statistics iteratively. All statistics are robust except for the estimate of the proportion of unique values, for which we use a simple implementation of the Flajolet-Martin algorithm based on the implementation by Javia Jinkal.

This review article by Phillip Gibbons gives a great overview of the alternatives.
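
To make the idea concrete, here is a minimal sketch of a Flajolet-Martin style estimator that is updated one chunk at a time. It is not the code used inside dfsummarizer: the class name FMSketch, the num_hashes parameter and the use of salted BLAKE2b as the hash family are all assumptions made for illustration. Each hash function keeps a bitmap recording the positions of the lowest set bit seen so far, and the estimate is derived from the average position of the lowest unset bit across the bitmaps.

```python
import hashlib

# Correction constant from Flajolet and Martin's analysis of the bitmap estimator.
PHI = 0.77351


class FMSketch:
    """Illustrative Flajolet-Martin sketch (hypothetical, not dfsummarizer's API)."""

    def __init__(self, num_hashes: int = 64):
        # One integer bitmap per hash function; averaging reduces variance.
        self.bitmaps = [0] * num_hashes

    @staticmethod
    def _hash(value: str, seed: int) -> int:
        # Salting BLAKE2b with the seed gives a family of 64-bit hash functions.
        digest = hashlib.blake2b(
            value.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little")

    def update(self, values) -> None:
        """Fold one chunk of values into the sketch; missing values are skipped."""
        for value in values:
            if value is None or value == "":
                continue
            for i in range(len(self.bitmaps)):
                h = self._hash(str(value), i)
                # Index of the least significant set bit of the hash.
                rho = (h & -h).bit_length() - 1 if h else 63
                self.bitmaps[i] |= 1 << rho

    def estimate(self) -> float:
        """Estimate the number of distinct values seen so far."""
        if not any(self.bitmaps):
            return 0.0
        total = 0
        for bitmap in self.bitmaps:
            # Position of the lowest bit that has never been set.
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        return 2 ** (total / len(self.bitmaps)) / PHI


# Three overlapping chunks containing 1000 distinct identifiers in total.
chunks = [
    [f"S{n:04d}" for n in range(0, 500)],
    [f"S{n:04d}" for n in range(250, 750)],
    [f"S{n:04d}" for n in range(500, 1000)],
]

sketch = FMSketch()
for chunk in chunks:
    sketch.update(chunk)

estimate = sketch.estimate()
print(f"estimated distinct values: {estimate:.0f} (true: 1000)")
```

Because the sketch only stores a few small bitmaps, memory use does not grow with the size of the file, which is what makes it suitable for chunked processing. The price is that the result is an approximation, so the unique value figures for large files should be read as estimates.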

Usage

You can use this application in multiple ways.

Use the runner:

./dfsummarizer-runner.py markdown data/test.csv > markdown_test.md

This command was used to generate the markdown output test file.

Invoke the directory as a package:

python -m dfsummarizer markdown data/test.csv

Or simply install the package and use the command line application directly.

Installation

Installation from the source tree:

python setup.py install

(or via pip from PyPI):

pip install dfsummarizer

Now, the dfsummarizer command is available:

dfsummarizer markdown test.csv

This will produce a markdown table summarizing the contents of the CSV file test.csv.

Acknowledgements

Python package built using the bootstrap cmdline template by jgehrcke

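| Name | Type | Unique Vals | Nulls | Mode | Min | Mean | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| id | Char | 6 | 0.0% | S001 | 4 | 4.0 | 4 |
| opening | Date | 6 | 0.0% | 2019-01-01 00:00:00 | 2019-01-01 | 2019-04-18 | 2019-07-12 |
| first | Bool | 2 | 16.7% | NO | 0.0 | 0.4 | 1.0 |
| last | Bool | 2 | 50.0% | NaN | 0 | 0.333 | 1 |
| state | Char | 3 | 16.7% | NSW | 3.0 | 3.0 | 3.0 |
| balance | Float | 5 | 0.0% | 500.0 | 200.0 | 1093.55 | 4230.9 |
| duration | Float | 3 | 33.3% | 24.0 | 12.0 | 21.0 | 24.0 |
| years | Int | 3 | 0.0% | 2 | 2 | 3.0 | 4 |
| flag | Float | 2 | 66.7% | NaN | 1.0 | 1.0 | 1.0 |
| comments | Char | 6 | 0.0% | Combined savings account | 9 | 21.167 | 35 |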
