
Analyze delimiter-separated values files


Installation

Depending on whether you want only this tool, the full set of PNU tools, or PNU plus a selection of additional third-party tools, use one of these commands:

pip install pnu-adsv
pip install PNU
pip install pytnix

ADSV(1)

NAME

adsv - Analyze delimiter-separated values files

SYNOPSIS

adsv [-d|--delimiter CHAR] [-e|--encoding STRING] [-f|--fields LIST] [-F|--flatten] [-h|--hide INT] [-m|--min INT] [-M|--max INT] [-t|--top INT] [--debug] [--help|-?] [--version] [--] filename [...]

DESCRIPTION

The adsv utility analyzes delimiter-separated values files, such as Comma-Separated Values (.csv) or Tab-Separated Values (.tsv) files, and either prints information about their structure and the data in each of their fields, or prints a selection of fields in the order requested.

The information gathered is:

  • for the file:
    • the character set encoding
    • the CSV dialect (the characters used for delimiting, quoting, escaping and line termination, plus whether double quoting is used)
    • whether a header line is present
    • the number of lines and fields
  • for each field:
    • its number and header
    • the number of distinct values
    • the type of the values (strings, integers, floating-point numbers, complex numbers, or dates and times, whatever their format)
    • the values by descending count
    • the range of values, in ascending order according to the detected type (useful for numbers and dates)
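
As an illustration of the file-level and field-level information listed above, here is a minimal sketch built on Python's standard csv.Sniffer and collections.Counter. It is not adsv's actual implementation: the sample data and the describe() helper are hypothetical, and the header detection is only a heuristic that can guess wrong.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, not taken from the adsv documentation.
SAMPLE = "name;city;age\nAnn;Paris;31\nBob;Lyon;27\nCid;Paris;45\n"


def describe(text):
    """Return (delimiter, has_header, per-field value counts) for DSV text."""
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(text)          # guess delimiter, quoting, ...
    has_header = sniffer.has_header(text)  # heuristic, can be wrong
    rows = list(csv.reader(io.StringIO(text), dialect))
    data = rows[1:] if has_header else rows
    width = max(len(row) for row in rows)
    # One Counter per field: distinct values and how often each occurs.
    counts = [
        Counter(row[i] for row in data if i < len(row))
        for i in range(width)
    ]
    return dialect.delimiter, has_header, counts


delimiter, has_header, counts = describe(SAMPLE)
print("delimiter:", repr(delimiter), "header:", has_header)
for number, counter in enumerate(counts, start=1):
    # most_common() lists the values by descending count.
    print(number, len(counter), "distinct values:", counter.most_common())
```

Counter.most_common() gives the "values by descending count" view directly, and len(counter) gives the number of distinct values.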

When analyzing a DSV dataset, this provides a quick and automated way of getting global information about the contents and of exploring any oddities.
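
To show why a range sorted "using the detected type" is more useful than a plain text sort, here is a rough sketch of type detection. It is an assumption about how such detection could work, not adsv's own code: the CONVERTERS table, column_type() and typed_range() are hypothetical names, and only ISO 8601 dates are recognized here, whereas adsv is described as accepting dates in any format.

```python
from datetime import datetime

# Converters tried in order, from the narrowest type to the widest.
CONVERTERS = (
    ("integer", int),
    ("float", float),
    ("complex", complex),
    ("datetime", datetime.fromisoformat),  # ISO 8601 only (assumption)
)


def column_type(values):
    """Return (type name, converted values) for a list of strings."""
    for name, convert in CONVERTERS:
        try:
            converted = [convert(value) for value in values]
        except ValueError:
            continue  # at least one value does not fit this type
        return name, converted
    return "string", list(values)


def typed_range(values):
    """Return (type name, (lowest, highest)) for a non-empty list."""
    name, converted = column_type(values)
    if name == "complex":  # complex numbers have no ordering
        return name, None
    return name, (min(converted), max(converted))


print(typed_range(["10", "9", "100"]))
print(typed_range(["2024-03-01", "2023-12-31"]))
```

Sorted as text, "10" and "100" would come before "9"; converted to integers, the range is 9 to 100.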

There are options:

  • to control and limit what is printed (-h|--hide, -m|--min, -M|--max and -t|--top),
  • to bypass (or correct) the automatic detection of the character set encoding and delimiter (-e|--encoding, -d|--delimiter):
    • character set detection can take a long time with big files, so if you know that the file is encoded in "Windows-1252" or "utf-8", it is quicker to say so.

If you use the -f|--fields option, the file analysis is skipped and the selected fields are printed instead, in the order requested, using the detected delimiting, quoting, escaping and line-terminating characters.
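
The field list syntax (for example 6,2-4,1, with fields numbered from 1) can be sketched as follows. This is only an illustration of the selection logic: parse_field_list() and select_fields() are hypothetical helpers, and how adsv itself handles open-ended ranges or out-of-range numbers is not covered here.

```python
def parse_field_list(spec):
    """Turn a list such as "6,2-4,1" into zero-based column indexes."""
    indexes = []
    for part in spec.split(","):
        if "-" in part:
            # "2-4" stands for fields 2, 3 and 4.
            first, last = (int(number) for number in part.split("-", 1))
            indexes.extend(range(first - 1, last))
        else:
            indexes.append(int(part) - 1)
    return indexes


def select_fields(row, indexes):
    """Return the requested fields of one row, in the requested order."""
    return [row[index] for index in indexes]


indexes = parse_field_list("6,2-4,1")
print(select_fields(["a", "b", "c", "d", "e", "f"], indexes))
```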

If you encounter multi-line fields and want to "flatten" them to single lines, use the -F|--flatten option.
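
For readers unfamiliar with multi-line fields, the sketch below shows one way to flatten them with Python's csv module. It is an assumption-laden illustration, not adsv's code: flatten_rows() is a hypothetical helper, and joining the lines with a single space is a choice made here, since the replacement that adsv uses is not documented in this page.

```python
import csv
import io


def flatten_rows(text, delimiter=","):
    """Yield rows whose fields have their line breaks replaced by spaces."""
    # newline="" leaves embedded line breaks untouched for the csv module.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    for row in reader:
        yield [" ".join(field.splitlines()) for field in row]


# Hypothetical data: the second record has a quoted two-line field.
TEXT = 'id,note\r\n1,"first line\nsecond line"\r\n'
for row in flatten_rows(TEXT):
    print(row)
```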

OPTIONS

Options                Use
-d|--delimiter CHAR    Specify delimiter to be CHAR
-e|--encoding STRING   Specify charset encoding to be STRING (because detecting encoding can take a long time!)
-f|--fields LIST       Extract the values of the LISTed fields in the given order (e.g. 6,2-4,1, with fields numbered from 1)
-F|--flatten           Make multi-line fields single line
-h|--hide INT          Hide the display of distinct values above INT % (default is 20%)
-m|--min INT           Only display distinct values whose count >= INT (default is to display all distinct values)
-M|--max INT           Only display INT lines of distinct values (default is to display all distinct values, within the hide limit)
-t|--top INT           Only display the top/bottom INT lines of values (default is to display the 5 bottom and top lines)
--debug                Enable debug mode
--help|-?              Print usage and a short help message and exit
--version              Print version and exit
--                     Options processing terminator

ENVIRONMENT

The ADSV_DEBUG environment variable can also be set to any value to enable debug mode.
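
"Set to any value" suggests that only the presence of the variable matters. A minimal sketch of that convention, where debug_enabled() is a hypothetical helper rather than a function taken from adsv:

```python
import os


def debug_enabled(cli_flag=False, environ=os.environ):
    """Debug mode is on if --debug was given or ADSV_DEBUG is defined."""
    # Only the presence of the variable is tested, not its value.
    return cli_flag or "ADSV_DEBUG" in environ


print(debug_enabled())
```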

EXIT STATUS

The adsv utility exits 0 on success, and >0 if an error occurs.

SEE ALSO

cut(1), file(1)

STANDARDS

The adsv utility is not a standard UNIX command.

This implementation tries to follow the PEP 8 style guide for Python code.

The DSV dialects that can be handled are those compatible with RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.

PORTABILITY

Tested OK under Windows.

HISTORY

This implementation was made for the PNU project.

I do this kind of analysis with each dataset I have to work with. The last time I did so, I decided that it was about time to fully automate the process, especially as I was working with fields containing multi-line values.

LICENSE

It is available under the 3-clause BSD license.

AUTHORS

Hubert Tournier

CAVEATS

Using "Sep=X" as a first line in order to set the X character as a delimiter is not supported.
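
One possible workaround is to strip such a line before the analysis and pass the character it names with the -d|--delimiter option. The sketch below only illustrates that preprocessing step; split_sep_line() is a hypothetical helper and is not part of adsv.

```python
import re


def split_sep_line(text):
    """Return (delimiter, remaining text); delimiter is None if absent."""
    first, _, rest = text.partition("\n")
    match = re.fullmatch(r"sep=(.)", first.rstrip("\r"), flags=re.IGNORECASE)
    if match:
        return match.group(1), rest
    return None, text


print(split_sep_line("sep=;\r\na;b\r\n1;2\r\n"))
```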

Comment lines inside the data (as found, for example, in /etc/passwd files under Unix) are not supported either, but they are not part of any recognized DSV dialect anyway.
