
A variety of smart tools to make analytics easy

Project description

smart_tools: tools to make data analysis easy

smart_tools contains a collection of command-line tools developed in Python. It aims to make common data-analysis activities easier.

Where to get it

The source code is currently hosted on GitHub at: https://github.com/arcot23/smart_tools

Binary installers for the released version are available at the Python Package Index (PyPI)

# PyPI
python -m pip install smart-tools

Dependencies

How to use command-line tools

To get help, run the respective executable with the -h argument from your terminal; for example, dissector can be run as dissector.exe -h. Run the command with its positional arguments, which are mandatory, and review the optional arguments, e.g., dissector.exe dir file*.txt.
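
For example, a first run might look like this (the directory and file pattern below are only illustrative; --to is one of the documented optional arguments):

    dissector.exe -h
    dissector.exe .\data file*.txt --to csv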

To access these command-line tools easily, add the executables' directory to the PATH environment variable (on Windows, $Env:PATH). Most tools also depend on a config.yaml file for certain additional settings.

dissector.exe
morpher.exe
comparator.exe
aggregator.exe
fusioner.exe
└── config/
    ├── dissector_config.yaml
    ├── morpher_config.yaml
    ├── comparator_config.yaml
    ├── aggregator_config.yaml
    ├── fusioner_config.yaml
    └── ...
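
As a sketch, assuming the executables were installed to C:\tools\smart_tools (adjust to your actual location), the directory can be added to PATH for the current PowerShell session as follows:

    # Append the smart_tools executable directory to PATH (current session only)
    $Env:PATH += ";C:\tools\smart_tools"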

All command-line tools take an input and generate an output. The input is typically a directory dir together with a file or files file. The output is created under dir and comprises an output directory and output files. dir can be a path relative to where the command is run, or an absolute path. The folder hierarchy listed below shows the structure.

dir
├── file1.txt
├── file2.txt
├── ...
├── .d/
│   └── dissector_result.xlsx
├── .m/
│   └── morpher_result.xlsx
├── .c/
│   └── comparator_result.xlsx
├── .a/
│   └── aggregator_result.xlsx
└── .f/
    └── fusioner_result.xlsx
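
For example, assuming .\data holds file1.txt and file2.txt, a run such as the following would write its results under .\data in the tool's own output subdirectory (.d in the case of dissector):

    dissector .\data file*.txt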

Dissector

dissector.exe is a command-line tool to analyze CSV files. The input can be a single file or multiple files from a directory dir that share a common column separator sep. The dissected results can be generated as an Excel file (xlsx) or as text (json or csv). By default, the analysis is run on the entire content of the file, i.e., without any filters; slicers can be used to slice the data and run the analysis on subsets.

usage: dissector.exe [-h] [--to {xlsx,json,csv}] [--sep SEP]
                    [--slicers [SLICERS ...]] [--nsample NSAMPLE]
                    [--outfile OUTFILE] [--config CONFIG]
                    dir file

positional arguments:
  dir                   Input directory
  file                  Input file (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: xlsx)
  --sep SEP             Column separator (default: ,)
  --slicers [SLICERS ...]
                        Informs how to slice data (default: for no slicing)
  --nsample NSAMPLE     Number of samples (default: 10)
  --outfile OUTFILE     Output file name (default: dissect_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\dissector_config.yaml`)

The output gives the following information for each column in the input file(s).

  • column: column name.
  • strlen: minimum and maximum string length.
  • nnull: count of NANs and empty strings.
  • nrow: number of rows.
  • nunique: number of unique values.
  • nvalue: number of rows with values.
  • freq: frequency distribution of top n values. n is configured in dissector_config.yaml.
  • sample: a sample of top n values. n is configured in dissector_config.yaml.
  • symbols: non-alphanumeric characters, i.e., characters not in [a-zA-Z0-9].
  • n: column order.
  • filename: name of the input file from where the column was picked.
  • filetype: file type with which the file is associated (e.g., csv).

The output also presents the following additional information:

  • slice: the slice used for the selection. Slices represent filter conditions that select subsets of rows within a dataset.
  • timestamp: last-modified timestamp of the input file.
  • hash: md5 hash of the input file.
  • size: file size of the input file in bytes.

Ensure that a YAML config file is present at .\config\dissector_config.yaml, relative to dissector.exe, before executing the command.

---
read_csv:
  skiprows: 0
  skipfooter: 0
  engine: 'python' # {'c', 'python', 'pyarrow'}
  encoding: 'latin-1' # {'utf-8', 'latin-1'}
  quotechar: '"'
  on_bad_lines: 'warn' # {'error', 'warn', 'skip'}
  dtype: 'str'
  keep_default_na: false
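
If the config file is kept elsewhere, the --config option shown in the usage above can point to it explicitly; the path below is only illustrative:

    dissector .\temp *.csv --config .\my_configs\dissector_config.yaml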

Examples

  • Fetch *.csv from .\temp and dissect them with , as column separator.

    dissector .\temp *.csv --sep ,

  • Fetch myfile.text from c:\temp and dissect the file with ; as column separator.

    dissector c:\temp myfile.text --sep ';'

  • Fetch myfile.text from c:\temp and dissect the file with ; as column separator by slicing the data with a filter on COLUMN1 == 'VALUE' and also without filtering any.

    dissector c:\temp myfile.text --sep ';' --slicers "" "COLUMN1 == 'VALUE'"

  • Fetch myfile.text from c:\temp and dissect the file with TAB \t as column separator by slicing the data with a filter on a column name that has a space in it, COLUMN 1 == 'VALUE'.

    dissector c:\temp myfile.text --sep '\t' --slicers "" "COLUMN 1 == 'VALUE'"

    Using PowerShell, you can read the arguments from a text file.

    Get-Content args.txt | ForEach-Object {
        $arguments = $_ -split '#'
        & dissector.exe $arguments
    }
    

    Here is a sample args.txt file.

    .\temp#*.csv#-s#,
    

Morpher

morpher.exe is a command-line tool to convert the format of a file, or of files in a directory, that share a common column separator. For example, it can convert a file delimited by sep in dir from csv to xlsx or from csv to json.

usage: morpher.exe [-h] [--sep SEP] [--replace] [--to {xlsx,json}] dir file

positional arguments:
  dir               Input directory
  file              Input file or files (wildcard)

optional arguments:
  -h, --help        show this help message and exit
  --sep SEP         Column separator (default: ,)
  --replace         Replace output file if it already exists (default: false)
  --to {xlsx,json}  Morph to xlsx or json (default: xlsx)
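
As a sketch, assuming comma-separated .csv files under .\temp, typical invocations could look like:

    morpher .\temp *.csv --to xlsx
    morpher .\temp data.csv --to json --replace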

Comparator

comparator.exe is a command-line tool to compare one file with another file.

usage: comparator.exe [-h] [-s SEP] [-t {xlsx,json,csv}] file1 file2

positional arguments:
  file1                 File to compare
  file2                 File to compare with

optional arguments:
  -h, --help            show this help message and exit
  -s SEP, --sep SEP     Column separator (default: `,`)
  -t {xlsx,json,csv}, --to {xlsx,json,csv}
                        Save result to xlsx or json or csv (default: `xlsx`)
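
A sketch of a typical comparison, assuming two comma-separated files in the current directory (file names are illustrative):

    comparator baseline.csv latest.csv --to xlsx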

Aggregator

aggregator.exe is a command-line tool to aggregate two or more files into one.

usage: aggregator.py [-h] [--sep SEP] [--to {xlsx,json,csv}]
                     [--outfile OUTFILE] [--config CONFIG]
                     dir file

positional arguments:
  dir                   Input directory
  file                  Input file or files (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --sep SEP             Column separator (default: `,`)
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: `xlsx`)
  --outfile OUTFILE     Output directory and file name (default:
                        .\.a\aggregated_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\aggregator_config.yaml`)
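
As an illustration, assuming comma-separated .csv files under .\temp, an aggregation run could look like:

    aggregator .\temp *.csv --to xlsx --outfile .\.a\aggregated_result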

Fusioner

fusioner.exe is a command-line tool that transforms a single input file using the ETL rules defined in its config file.

usage: fusioner.py [-h] [--sep SEP] [--outfile OUTFILE] [--config CONFIG] file

positional arguments:
  file               Input file

optional arguments:
  -h, --help         show this help message and exit
  --sep SEP          Column separator (default: ,)
  --outfile OUTFILE  Output directory and file name (default:
                     .\.f\fusioner_result)
  --config CONFIG    Config file for ETL (default:
                     `.\config\fusioner_config.toml`)
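
As a sketch, assuming a comma-separated input file and the default config location:

    fusioner .\temp\myfile.csv --sep ',' --config .\config\fusioner_config.toml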

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smart_tools-0.10.1.tar.gz (17.7 kB)

Uploaded Source

Built Distribution

smart_tools-0.10.1-py3-none-any.whl (24.9 kB)

Uploaded Python 3

File details

Details for the file smart_tools-0.10.1.tar.gz.

File metadata

  • Download URL: smart_tools-0.10.1.tar.gz
  • Upload date:
  • Size: 17.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for smart_tools-0.10.1.tar.gz
  SHA256       120834816cb17694b02b73ba71fa34b1ffa175e2490822da66d1e93a722241cd
  MD5          c69d449b4a48c5e4c5ae7fb49a26ccf7
  BLAKE2b-256  c873d2991952f19adb997f7fed29911d407b38381e440ae26054ae0dd1037df9

See more details on using hashes here.

File details

Details for the file smart_tools-0.10.1-py3-none-any.whl.

File metadata

  • Download URL: smart_tools-0.10.1-py3-none-any.whl
  • Upload date:
  • Size: 24.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for smart_tools-0.10.1-py3-none-any.whl
  SHA256       920dfec5edb4a00259ce3b14bdb952cd129a71d40a2a055c6510358b3556c4ac
  MD5          112caa06501c75568d2a6b57aa84f8ac
  BLAKE2b-256  4085a381a8c6e5b295f27556d3fbef03cccf845628c2477e67a0ae84e6e45e34

See more details on using hashes here.
