This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description

Datagristle is a toolbox of tough and flexible data connectors and analyzers. It’s kind of an interactive mix between ETL and data analysis optimized for rapid analysis and manipulation of a wide variety of data.

It’s neither an enterprise ETL tool, nor an enterprise analysis, reporting, or data mining tool. It’s intended to be an easily-adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities necessary to do 80% of the work. Its open source python codebase allows it to be easily extended to with custom code to handle that always challenging last 20%.

Current Status: Strong support for easy analysis, simple transformations of csv files, ability to create data dictionaries, and emerging data quality capabilities.

More info is on the DataGristle wiki here: https://github.com/kenfar/DataGristle/wiki

Next Steps:

  • attractive PDF output of gristle_determinator.py
  • metadata database population

Its objectives include:

  • multi-platform (unix, linux, mac os, windows with effort)
  • multi-language (primarily python)
  • free - no cripple-licensing
  • primary audience is programming data analysts - not non-technical analysts
  • primary environment is command-line rather than windows, graphical desktop or eclipse
  • extensible
  • allow a bi-directional iteration between ETL & data analysis
  • can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools.

Installation

Dependencies

  • Python 2.6 or Python 2.7

Mature Utilities Provided in This Release:

  • gristle_slicer
    • Used to extract a subset of columns and rows out of an input file.
  • gristle_freaker
    • Produces a frequency distribution of multiple columns from input file.
  • gristle_viewer
    • Shows one record from a file at a time - formatted based on metadata.
  • gristle_determinator
    • Identifies file formats, generates metadata, prints file analysis report
    • This is the most mature - and also used by the other utilities so that you generally do not need to enter file structure info.
  • gristle_validator
    • Validates csv files by confirming that all records have the right number of fields, and by apply a json schema full of requirements to each record.
  • gristle_dir_merger
    • Used to consolidate large directories with options to control matching criteria as well as matching actions.

gristle_validator

Splits a csv file into two separate files based on how records pass or fail
validation checks:
   - Field count - checks the number of fields in each record against the
     number required.  The correct number of fields can be provided in an
     argument or will default to using the number from the first record.
   - Schema - uses csv file requirements defined in a json-schema file for
     quality checking.  These requirements include the number of fields,
     and for each field - the type, min & max length, min & max value,
     whether or not it can be blank, existance within a list of valid
     values, and finally compliance with a regex pattern.

The output can just be the return code (0 for success, 1+ for errors), can
be some high level statistics, or can be the csv input records split between
good and erroneous files.  Output can also be limited to a random subset.

Examples:
   $ gristle_validator  sample.csv -f 3
         Prints all valid input rows to stdout, prints all records with
         other than 3 fields to stderr along with an extra final field that
         describes the error.
   $ gristle_validator  sample.csv
         Prints all valid input rows to stdout, prints all records with
         other than the same number of fields found on the first record to
         stderr along with an extra final field that describes the error.
   $ gristle_validator  sample.csv  -d '|' --hasheader
         Same comparison as above, but in this case the file was too small
         or complex for the pgm to automatically determine csv dialect, so
         we had to explicitly give that info to program.
   $ gristle_validator  sample.csv --outgood sample_good.csv --outerr sample_err.csv
         Same comparison as above, but explicitly splits good and bad data
         into separate files.
   $ gristle_validator  sample.csv --randomout 1
         Same comparison as above, but only writes a random 1% of data out.
   $ gristle_validator  sample.csv --silent
         Same comparison as above, but writes nothing out.  Exit code can be
         used to determine if any bad records were found.
   $ gristle_validator  sample.csv --validschema sample_schema.csv
         The above command checks both field count as well as validations
         described in the sample_schema.csv file.  Here's an example of what
         that file might look like:
            items:
                - title:            rowid
                  blank:            False
                  required:         True
                  dg_type:          integer
                  dg_minimum:       1
                  dg_maximum:       60
                - title:            start_date
                  blank:            False
                  minLength:        8
                  maxLength:        10
                  pattern:          '[0-9]*/[0-9]*/[1-2][0-9][0-9][0-9]'
                - title:            location
                  blank:            False
                  minLength:        2
                  maxLength:        2
                  enum:             ['ny','tx','ca','fl','wa','ga','al','mo']

gristle_slicer

Extracts subsets of input files based on user-specified columns and rows.
The input csv file can be piped into the program through stdin or identified
via a command line option.  The output will default to stdout, or redirected
to a filename via a command line option.

The columns and rows are specified using python list slicing syntax -
so individual columns or rows can be listed as can ranges.   Inclusion
or exclusion logic can be used - and even combined.

Examples:
   $ gristle_slicer sample.csv
                Prints all rows and columns
   $ gristle_slicer sample.csv -c":5, 10:15" -C 13
                Prints columns 0-4 and 10,11,12,14 for all records
   $ gristle_slicer sample.csv -C:-1
                Prints all columns except for the last for all records
   $ gristle_slicer sample.csv -c:5 -r-100
                Prints columns 0-4 for the last 100 records
   $ gristle_slicer sample.csv -c:5 -r-100 -d'|' --quoting=quote_all
                Prints columns 0-4 for the last 100 records, csv
                dialect info (delimiter, quoting) provided manually)
   $ cat sample.csv | gristle_slicer -c:5 -r-100 -d'|' --quoting=quote_all
                Prints columns 0-4 for the last 100 records, csv
                dialect info (delimiter, quoting) provided manually)

gristle_freaker

Creates a frequency distribution of values from columns of the input file
and prints it out in columns - the first being the unique key and the last
being the count of occurances.


Examples:
   $ gristle_freaker sample.csv -d '|'  -c 0
                Creates two columns from the input - the first with
                unique keys from column 0, the second with a count of
                how many times each exists.
   $ gristle_freaker sample.csv -d '|'  -c 0 --sortcol 1 --sortorder forward --writelimit 25
                In addition to what was described in the first example,
                this example adds sorting of the output by count ascending
                and just prints the first 25 entries.
   $ gristle_freaker sample.csv -d '|'  -c 0 --sampling_rate 3 --sampling_method interval
                In addition to what was described in the first example,
                this example adds a sampling in which it only references
                every third record.
   $ gristle_freaker sample.csv -d '|'  -c 0,1
                Creates three columns from the input - the first two
                with unique key combinations from columns 0 & 1, the
                third with the number of times each combination exists.
   $ gristle_freaker sample.csv -d '|'  -c -1
                Creates two columns from the input - the first with unique
                keys from the last column of the file (negative numbers
                wrap), then a second with the number of times each exists.
   $ gristle_freaker sample.csv -d '|'  --columntype all
                Creates two columns from the input - all columns combined
                into a key, then a second with the number of times each
                combination exists.
   $ gristle_freaker sample.csv -d '|'  --columntype each
                Unlike the other examples, this one performs a separate
                analysis for every single column of the file.  Each analysis
                produces three columns from the input - the first is a
                column number, second is a unique value from the column,
                and the third is the number of times that value appeared.
                This output is repeated for each column.

gristle_viewer

Displays a single record of a file, one field per line, with field names
displayed as labels to the left of the field values.  Also allows simple
navigation between records.

Examples:
   $ gristle_viewer sample.csv -r 3
                Presents the third record in the file with one field per line
                and field names from the header record as labels in the left
                column.
   $ gristle_viewer sample.csv -r 3  -d '|' -q quote_none
                In addition to what was described in the first example this
                adds explicit csv dialect overrides.

gristle_determinator

Analyzes the structures and contents of csv files in the end producing a
report of its findings.  It is intended to speed analysis of csv files by
automating the most common and frequently-performed analysis tasks.  It's
useful in both understanding the format and data and quickly spotting issues.

Examples:
   $ gristle_determinator japan_station_radiation.csv
                This command will analyze a file with radiation measurements
                from various Japanese radiation stations.

    File Structure:
    format type:       csv
    field cnt:         4
    record cnt:        100
    has header:        True
    delimiter:
    csv quoting:       False
    skipinitialspace:  False
    quoting:           QUOTE_NONE
    doublequote:       False
    quotechar:         "
    lineterminator:    '\n'
    escapechar:        None

    Field Analysis Progress:
    Analyzing field: 0
    Analyzing field: 1
    Analyzing field: 2
    Analyzing field: 3

    Fields Analysis Results:

        ------------------------------------------------------
        Name:             station_id
        Field Number:     0
        Wrong Field Cnt:  0
        Type:             timestamp
        Min:              1010000001
        Max:              1140000006
        Unique Values:    99
        Known Values:     99
        Top Values not shown - all values are unique

        ------------------------------------------------------
        Name:             datetime_utc
        Field Number:     1
        Wrong Field Cnt:  0
        Type:             timestamp
        Min:              2011-02-28 15:00:00
        Max:              2011-02-28 15:00:00
        Unique Values:    1
        Known Values:     1
        Top Values:
            2011-02-28 15:00:00                      x 99 occurrences

        ------------------------------------------------------
        Name:             sa
        Field Number:     2
        Wrong Field Cnt:  0
        Type:             integer
        Min:              -999
        Max:              52
        Unique Values:    35
        Known Values:     35
        Mean:             2.45454545455
        Median:           38.0
        Variance:         31470.2681359
        Std Dev:          177.398613681
        Top Values:
            41                                       x 7 occurrences
            42                                       x 7 occurrences
            39                                       x 6 occurrences
            37                                       x 5 occurrences
            46                                       x 5 occurrences
            17                                       x 4 occurrences
            38                                       x 4 occurrences
            40                                       x 4 occurrences
            45                                       x 4 occurrences
            44                                       x 4 occurrences

        ------------------------------------------------------
        Name:             ra
        Field Number:     3
        Wrong Field Cnt:  0
        Type:             integer
        Min:              -888
        Max:              0
        Unique Values:    2
        Known Values:     2
        Mean:             -556.121212121
        Median:           -888.0
        Variance:         184564.833792
        Std Dev:          429.610095077
        Top Values:
            -888                                     x 62 occurrences
            0                                        x 37 occurrences

gristle_metadata

Gristle_metadata provides a command-line interface to the metadata database.
It's mostly useful for scripts, but also useful for occasional direct
command-line access to the metadata.

Examples:
   $ gristle_metadata --table schema --action list
                Prints a list of all rows for the schema table.
   $ gristle_metadata --table element --action put --prompt
                Allows the user to input a row into the element table and
                prompts the user for all fields necessary.

gristle_md_reporter

Gristle_md_reporter allows the user to create data dictionary reports that
combine information about the collection and fields along with field value
descriptions and frequencies.

Examples:
   $ gristle_md_reporter --report datadictionary --collection_id 2
                Prints a data dictionary report of collection_id 2.
   $ gristle_md_reporter --report datadictionary --collection_name presidents
                Prints a data dictionary report of the president collection.
   $ gristle_md_reporter --report datadictionary --collection_id 2 --field_id 3
                Prints a data dictionary report of the president collection,
                only shows field-level information for field_id 3.

gristle_dir_merger

Gristle_dir_merger consolidates directory structures of files.  Is both fast
and flexible with a variety of options for choosing which file to use based
on full (name and md5) and partial matches (name only) .

Examples
   $ gristle_dir_merger /tmp/foo /data/foo
         - Compares source of /tmp/foo to dest of /data/foo.
         - Files will be consolidated into /data/foo, and deleted from /tmp/foo.
         - Comparison will be: match-on-name-and-md5 (default)
         - Full matches will use: keep_dest (default)
         - Partial matches will use: keep_newest (default)
         - Bottom line: this is what you normally want.
   $ gristle_dir_merger /tmp/foo /data/foo --dry-run
         - Same as the first example - except it only prints what it would do
           without actually doing it.
         - Bottom line: this is a good step to take prior to running it for real.
   $ gristle_dir_merger /tmp/foo /data/foo -r
         - Same as the first example - except it runs recursively through
           the directories.
   $ gristle_dir_merger /tmp/foo /data/foo --on-partial-match keep-biggest
         - Comparison will be: match-on-name-and-md5 (default)
         - Full matches will use: keep_dest (default)
         - Partial matches will use: keep_biggest (override)
         - Bottom line: this is a good combo if you know that some files
           have been modified on both source & dest, and newest isn't the best.
   $ gristle_dir_merger /tmp/foo /data/foo --match-on-name-only --on-full-match keep-source
         - Comparison will be: match-on-name-only (override)
         - Full matches will use: keep_source (override)
         - Bottom line: this is a good way to go if you have
           files that have changed in both directories, but always want to
           use the source files.

Licensing

  • Gristle uses the BSD license - see the separate LICENSE file for further information

v0.58 - 2014-08

  • gristle_dir_merger
    • initial addition to toolkit. Merges directories of files using a variety of matching criteria and matching actions.

v0.57 - 2014-07

  • gristle_processor
    • initial addition to toolkit. Provides ability to scan through directory structure recursively, and delete files that match config criteria.

v0.56 - 2014-03

  • gristle_determinator
    • added hasnoheader arg
    • fixed problem printing top_values on empty file with header
  • gristle_validator
    • added hasnoheader arg
  • gristle_freaker
    • added hasnoheader arg

v0.55 - 2014-02

  • gristle_determinator - fixed a few problems:
    • the ‘Top Values not shown - all unique’ message being truncated
    • floats not handled correctly for stddev & variance
    • quoted ints & floats not handled

v0.54 - 2014-02

  • gristle_validator - major updates to allow validation of csv files based on the json schema standard, with help from the Validictory module.

v0.53 - 2014-01

  • gristle_freaker - major updates to enable distributes on all columns to be automatically gathered through either (all or each) args. ‘All’ combines all columns into a single tuple prior to producing distribution. ‘Each’ creates a separate distribution for every column within the csv file.
  • travisci - added support and started using this testing service.
Release History

Release History

0.58

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.57

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.56

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.55

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.54

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.53

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.52

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.51

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.50

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.49

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.48

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.47

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.46

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.45

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
datagristle-0.58.tar.gz (427.8 kB) Copy SHA256 Checksum SHA256 Source Sep 21, 2014

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting