
A data ingestion tool for bigger-than-memory data sets.


Note

For the latest source, discussion, etc., please visit the GitLab repository.

data_digest: Quickly process bigger-than-memory data sets without a sweat!

This repository hosts the code for the data_digest tool, which computes simple statistics such as the mean, standard deviation, max value and histogram for data sets larger than a system's available memory. It is designed to let users process large data files on a laptop or small VM in a timely manner by taking advantage of all available CPU cores and memory, spilling to disk whenever necessary.

The main features of this tool are the following:

  • Simple and lean API to process large data files
  • Spills to disk if there is not sufficient memory available
  • Runs on as many CPU cores as the system has available
  • Can be used either programmatically or from the terminal

Requirements

  • Python 3.7+
  • Docker (optional)

Installation

You can install data_digest either with pip or from source.

Pip

Installing with pip is as simple as:

$ pip install data_digest

Install from Source

To install data_digest from source, clone the repository from GitLab:

$ git clone https://gitlab.com/farrajota/jungle.ai-challenge-data-digest
$ cd jungle.ai-challenge-data-digest
$ make install
or
$ python setup.py install

How-to use

Via Python

To start processing large data sets that don’t fit in the available memory of a laptop or VM, first import the data_digest() method from the package in Python; it computes statistics over data stored in CSV files:

>>> from data_digest import data_digest

Next, call the method with the filesystem path of the data file you want to process:

>>> output = data_digest('/path/to/file.csv')

The method loads the .csv file from disk, runs the computations with Dask, and returns a summary of the statistics computed for each column of the data set. Because Dask is the processing backend, data sets that are bigger than memory can be processed within the available memory and, if needed, disk is used as temporary storage for intermediate results, allowing for effective use of resources.
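For intuition, this is the kind of out-of-core pipeline Dask enables. The snippet below is a minimal sketch of the general approach, not the tool's actual internals; data_digest hides this behind the single data_digest() call:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('/path/to/file.csv')  # lazy, partitioned read; nothing is loaded into memory yet
>>> df.mean().compute()                    # runs partition by partition across the available CPU cores
>>> df.max().compute()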

Furthermore, you can select which metrics to compute by enabling or disabling them when calling data_digest(). By default, all metrics are disabled. To select a metric or set of metrics to compute, say only the mean and max values of each column of the data set, do the following:

>>> output = data_digest('/path/to/file.csv',
                         compute_mean=True,
                         compute_max=True)

This will compute the mean and max values of the data set. You can enable or disable any metric and get exactly what you need.
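The CLI described below also exposes standard deviation and histogram metrics. Assuming the Python API mirrors those flags with compute_stddev and compute_histogram keyword arguments (these names are not shown in this README, so treat them as an assumption and check the documentation), a fuller call could look like:

>>> output = data_digest('/path/to/file.csv',
                         compute_mean=True,
                         compute_max=True,
                         compute_stddev=True,    # assumed keyword, mirrors the --stddev CLI flag
                         compute_histogram=10)   # assumed keyword, mirrors --histogram INTEGER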

Via Terminal

You can perform the same operations as the data_digest Python module from the terminal. When the package is installed, a CLI is also installed on your system, so you can start using the tool with little to no effort by calling data-digest in the terminal.

To check all available options, simply do:

$ data-digest --help
Usage: data-digest [OPTIONS] PATH_TO_DATASET

Quickly process bigger-than-memory data sets without a sweat!

Options:
--mean               Compute the mean value for each column of the dataset.
--stddev             Compute the standard deviation for each column of the
                     dataset.
--max                Compute the max value for each column of the dataset.
--histogram INTEGER  Compute the histogram for each column of the dataset.
--output-path TEXT   Store the output report into a json file
--help               Show this message and exit.

You can see that it offers the same options as the data_digest() method in the data_digest module. As in the last example of the previous section, we can compute the mean and max values of each column of a data set by typing the following command in the terminal:

$ data-digest --mean --max </path/to/file.csv>

So, computing what you need is as simple as giving the path of the data file, selecting the metrics you want to compute, and letting the tool do the rest.
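For example, combining the documented flags, the following computes every metric with a 10-bin histogram (the INTEGER argument is assumed here to be the number of bins) and writes the report to a JSON file named report.json (a name chosen only for illustration):

$ data-digest --mean --stddev --max --histogram 10 --output-path report.json </path/to/file.csv>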

Documentation

Documentation for the package can be found here.

Running it on a Docker container

You can easily run the package in a Docker container with limited memory. First, build the Docker image with the package installed:

docker build -t <docker_image_name> -f docker/dockerfile .

After the image is built, create a directory in your filesystem and put the data file you want to process inside it; you will need to mount that directory as a volume when running the Docker container. For example, create a temporary folder in the repo’s directory:

mkdir -p tmp

Then, run the following command to start a Docker container and process the data file:

docker run --name my_data_digest_512mb \
    --rm \
    -v $(pwd)/tmp:/home/temp_cache \
    --memory="512m" \
    <docker_image_name> /home/temp_cache/<filename.csv> --mean  # computes only the mean metric

This command processes the data set using only 512 MB of memory and, depending on its size, it may take a few minutes to complete. You can change the amount of memory assigned to the container to test different settings.
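For instance, to retry the same run with 1 GB of memory instead, only the --memory flag changes (the image name and file placeholders are the same as above):

docker run --rm -v $(pwd)/tmp:/home/temp_cache --memory="1g" <docker_image_name> /home/temp_cache/<filename.csv> --mean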

Alternatively, you can perform these steps using the following commands:

  1. Build the docker image

    make docker-build
    
  2. Run the previous example of computing the mean metric for a data set

    make docker-run FILE_PATH=/home/temp_cache/<filename.csv>
    

License

MIT License
