A data ingestion tool for bigger-than-memory data sets.
For the latest source, discussion, etc., please visit the GitLab repository.
data_digest: Quickly process bigger-than-memory data sets without a sweat!
This repository hosts the code for the data_digest tool, which computes simple statistics such as the mean, standard deviation, max value, and histogram for data sets larger than the memory available on a system. It was designed to allow users to process large data files on a laptop or small VM in a timely manner by taking advantage of all available CPU cores and memory, spilling to disk whenever necessary.
The main features of this tool are the following:
- Simple and lean API to process large data files
- Spills to disk if there is not sufficient memory available
- Runs on as many CPU cores as the system has available for use
- You can use it either programmatically or via a terminal
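The out-of-core approach behind these features follows a standard pattern: compute small, mergeable partial aggregates per chunk, then combine them into a final result. Here is a minimal illustrative sketch of that pattern (hypothetical names, not the tool's actual internals):

```python
# Illustrative sketch of the out-of-core pattern data_digest relies on:
# compute tiny, mergeable summaries per chunk, then combine them.
# All names here are hypothetical; this is NOT the tool's actual code.

def partial_stats(chunk):
    """Aggregate one in-memory chunk into a small summary."""
    return {"count": len(chunk), "sum": sum(chunk), "max": max(chunk)}

def merge(a, b):
    """Combine two partial summaries; the order does not matter."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "max": max(a["max"], b["max"]),
    }

def digest(chunks):
    # A backend like Dask would compute these partials in parallel,
    # one chunk per task, across all available CPU cores.
    parts = [partial_stats(c) for c in chunks]
    total = parts[0]
    for p in parts[1:]:
        total = merge(total, p)
    total["mean"] = total["sum"] / total["count"]
    return total

# Each chunk fits in memory even when the whole data set does not.
print(digest([[1, 2, 3], [4, 5], [6]]))
```

Because the summaries are small and mergeable, only one chunk plus a handful of counters ever needs to be resident in memory at a time.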
Requirements:

- Python 3.7+
- Docker (optional)
You can install data_digest using pip or from source.

Installing with pip is as simple as:
$ pip install data_digest
Install from Source
To install data_digest from source, clone the repository from GitLab:
$ git clone https://gitlab.com/farrajota/jungle.ai-challenge-data-digest
$ cd jungle.ai-challenge-data-digest
$ make install

or

$ python setup.py install
To start processing large data sets that don’t fit in the available memory of a laptop or VM with data_digest, first import the data_digest() method in Python, which computes statistics over data stored in CSV files:
>>> from data_digest import data_digest
Next, pass this method the filesystem path of the file you want to process:
>>> output = data_digest('/path/to/file.csv')
The method loads the .csv file from disk, processes the data using Dask to execute the computations, and returns a summary of the statistics computed for each column of the data set. Because Dask is the processing backend, data sets bigger than memory can be processed within the available memory and, if needed, disk can serve as temporary storage for computations, allowing for effective use of resources.
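For intuition about how a statistic can be computed without holding the whole data set in memory, here is a single-pass sketch using Welford's online algorithm for the mean and standard deviation. This is only an illustration of the streaming idea; data_digest itself delegates such computations to Dask:

```python
import math

def streaming_mean_stddev(values):
    """Welford's online algorithm: one pass, O(1) memory.

    `values` can be any iterable, e.g. rows streamed from a CSV file,
    so the full data set never has to fit in memory at once.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    stddev = math.sqrt(m2 / count) if count else float("nan")
    return mean, stddev

mean, stddev = streaming_mean_stddev(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(mean, stddev)
```

The key property is that each value is seen exactly once and then discarded, which is what makes bigger-than-memory processing possible.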
Furthermore, you can select which metrics to compute by enabling or disabling them when calling data_digest(). By default, all metrics are disabled when calling the main method. If you want to compute only a subset of metrics, say the mean and max values of each column of the data set, here is how you can do it:
>>> output = data_digest('/path/to/file.csv', compute_mean=True, compute_max=True)
This will compute the mean and max values of the data set. You can enable or disable any metric you want and get exactly what you need.
You can perform the same operations from the terminal as with the data_digest module in Python. When you install the package, a CLI is also installed on your system, which lets you start using the tool with little to no effort by calling data-digest in the terminal.
To list all available options, simply do:
$ data-digest --help
Usage: data-digest [OPTIONS] PATH_TO_DATASET

  Quickly process bigger-than-memory data sets without a sweat!

Options:
  --mean               Compute the mean value for each column of the dataset.
  --stddev             Compute the standard deviation for each column of the dataset.
  --max                Compute the max value for each column of the dataset.
  --histogram INTEGER  Compute the histogram for each column of the dataset.
  --output-path TEXT   Store the output report into a json file
  --help               Show this message and exit.
You can see that it has the same options as the data_digest() method in the data_digest module. As in the previous section's example, we can compute the mean and max values of each column of a data set by typing the following command in the terminal:
$ data-digest --mean --max </path/to/file.csv>
So, computing what you need is as simple as giving the path of the data file, selecting the metrics you want to compute, and letting the tool do the rest.
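The --histogram option takes an integer argument, presumably the number of bins (an assumption here, not confirmed by the help text). A fixed-bin histogram is another statistic that can be computed in a single streaming pass; here is an illustrative sketch, not the tool's actual implementation:

```python
def fixed_bin_histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins over [lo, hi).

    Works in one pass over the data, so it suits values streamed from
    disk. Bin edges must be known up front; values outside [lo, hi)
    are clamped into the edge bins for simplicity.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in values:
        i = int((x - lo) / width)
        i = min(max(i, 0), n_bins - 1)  # clamp out-of-range values
        counts[i] += 1
    return counts

print(fixed_bin_histogram([0.1, 0.2, 0.5, 0.9, 1.5], n_bins=4, lo=0.0, hi=1.0))
```

In practice, computing a histogram over bigger-than-memory data also requires knowing (or estimating) the value range in advance, which is why bin parameters are typically fixed before the pass begins.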
Documentation for the package can be found here.
Running it on a Docker container
You can easily run the package in a Docker container with limited memory. First, build the Docker image with the package installed:
docker build -t <docker_image_name> -f docker/dockerfile .
After the image is built, create a directory in your filesystem and put the data file you want to process inside it; you will need to pass that directory's path as a volume when running the Docker container. For example, create a temporary folder in the repo’s directory:
mkdir -p tmp
Then, run the following command to start a Docker container that processes the data file:
docker run --name my_data_digest_512mb \
    --rm \
    -v $(PWD)/tmp:/home/temp_cache \
    --memory="512m" \
    <docker_image_name> /home/temp_cache/<filename.csv> --mean  # computes only the mean metric
This command will process the data set using only 512 MB of memory and, depending on the data set's size, it may take a few minutes to complete. You can change the amount of memory assigned to the container to test different settings.
Additionally, you can perform these steps with the following commands as well:

Build the Docker image

Run the previous example, computing the mean metric for a data set:
make docker-run FILE_PATH=/home/temp_cache/<filename.csv>
Download the file for your platform.
|Filename, size|File type|Python version|
|data_digest-0.0.9-py2.py3-none-any.whl (9.0 kB)|Wheel|py2.py3|
|data_digest-0.0.9.tar.gz (8.6 kB)|Source|None|