
A data processing orchestration tool.


A data processing orchestration tool. Crunch lets you visualize datasets, orchestrate and manage their processing online, and present the results to the world.

Description

Crunch coordinates three components. The first is a web application in the cloud. Crunch includes a Django app that can be added to any website built with the Django framework. The website lets users create what we call datasets. Each dataset corresponds to one collection of files and metadata that is designed to be run through the workflow. Each dataset has its own page on the website, which displays all the information about it.

The second component is data storage. The main use case for crunch is when your datasets are so numerous and so large that you cannot fit them all on the disk where you are doing your computation. With crunch, the data can be stored in any of the media storage options available with Django, such as Amazon S3, Google Cloud and many others.
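As an illustration of the storage component, a Django project might point its media storage at Amazon S3 through the separate django-storages package. This is only a sketch, not configuration taken from the crunch documentation: the bucket name and region are placeholders, and it assumes that crunch uses the project's default media storage and that django-storages and boto3 are installed.

```python
# settings.py (fragment): hypothetical values, adjust for your own project.
# Assumes the third-party django-storages package and its S3 backend.

# Django 4.2 and later configure storage backends through STORAGES.
# On older Django versions the equivalent setting is
# DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage".
STORAGES = {
    "default": {
        "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}

# django-storages settings; both values are placeholders.
AWS_STORAGE_BUCKET_NAME = "my-crunch-bucket"
AWS_S3_REGION_NAME = "ap-southeast-2"
```

Credentials are deliberately left out of this fragment; they would normally come from the environment rather than the settings file.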

The third component is the client, which runs wherever you are performing the computation. This could be a high-performance computing environment, a virtual machine in the cloud or just your own laptop. The user simply runs the crunch command-line tool. The tool communicates with the website to find out which dataset should be processed next, then copies that dataset from storage and saves it locally. It then processes the dataset using a pre-defined workflow provided by the user. When it is finished, it copies the data back to the storage together with the resulting files. At each stage the user can see the status on the website interface. If you run the crunch loop command, the client continually loops through the datasets until they are all finished. You can run as many clients in parallel as you have computing resources.
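The cycle described above (ask which dataset is next, copy it locally, run the workflow, copy the results back, update the status) can be sketched in plain Python. Everything in this sketch is hypothetical and stands in for the real website, storage and client: the Dataset class, the status names and the functions are illustrative assumptions, not part of the crunch API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset record on the website."""
    name: str
    files: Dict[str, str] = field(default_factory=dict)  # stand-in for remote storage
    status: str = "pending"


def next_dataset(datasets: List[Dataset]) -> Optional[Dataset]:
    """Stand-in for asking the website which dataset to process next."""
    for dataset in datasets:
        if dataset.status == "pending":
            return dataset
    return None


def process_all(
    datasets: List[Dataset],
    workflow: Callable[[Dict[str, str]], Dict[str, str]],
) -> int:
    """Loop until no pending datasets remain; return how many were processed."""
    processed = 0
    while (dataset := next_dataset(datasets)) is not None:
        dataset.status = "fetching"
        local_copy = dict(dataset.files)   # stand-in for copying from storage
        dataset.status = "processing"
        results = workflow(local_copy)     # user-provided workflow
        dataset.status = "uploading"
        dataset.files.update(results)      # stand-in for copying results back
        dataset.status = "finished"
        processed += 1
    return processed


def word_count_workflow(files: Dict[str, str]) -> Dict[str, str]:
    """Toy workflow: write a word count next to every input file."""
    return {f"{name}.count": str(len(text.split())) for name, text in files.items()}


if __name__ == "__main__":
    demo = [
        Dataset("a", {"notes.txt": "one two three"}),
        Dataset("b", {"log.txt": "x y"}),
    ]
    print(process_all(demo, word_count_workflow))
```

In the real system the status changes would be reported to the website so that they appear on each dataset's page; here they are only attributes on an in-memory object.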

Once each dataset is processed, the resulting files can be accessed via the website. Permissions for the website can be set dynamically, so you can restrict access to just your team while the data is being processed and then open access to the public once the results are ready.

Installation

The crunch app for a Django website and the command-line client are installed with pip:

pip install git+https://github.com/rbturnbull/django-crunch

Install the crunch app in the Django website project by adding it to the settings:

INSTALLED_APPS += [
    "crunch",
]

Then add the urls to your main urls.py:

from django.urls import include, path

urlpatterns += [
    path('crunch/', include('crunch.django.app.urls')),
]

The path crunch/ can be changed to be whatever you choose.

Usage

Create projects, datasets, items and attributes on the website using the HTML interface, the crunch command-line client or the Python API.

Upload initial data for each dataset as needed to the storage, either through the Crunch HTML interface or directly to the folder for each dataset on the storage.

Then process each dataset with the crunch client at the location where you are performing your computation. All datasets can be processed with a single command:

crunch loop
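Because several clients can run in parallel, one way to start more than one on the same machine is a small Python launcher. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that the crunch command is on the PATH and that any connection details the client needs are already configured, since the command is invoked exactly as shown above with no extra options.

```python
import subprocess
from typing import List


def client_commands(n_clients: int) -> List[List[str]]:
    """Build one 'crunch loop' command line per client (hypothetical helper)."""
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    return [["crunch", "loop"] for _ in range(n_clients)]


def run_clients(n_clients: int) -> List[int]:
    """Start the clients in parallel and wait for all of them to exit.

    Returns the exit code of each client process.
    """
    processes = [subprocess.Popen(command) for command in client_commands(n_clients)]
    return [process.wait() for process in processes]
```

Calling run_clients(4) would start four clients side by side and return once every one of them has exited. On a cluster, a job scheduler would usually take the place of this launcher.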

Credits

Robert Turnbull, Mar Quiroga and Simon Mutch from the Melbourne Data Analytics Platform.

Publication and citation details to follow.
