Data Processing implementation in Python

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 2
- Python :: 3

Project description

Dampr - Pure Python Data Processing

Dampr is intended for use as single machine data processing: it's natively out of core, supports map and reduce side joins, associative reduce combiners, and provides a high level interface for constructing Dataflow DAGs.

It's reasonably fast, easy to get started with, and scales linearly by core. It has no external dependencies, making it extremely lightweight and easy to install. It has reasonable REPL support for data analysis, though there are better tools for that job.

Features

Self-Contained: No external dependencies and simple to install
High-Level API: A high-level interface for constructing Dataflow DAGs
Out-Of-Core: Scales from hundreds of GB to TBs of data. No need to worry about Out of Memory errors!
Reasonably Fast: Scales linearly with the number of cores on the machine
Powerful: Provides a number of advanced joins and other functions for complex workflows
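To make the Out-Of-Core point concrete, here is a minimal sketch of one common out-of-core technique, an external merge sort, using only the standard library. It is not Dampr's implementation; the function names (spill_sorted_runs, read_run, external_sort) and the run_size parameter are hypothetical and chosen for illustration only.

```python
import heapq
import os
import tempfile


def _write_run(sorted_run, directory):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "w") as handle:
        for value in sorted_run:
            handle.write("%d\n" % value)
    return path


def spill_sorted_runs(values, run_size, directory):
    """Sort `values` in runs of at most `run_size` items, one file per run."""
    paths, run = [], []
    for value in values:
        run.append(value)
        if len(run) >= run_size:
            paths.append(_write_run(sorted(run), directory))
            run = []
    if run:
        paths.append(_write_run(sorted(run), directory))
    return paths


def read_run(path):
    """Stream integers back from a run file, one line at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, run_size):
    """Yield `values` in sorted order while holding at most one run in memory."""
    with tempfile.TemporaryDirectory() as directory:
        paths = spill_sorted_runs(values, run_size, directory)
        # heapq.merge keeps only one pending item per run in memory.
        for value in heapq.merge(*(read_run(p) for p in paths)):
            yield value


sorted_values = list(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
print(sorted_values)
```

Memory use is bounded by run_size rather than by the size of the input, which is the property that lets this style of processing avoid Out of Memory errors.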

Setup

Install from PyPI:

pip install dampr

Or, from a source checkout:

python setup.py install

API

See docs/dampr/index.html for the API reference.

Examples

Look at the examples directory for more complete examples.

The tests are also intended to be fairly readable. You can view them in the tests directory.

Example - Word Count

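import sys

from dampr import Dampr

def main(fname):

    # Count the words on each line, group every count under one key, and sum them.
    wc = Dampr.text(fname) \
            .map(lambda v: len(v.split())) \
            .a_group_by(lambda x: 1) \
            .sum()

    # Execute the pipeline; results iterate as (key, value) pairs.
    results = wc.run("word-count")
    for k, v in results:
        print("Word Count:", v)

    # Clean up the results once they have been read.
    results.delete()

if __name__ == '__main__':
    main(sys.argv[1])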
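The example above relies on an associative reduction (a sum). As a rough illustration of why associativity matters, the following standard-library sketch computes the same total by reducing chunks independently and then merging the partial results. It does not use Dampr and is not how Dampr is implemented; partial_sum and chunked are hypothetical helper names.

```python
from functools import reduce


def partial_sum(lines):
    """Map and combine one chunk: total number of words in these lines."""
    return sum(len(line.split()) for line in lines)


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


lines = ["the quick brown fox", "jumps over", "the lazy dog"]

# Each chunk could be handled by a separate worker.
partials = [partial_sum(chunk) for chunk in chunked(lines, 2)]

# Addition is associative, so partial totals can be merged in any grouping.
total = reduce(lambda a, b: a + b, partials, 0)
print("Word Count:", total)
```

Because the partial totals can be merged in any grouping, each worker only needs to pass along one number per chunk instead of every per-line count.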

Why not Dask for data processing?

Dask is great! I'd highly recommend it for fast analytics and datasets which don't need complex joins!

However.

Dask is really intended for in-memory computation and analytics-style processing via interfaces like DataFrames. While it does have a reasonable bag implementation for data processing, it's missing some important features such as joins across large datasets. I have routinely run into OOM errors when running more complicated processes on datasets larger than memory.

In that sense, Dampr is attempting to bridge the gap between complex data processing on a single machine and heavy-weight systems, with a focus on ease of use.
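For readers unfamiliar with the kind of join discussed here, this is a minimal in-memory sketch of a reduce-side inner join in plain Python. It is only an illustration of the idea, not Dampr's API or implementation: reduce_side_join and the sample data are hypothetical, and a real out-of-core system would bucket records on disk rather than in a dictionary.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Inner-join two iterables of (key, value) pairs on their keys."""
    groups = defaultdict(lambda: ([], []))
    # "Shuffle": bucket every record by key, keeping the two sides apart.
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # "Reduce": for each key, emit every left/right combination.
    for key in sorted(groups):
        left_values, right_values = groups[key]
        for lv in left_values:
            for rv in right_values:
                yield key, (lv, rv)


users = [(1, "ada"), (2, "grace"), (3, "alan")]
orders = [(1, "book"), (1, "pen"), (3, "laptop"), (4, "desk")]

joined = list(reduce_side_join(users, orders))
for key, pair in joined:
    print(key, pair)
```

Keys that appear on only one side (2 and 4 in the sample data) produce no output, which is the inner-join behaviour.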

Why not PySpark for data processing?

PySpark is great! I'd highly recommend it for extremely large datasets and cluster computation!

However.

PySpark requires a large amount of setup to really get going. It's the antithesis of "light-weight" and is really geared toward large-scale production deployments. I personally don't like it for proofs of concept or one-offs; it requires just a bit too much tuning to get what you need.


Release history

- 0.2.3: Jul 3, 2019 (this version)
- 0.2.1: May 11, 2019
- 0.2.0: May 11, 2019
- 0.1.7: Dec 26, 2018
- 0.1.6: Dec 26, 2018
- 0.1.5: Dec 21, 2018
- 0.1.4: Nov 25, 2018
- 0.1.3: Nov 25, 2018
- 0.1.2: Nov 25, 2018

Download files


Source Distribution

dampr-0.2.3.tar.gz (28.6 kB view details)

Uploaded Jul 3, 2019 Source

File details

Details for the file dampr-0.2.3.tar.gz.

File metadata

Download URL: dampr-0.2.3.tar.gz
Upload date: Jul 3, 2019
Size: 28.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for dampr-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`7475a6d684989222ef3609a3c15e595c506e67589023c22b5cd6d22cea63dd05`
MD5	`57d265a424dc3a48e4c76f1c905b2730`
BLAKE2b-256	`382ac7fe0257ad3dde3b967bc34c22876775616d814ae9b5e8dab488019e72e8`

