A nifty data processing framework, based on data packages

DataFlows

DataFlows is a simple and intuitive way of building data processing flows.

  • It's built for small-to-medium-data processing - data that fits on your hard drive, but is too big to load in Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster...
  • It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.
  • It's a pattern, not a heavy-weight framework: if you already have a bunch of download and extract scripts, this will be a natural fit.

Read more in the Features section below.

QuickStart

Install dataflows via pip install.

(If you are using a minimal UNIX OS, first run sudo apt install build-essential)

Then use the command-line interface to bootstrap a basic processing script for any remote data file:

# Install from PyPI
$ pip install dataflows

# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
#     Year           Ceremony  Award                                 Winner  Name                            Film
      (string)      (integer)  (string)                            (string)  (string)                        (string)
----  ----------  -----------  --------------------------------  ----------  ------------------------------  -------------------
1     1927/1928             1  Actor                                         Richard Barthelmess             The Noose
2     1927/1928             1  Actor                                      1  Emil Jannings                   The Last Command
3     1927/1928             1  Actress                                       Louise Dresser                  A Ship Comes In
4     1927/1928             1  Actress                                    1  Janet Gaynor                    7th Heaven
5     1927/1928             1  Actress                                       Gloria Swanson                  Sadie Thompson
6     1927/1928             1  Art Direction                                 Rochus Gliese                   Sunrise
7     1927/1928             1  Art Direction                              1  William Cameron Menzies         The Dove; Tempest
...

# dataflows creates a local package of the data and a reusable processing script which you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py

1 directory, 3 files

# Resulting 'Data Package' is super easy to use in Python
$ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}

# You can now run `academy_csv.py` to repeat the process,
# and modify it to add data processing steps
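
As an illustration of the kind of modification step mentioned above, here is a minimal sketch in plain Python. It works on the two sample rows shown in the interpreter session. The function name `winner_to_bool` is hypothetical and is not part of dataflows or of the generated script. A row-level function of this shape is the sort of thing that can be added as a processing step, but see the Tutorial for the exact conventions.

```python
# Sample rows copied from the interpreter session above.
rows = [
    {'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None,
     'Name': 'Richard Barthelmess', 'Film': 'The Noose'},
    {'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1',
     'Name': 'Emil Jannings', 'Film': 'The Last Command'},
]


def winner_to_bool(row):
    """Hypothetical row-level step: replace the 'Winner' marker with a boolean, in place."""
    row['Winner'] = row['Winner'] == '1'


# Apply the step to every row, one at a time.
for row in rows:
    winner_to_bool(row)

# Collect the names of the rows flagged as winners.
winners = [row['Name'] for row in rows if row['Winner']]
print(winners)
```

Keeping the step as an ordinary function of a single row means it can be tested on a couple of sample rows before it is wired into a full processing script.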

Features

  • Trivial to get started and easy to scale up
  • Set up and run from the command line in seconds ...
    • dataflows init => flow.py
    • python flow.py
  • Validate input (especially the source) quickly (non-zero length, right structure, etc.)
  • Supports caching data from source and even between steps
    • so that we can run and test quickly (retrieving is slow)
  • Run a test immediately and look at the output ...
    • Log, debug, rerun
  • Degrades to simple Python
  • Conventions over configuration
  • Log exceptions and / or terminate
  • The input to each stage is a Data Package or Data Resource (not a previous task)
    • Data package based and compatible
  • Processors can be a function (or a class) processing row-by-row, resource-by-resource or a full package
  • A pre-existing decent contrib library of Readers (Collectors) and Processors and Writers
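
The "pattern, not a framework" idea behind several of these features can be sketched in plain Python. The example below is an assumption-laden illustration, not the dataflows API: the helpers `validate`, `keep_award` and `run` are hypothetical names, and the sample rows are made up in the shape of the Academy Awards data. It shows row-by-row steps chained together, with a validation step that checks for a non-empty source and the expected fields.

```python
def validate(required_fields):
    """Build a step that fails when a field is missing or the source is empty."""
    def step(rows):
        count = 0
        for row in rows:
            missing = [f for f in required_fields if f not in row]
            if missing:
                raise ValueError(f'row {count + 1} is missing fields: {missing}')
            count += 1
            yield row
        if count == 0:
            raise ValueError('source produced no rows')
    return step


def keep_award(award):
    """Build a step that keeps only the rows for one award category."""
    def step(rows):
        return (row for row in rows if row['Award'] == award)
    return step


def run(source, *steps):
    """Chain the steps lazily over the source rows and collect the result."""
    rows = iter(source)
    for step in steps:
        rows = step(rows)
    return list(rows)


# Illustrative sample rows (not loaded from a real file).
source = [
    {'Award': 'Actor', 'Name': 'Emil Jannings'},
    {'Award': 'Actress', 'Name': 'Janet Gaynor'},
    {'Award': 'Actress', 'Name': 'Gloria Swanson'},
]

result = run(source, validate(['Award', 'Name']), keep_award('Actress'))
print([row['Name'] for row in result])
```

Because each step only consumes and yields rows, steps can be reordered, tested in isolation, or replaced without touching the rest of the chain.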

Learn more

Dive into the Tutorial for a deeper look at everything that dataflows can do. Also review the list of Built-in Processors, which includes an API reference for each of them.
