dataflows

A nifty data processing framework, based on data packages

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.6
Topic
- Software Development :: Libraries :: Python Modules

Project description

# DataFlows

DataFlows is a novel and intuitive way of building data processing flows.

- It's built for medium-data processing - data that fits on your hard drive, but is too big to load in Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster...
- It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.

## QuickStart / Tutorial

Let's start with the traditional 'hello, world' example:

```python
from dataflows import Flow

data = [
    {'data': 'Hello'},
    {'data': 'World'}
]

def lowerData(row):
    row['data'] = row['data'].lower()

f = Flow(
    data,
    lowerData
)
data, *_ = f.results()

print(data)

# -->
# [
#   [
#     {'data': 'hello'},
#     {'data': 'world'}
#   ]
# ]
```

This very simple flow takes a list of `dict`s and applies a row processing function on each one of them.

We can load data from a file instead:

```python
from dataflows import Flow, load

# beatles.csv:
# name,instrument
# john,guitar
# paul,bass
# george,guitar
# ringo,drums

def titleName(row):
    row['name'] = row['name'].title()

f = Flow(
    load('beatles.csv'),
    titleName
)
data, *_ = f.results()

print(data)

# -->
# [
#   [
#     {'name': 'John', 'instrument': 'guitar'},
#     {'name': 'Paul', 'instrument': 'bass'},
#     {'name': 'George', 'instrument': 'guitar'},
#     {'name': 'Ringo', 'instrument': 'drums'}
#   ]
# ]
```

The source file can be a CSV file, an Excel file or a JSON file. You can use a local file name or a URL for a file hosted somewhere on the web.

Data sources can be generators and not just lists or files. Let's take as an example a very simple scraper:

```python
from dataflows import Flow

from xml.etree import ElementTree
from urllib.request import urlopen

# Get from Wikipedia the population count for each country
def country_population():
    # Read the Wikipedia page and parse it using etree
    page = urlopen('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population').read()
    tree = ElementTree.fromstring(page)
    # Iterate on all tables, rows and cells
    for table in tree.findall('.//table'):
        if 'wikitable' in table.attrib.get('class', ''):
            for row in table.findall('tr'):
                cells = row.findall('td')
                if len(cells) > 3:
                    # If a matching row is found...
                    name = cells[1].find('.//a').attrib.get('title')
                    population = cells[2].text
                    # ... yield a row with the information
                    yield dict(
                        name=name,
                        population=population
                    )

f = Flow(
    country_population(),
)
data, *_ = f.results()

print(data)
# -->
# [
#   [
#     {'name': 'China', 'population': '1,391,090,000'},
#     {'name': 'India', 'population': '1,332,140,000'},
#     {'name': 'United States', 'population': '327,187,000'},
#     {'name': 'Indonesia', 'population': '261,890,900'},
#     ...
#   ]
# ]
```

This is nice, but we do prefer the numbers to be actual numbers and not strings.

In order to do that, let's simply define their type to be numeric:

```python
from dataflows import Flow, set_type

def country_population():
    # same as before
    ...

f = Flow(
    country_population(),
    set_type('population', type='number', groupChar=',')
)
data, *_ = f.results()

print(data)
# -->
# [
#   [
#     {'name': 'China', 'population': Decimal('1391090000')},
#     {'name': 'India', 'population': Decimal('1332140000')},
#     {'name': 'United States', 'population': Decimal('327187000')},
#     {'name': 'Indonesia', 'population': Decimal('261890900')},
#     ...
#   ]
# ]

```

Data is automatically converted to the correct native Python type.

Apart from data-types, it's also possible to set other constraints on the data. If the data fails validation (or does not fit the assigned data-type) an exception will be thrown - making this method highly effective for validating data and ensuring data quality.
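To make the idea of casting plus validation concrete, here is a minimal sketch in plain Python. It does not use or reproduce the DataFlows implementation: `cast_number` and `cast_rows` are hypothetical helpers that only mimic what `set_type(..., type='number', groupChar=',')` is described as doing above, namely stripping the grouping character, converting to a number, and raising on bad input.

```python
from decimal import Decimal, InvalidOperation

def cast_number(value, group_char=','):
    """Strip the grouping character and parse the rest as a Decimal."""
    try:
        return Decimal(value.replace(group_char, ''))
    except InvalidOperation:
        # Surface a clear error instead of silently passing bad data on
        raise ValueError(f'not a number: {value!r}') from None

def cast_rows(rows, field):
    """Lazily cast one field of every row; a bad value stops the stream."""
    for row in rows:
        yield {**row, field: cast_number(row[field])}

rows = [
    {'name': 'China', 'population': '1,391,090,000'},
    {'name': 'India', 'population': '1,332,140,000'},
]
casted = list(cast_rows(rows, 'population'))
print(casted[0]['population'])
```

Because `cast_rows` is a generator, a malformed value raises only when the offending row is reached, which is roughly how a streaming validator behaves.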

What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we'd like to store the results directly onto a hard drive - without having the machine's RAM limit in any way the amount of data we can process.

We do it by using _dump_ processors:

```python
from dataflows import Flow, set_type, dump_to_path

def country_population():
    # same as before
    ...

f = Flow(
    country_population(),
    set_type('population', type='number', groupChar=','),
    dump_to_path('country_population')
)
_ = f.process()

```

Running this code will create a local directory called `country_population`, containing two files:

```
├── country_population
│ ├── datapackage.json
│ └── res_1.csv
```

The CSV file - `res_1.csv` - is where the data is stored. The `datapackage.json` file is a metadata file, holding information about the data, including its schema.

We can now open the CSV file with any spreadsheet program or code library supporting the CSV format - or using one of the **data package** libraries out there, like so:

```python
from datapackage import Package
pkg = Package('country_population/datapackage.json')
it = pkg.resources[0].iter(keyed=True)
print(next(it))
# prints:
# {'name': 'China', 'population': Decimal('1391090000')}
```

Note how, using the data package meta-data, data-types are restored and there's no need to 're-parse' the data. This works with other types too, such as dates, booleans and even `list`s and `dict`s.
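The following sketch illustrates why a schema stored next to the CSV makes re-parsing unnecessary. It uses only the standard library and in-memory strings, and is not how the `datapackage` library is implemented. The descriptor here is a hypothetical, heavily trimmed stand-in for a real `datapackage.json`, and `CASTERS` handles just two type names.

```python
import csv
import io
import json
from decimal import Decimal

# Hypothetical minimal descriptor, loosely modelled on datapackage.json
descriptor = json.loads('''
{"resources": [{"path": "res_1.csv", "schema": {"fields": [
    {"name": "name", "type": "string"},
    {"name": "population", "type": "number"}
]}}]}
''')

# Stand-in for the contents of res_1.csv
csv_text = "name,population\nChina,1391090000\nIndia,1332140000\n"

# Map schema type names to Python constructors (only two types handled)
CASTERS = {'string': str, 'number': Decimal}

def iter_typed(text, schema):
    """Yield CSV rows as dicts, casting each value per the schema."""
    casters = {f['name']: CASTERS[f['type']] for f in schema['fields']}
    for row in csv.DictReader(io.StringIO(text)):
        yield {key: casters[key](value) for key, value in row.items()}

schema = descriptor['resources'][0]['schema']
first = next(iter_typed(csv_text, schema))
print(first)
```

The CSV itself holds only text; the types come entirely from the schema, which is the role the metadata file plays in the example above.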

So far we've seen how to load data, process it row by row, and then inspect the results or store them in a data package.

Let's see how we can do more complex processing by manipulating the entire data stream:

```python
from dataflows import Flow, set_type, dump_to_path

# Generate all triplets (a,b,c) so that 1 <= a <= b < c <= 20
def all_triplets():
    for a in range(1, 20):
        for b in range(a, 20):
            for c in range(b+1, 21):
                yield dict(a=a, b=b, c=c)

# Yield row only if a^2 + b^2 == c^2
def filter_pythagorean_triplets(rows):
    for row in rows:
        if row['a']**2 + row['b']**2 == row['c']**2:
            yield row

f = Flow(
    all_triplets(),
    set_type('a', type='integer'),
    set_type('b', type='integer'),
    set_type('c', type='integer'),
    filter_pythagorean_triplets,
    dump_to_path('pythagorean_triplets')
)
_ = f.process()

# -->
# pythagorean_triplets/res_1.csv contains:
# a,b,c
# 3,4,5
# 5,12,13
# 6,8,10
# 8,15,17
# 9,12,15
# 12,16,20
```

The `filter_pythagorean_triplets` function takes an iterator of rows, and yields only the ones that pass its condition.

The flow framework knows whether a function is meant to handle a single row or a row iterator based on its parameters:

- if it accepts a single `row` parameter, then it's a row processor.
- if it accepts a single `rows` parameter, then it's a rows processor.
- if it accepts a single `package` parameter, then it's a package processor.
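The naming rule in the list above can be sketched with `inspect.signature`. This is an illustration only: how DataFlows inspects functions internally is not shown in this document, and `classify` and `run_step` are hypothetical helpers that handle just the `row` and `rows` cases.

```python
import inspect

def classify(func):
    """Return the processor kind implied by the function's only parameter."""
    params = list(inspect.signature(func).parameters)
    if len(params) != 1 or params[0] not in ('row', 'rows', 'package'):
        raise TypeError(f'cannot classify {func.__name__}')
    return params[0]

def run_step(func, rows):
    """Apply a row or rows processor to an iterable of rows."""
    kind = classify(func)
    if kind == 'row':
        for row in rows:
            result = func(row)
            # A row processor may mutate in place and return None
            yield row if result is None else result
    elif kind == 'rows':
        yield from func(rows)
    else:
        # Package processors need metadata handling, omitted in this sketch
        raise NotImplementedError(kind)

def upper_name(row):
    row['name'] = row['name'].upper()

def only_guitarists(rows):
    return (r for r in rows if r['instrument'] == 'guitar')

data = [
    {'name': 'john', 'instrument': 'guitar'},
    {'name': 'ringo', 'instrument': 'drums'},
]
out = list(run_step(only_guitarists, run_step(upper_name, data)))
print(out)
```

Chaining `run_step` calls keeps everything lazy: each row flows through both steps before the next one is read.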

Let's see a few examples of what we can do with package processors.

First, let's add a field to the data:

```python
from dataflows import Flow, load, dump_to_path

def add_is_guitarist_column_to_schema(package):
    # Add a new field to the first resource
    package.pkg.resources[0].descriptor['schema']['fields'].append(dict(
        name='is_guitarist',
        type='boolean'
    ))
    # Must yield the modified datapackage
    yield package.pkg
    # And its resources
    yield from package

def add_is_guitarist_column(row):
    row['is_guitarist'] = row['instrument'] == 'guitar'
    return row

f = Flow(
    # Same one as above
    load('beatles.csv'),
    add_is_guitarist_column_to_schema,
    add_is_guitarist_column,
    dump_to_path('beatles_guitarists')
)
_ = f.process()

```

In this example we create two steps - one for adding the new field (`is_guitarist`) to the schema and another step to modify the actual data.

We can combine the two into one step:

```python
from dataflows import Flow, load, dump_to_path

def add_is_guitarist_column(package):

    # Add a new field to the first resource
    package.pkg.resources[0].descriptor['schema']['fields'].append(dict(
        name='is_guitarist',
        type='boolean'
    ))
    # Must yield the modified datapackage
    yield package.pkg

    # Now iterate on all resources
    resources = iter(package)
    # Take the first resource
    beatles = next(resources)

    # And yield it with the modification
    def f(row):
        row['is_guitarist'] = row['instrument'] == 'guitar'
        return row

    yield map(f, beatles)

f = Flow(
    # Same one as above
    load('beatles.csv'),
    add_is_guitarist_column,
    dump_to_path('beatles_guitarists')
)
_ = f.process()
```

The contract for the `package` processing function is simple:

First modify `package.pkg` (which is a `Package` instance) and yield it.

Then, yield any resources that should exist on the output, with or without modifications.
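The contract described above (metadata first, then resources) can be tried without DataFlows installed. In this sketch `FakePackage` and `run` are hypothetical stand-ins: `pkg` is a plain dict and not a real `Package` instance, and `run` only imitates what a flow would do when it consumes a package processor.

```python
class FakePackage:
    """Stand-in for the wrapper a package processor receives."""
    def __init__(self, descriptor, resources):
        self.pkg = descriptor          # a plain dict here, not a Package
        self._resources = resources

    def __iter__(self):
        return iter(self._resources)

def add_length_field(package):
    # 1. Modify the metadata and yield it first
    package.pkg['fields'].append('name_length')
    yield package.pkg
    # 2. Then yield every resource that should appear in the output
    for resource in package:
        yield ({**row, 'name_length': len(row['name'])} for row in resource)

def run(processor, package):
    """Consume a processor: the first item is metadata, the rest are resources."""
    stream = processor(package)
    descriptor = next(stream)
    return descriptor, [list(resource) for resource in stream]

pkg = FakePackage({'fields': ['name']}, [[{'name': 'john'}, {'name': 'ringo'}]])
descriptor, resources = run(add_length_field, pkg)
print(descriptor)
print(resources)
```

Leaving a resource out of the second phase is how a processor drops it from the output.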

In the next example we remove an entire resource in a package processor. The flow filters the list of Academy Award nominees down to those who won both an Oscar and an Emmy award:

```python
from dataflows import Flow, load, dump_to_path

def find_double_winners(package):

    # Remove the emmies resource -
    # we're going to consume it now
    package.pkg.remove_resource('emmies')
    # Must yield the modified datapackage
    yield package.pkg

    # Now iterate on all resources
    resources = iter(package)

    # Emmies is the first -
    # read all its data and create a set of winner names
    emmy = next(resources)
    emmy_winners = set(
        map(lambda x: x['nominee'],
            filter(lambda x: x['winner'],
                   emmy))
    )

    # Oscars are next -
    # filter rows based on the emmy winner set
    academy = next(resources)
    yield filter(lambda row: (row['Winner'] and
                              row['Name'] in emmy_winners),
                 academy)

f = Flow(
    # Emmy award nominees and winners
    load('emmy.csv', name='emmies'),
    # Academy award nominees and winners
    load('academy.csv', encoding='utf8', name='oscars'),
    find_double_winners,
    dump_to_path('double_winners')
)
_ = f.process()

# -->
# double_winners/academy.csv contains:
# 1931/1932,5,Actress,1,Helen Hayes,The Sin of Madelon Claudet
# 1932/1933,6,Actress,1,Katharine Hepburn,Morning Glory
# 1935,8,Actress,1,Bette Davis,Dangerous
# 1938,11,Actress,1,Bette Davis,Jezebel
# ...
```

## Builtin Processors

DataFlows comes with a few built-in processors which do most of the heavy lifting in many common scenarios,
leaving you to implement only the minimum code that is specific to your problem.

### Load and Save Data
#### load
Loads data from various source types (local files, remote URLs, Google Spreadsheets, databases...)

#### printer
Just prints whatever it sees. Good for debugging.

#### dump_to_path
Store the results to a specified path on disk, in a valid datapackage

#### dump_to_zip
Store the results in a valid datapackage, all files archived in one zip file

#### dump_to_sql
Store the results in a relational database (creates one or more tables or updates existing tables)

### Manipulate row-by-row
#### delete_fields
Removes some columns from the data

#### add_computed_field
Adds new fields whose values are based on existing columns

#### find_replace
Look for specific patterns in specific fields and replace them with new data

#### set_type
Parse incoming data based on provided schema, validate the data in the process

### Manipulate the entire resource
#### sort_rows
Sort incoming data based on key

#### unpivot
Unpivot a table - convert one row with multiple value columns to multiple rows with one value column
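As a rough picture of the transformation, here is a plain-Python sketch. It is not the built-in processor and does not use its parameters; the function name, its arguments and the sample data are all hypothetical.

```python
def unpivot_rows(rows, key_fields, value_fields, key_name='key', value_name='value'):
    """Turn each wide row into one narrow row per value column."""
    for row in rows:
        # Columns that identify the row are copied to every output row
        base = {k: row[k] for k in key_fields}
        for field in value_fields:
            yield {**base, key_name: field, value_name: row[field]}

wide = [{'country': 'A', '2017': 10, '2018': 12}]
long_rows = list(unpivot_rows(wide, ['country'], ['2017', '2018'], 'year', 'amount'))
print(long_rows)
```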

#### filter_rows
Filter rows based on inclusive and exclusive value filters

### Manipulate package
#### add_metadata
Add high-level metadata about your package

#### concatenate
Concatenate multiple streams of data to a single one, resolving differently named columns along the way

#### duplicate
Duplicate a single stream of data to make two streams

Release history

0.5.5

Apr 1, 2024

0.5.4

Mar 22, 2024

0.5.3

Mar 22, 2024

0.5.2

Mar 22, 2024

0.5.1

Mar 22, 2024

0.5.0

Mar 20, 2024

0.4.14

Mar 13, 2024

0.4.12

Mar 13, 2024

0.4.11

Mar 13, 2024

0.4.10

Mar 13, 2024

0.4.9

Mar 12, 2024

0.4.8

Mar 12, 2024

0.4.7

Mar 12, 2024

0.4.5

Oct 11, 2023

0.4.3

Sep 26, 2023

0.4.2

Sep 26, 2023

0.4.1

Sep 26, 2023

0.4.0

Jul 19, 2023

0.3.23

Jul 18, 2023

0.3.22

Apr 17, 2023

0.3.20

Feb 21, 2023

0.3.19

Feb 20, 2023

0.3.18

Feb 20, 2023

0.3.16

Aug 18, 2022

0.3.15

Jul 31, 2022

0.3.14

Jul 26, 2022

0.3.13

Jul 4, 2022

0.3.12

May 29, 2022

0.3.11

Jan 26, 2022

0.3.8

Oct 18, 2021

0.3.7

Oct 17, 2021

0.3.6

Oct 17, 2021

0.3.4

Oct 6, 2021

0.3.3

Sep 30, 2021

0.3.2

Sep 24, 2021

0.3.1

Aug 23, 2021

0.3.0

Aug 22, 2021

0.2.18

Aug 4, 2021

0.2.17

May 31, 2021

0.2.16

May 15, 2021

0.2.15

May 14, 2021

0.2.14

May 14, 2021

0.2.13

May 3, 2021

0.2.12

Apr 12, 2021

0.2.11

Apr 7, 2021

0.2.10

Apr 6, 2021

0.2.9

Mar 27, 2021

0.2.8

Mar 21, 2021

0.2.7

Mar 15, 2021

0.2.5

Feb 17, 2021

0.2.4

Feb 17, 2021

0.2.3

Feb 17, 2021

0.2.2

Dec 22, 2020

0.2.1

Dec 6, 2020

0.2.0

Nov 23, 2020

0.1.15

Nov 17, 2020

0.1.14

Nov 17, 2020

0.1.13

Nov 8, 2020

0.1.12

Nov 7, 2020

0.1.11

Nov 5, 2020

0.1.10

Oct 20, 2020

0.1.9

Oct 16, 2020

0.1.8

Oct 11, 2020

0.1.7

Oct 7, 2020

0.1.6

Aug 23, 2020

0.1.5

Aug 11, 2020

0.1.4

Jul 30, 2020

0.1.3

Jul 29, 2020

0.1.2

Jun 21, 2020

0.1.1

Jun 13, 2020

0.1.0

May 26, 2020

0.0.74

May 25, 2020

0.0.73

May 25, 2020

0.0.72

May 15, 2020

0.0.71

Feb 20, 2020

0.0.68

Feb 5, 2020

0.0.67

Jan 19, 2020

0.0.66

Jan 14, 2020

0.0.65

Dec 26, 2019

0.0.64

Nov 17, 2019

0.0.63

Oct 8, 2019

0.0.62

Oct 7, 2019

0.0.60

Oct 3, 2019

0.0.59

Oct 3, 2019

0.0.58

Sep 2, 2019

0.0.57

Jul 2, 2019

0.0.56

Jun 16, 2019

0.0.55

May 27, 2019

0.0.54

May 27, 2019

0.0.53

May 23, 2019

0.0.52

May 13, 2019

0.0.51

May 2, 2019

0.0.50

Apr 28, 2019

0.0.49

Apr 28, 2019

0.0.48

Apr 6, 2019

0.0.47

Apr 5, 2019

0.0.46

Mar 30, 2019

0.0.45

Mar 25, 2019

0.0.44

Mar 9, 2019

0.0.43

Mar 9, 2019

0.0.42

Mar 9, 2019

0.0.39

Jan 20, 2019

0.0.38

Jan 13, 2019

0.0.37

Nov 27, 2018

0.0.36

Nov 26, 2018

0.0.35

Nov 22, 2018

0.0.34

Nov 22, 2018

0.0.33

Nov 18, 2018

0.0.32

Oct 29, 2018

0.0.31

Oct 21, 2018

0.0.30

Oct 19, 2018

0.0.29

Oct 18, 2018

0.0.28

Oct 17, 2018

0.0.27

Oct 17, 2018

0.0.26

Oct 17, 2018

0.0.25

Oct 17, 2018

0.0.24

Oct 17, 2018

0.0.23

Oct 16, 2018

0.0.22

Oct 16, 2018

0.0.21

Oct 16, 2018

0.0.20

Oct 15, 2018

0.0.19

Oct 10, 2018

0.0.18

Oct 10, 2018

0.0.17

Oct 10, 2018

0.0.16

Oct 10, 2018

0.0.15

Oct 9, 2018

0.0.14

Oct 8, 2018

0.0.13

Oct 7, 2018

0.0.12

Oct 3, 2018

0.0.11

Oct 3, 2018

0.0.10

Sep 13, 2018

0.0.9

Sep 13, 2018

0.0.8

Sep 8, 2018

0.0.7

Aug 1, 2018

0.0.6

Jul 12, 2018

0.0.5

Jul 7, 2018

0.0.4

Jul 7, 2018

0.0.3

Jun 27, 2018

This version

0.0.2

Jun 20, 2018

0.0.1

Jun 7, 2018

Source Distribution

dataflows-0.0.2.tar.gz (21.5 kB view hashes)

Uploaded Jun 20, 2018 Source

Hashes for dataflows-0.0.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `746ad3db83ccdc4961ff1773c454f3bdb47f47a79c6283aa1978167c91906ca2` |
| MD5 | `241b9f4caef8076109a4b536707de899` |
| BLAKE2b-256 | `96a37daef5d0e89645067fa864e596671125e376731fb4f1c2df34aeb7c48e3e` |