Datapad is a library of lazy data transformations for sequences, similar to Spark and LINQ.

Datapad

Datapad is a library for operating on sequences of data using a fluent API. Think of it as an extended and chainable version of Python's itertools library.
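
To make the itertools comparison concrete, here is a minimal sketch of a lazy, chainable wrapper built only on the standard library. `LazySeq` is a hypothetical name used for illustration; it is not Datapad's implementation and only mirrors a few of the method names used later in this README.

```python
from itertools import chain


class LazySeq:
    """Hypothetical stand-in for a fluent sequence; not Datapad's code."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def map(self, fn):
        # map() is lazy: nothing runs until the result is consumed.
        return LazySeq(map(fn, self._it))

    def filter(self, fn):
        return LazySeq(filter(fn, self._it))

    def flatmap(self, fn):
        return LazySeq(chain.from_iterable(map(fn, self._it)))

    def collect(self):
        # Consuming the iterator is what triggers the work.
        return list(self._it)


calls = []


def double(v):
    calls.append(v)  # record each call so laziness is observable
    return v * 2


seq = LazySeq(range(5)).map(double).filter(lambda v: v > 2)
pending = list(calls)   # still empty: building the chain did no work
result = seq.collect()  # [4, 6, 8]
```

Each method returns a new wrapper around a lazy iterator, so a chain of calls only describes the pipeline; the elements are processed once, when `collect()` is called.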

This project incorporates ideas from:

Install

pip install datapad

Basic usage

create a sequence

from datapad import Sequence
seq = Sequence(range(10))

map:

seq = Sequence(range(10))
seq = seq.map(lambda v: v*2)
seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

filter:

seq = Sequence(range(5))
seq = seq.filter(lambda v: v > 1)
seq.collect()
[2, 3, 4]

pmap_unordered:

seq = Sequence(range(10))
seq = seq.pmap_unordered(lambda v: v*2)
sorted(seq.collect())  # results may arrive in any order, so sort before comparing
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
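
The name suggests a parallel map whose results arrive in completion order rather than input order. As an illustration of that idea only, the sketch below uses the standard library's `concurrent.futures` instead of Datapad; `unordered_pmap` and its `workers` parameter are hypothetical, and using threads rather than processes is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def unordered_pmap(fn, items, workers=3):
    """Hypothetical helper: yield fn(item) as each task finishes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, item) for item in items]
        for future in as_completed(futures):
            yield future.result()


results = list(unordered_pmap(lambda v: v * 2, range(10)))
# The order of `results` can differ between runs, so sort before comparing.
ordered = sorted(results)
```

Giving up ordering lets fast tasks be consumed without waiting for slower ones submitted earlier, which is the usual reason to prefer an unordered parallel map.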

flatmap:

seq = Sequence(range(5))
seq = seq.flatmap(lambda v: [v,v])
seq.collect()
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

dropwhile:

seq = Sequence(range(5))
seq = seq.dropwhile(lambda v: v > 1)
seq.collect()
[0, 1]
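
For comparison with the standard library: the output shown above keeps the leading elements and stops at the first one that satisfies the predicate, which is what `itertools.takewhile` does with the negated predicate. `itertools.dropwhile` with the same predicate would return the whole range, because the first element (0) already fails `v > 1`. This is a reading of the example output, not a statement about Datapad's internals.

```python
from itertools import dropwhile, takewhile


def pred(v):
    return v > 1


# Matches the output shown above: keep elements until pred first holds.
kept = list(takewhile(lambda v: not pred(v), range(5)))

# Standard-library dropwhile drops nothing here, since pred(0) is False.
std = list(dropwhile(pred, range(5)))
```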

count:

seq = Sequence(range(5))
seq.count()
5

drop:

seq = Sequence(range(5))
seq = seq.drop(2)
seq.collect()
[2, 3, 4]

sort:

seq = Sequence([2, 1, 0, 4, 3])
seq.sort().collect()
[0, 1, 2, 3, 4]

groupby:

things = [("animal", "lion"), ("plant", "maple tree"), ("animal", "walrus"), ("plant", "grass")]
seq = Sequence(things)
groups = seq.sort().groupby(key=lambda x: x[0], getter=lambda x: x[1])
for key, group in groups:
   print(key, group)
animal ['lion', 'walrus']
plant ['grass', 'maple tree']
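
The `sort()` before `groupby` matters if the grouping works like `itertools.groupby`, which only merges adjacent items with equal keys. That is an assumption about Datapad's behaviour; the sketch below runs the same pipeline with the standard library, and `grouped` is a hypothetical helper.

```python
from itertools import groupby

things = [("animal", "lion"), ("plant", "maple tree"),
          ("animal", "walrus"), ("plant", "grass")]


def grouped(rows):
    # Group adjacent rows by their first element, keeping the second.
    return [(key, [name for _, name in group])
            for key, group in groupby(rows, key=lambda x: x[0])]


unsorted_groups = grouped(things)        # four groups: the keys repeat
sorted_groups = grouped(sorted(things))  # two groups, as in the output above
```

Sorting the tuples also orders the values inside each group, which is why "grass" precedes "maple tree" in the output.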

Operating on fields within a row of a Sequence

Suppose we have the following sequence

seq = Sequence([
   {'a': 1},
   {'a': 2},
   {'a': 3}
])

Add fields:

import datapad.fields as f
seq = seq.map(f.add({'b': lambda row: row['a'] + 3}))
seq.collect()
[
   {'a': 1, 'b': 4},
   {'a': 2, 'b': 5},
   {'a': 3, 'b': 6}
]

Apply functions to fields:

import datapad.fields as f
seq = seq.map(f.apply({'a': lambda x: x*2}))
seq.collect()
[
   {'a': 2},
   {'a': 4},
   {'a': 6}
]

Select fields from a row:

import datapad as dp
import datapad.fields as f
seq = dp.Sequence([{'a': 1, 'b': 2, 'c': 3}])
seq = seq.map(f.select(['a', 'c']))
seq.collect()
[
   {'a': 1, 'c': 3},
]
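
The three helpers above each produce a function that takes a row (a dict) and returns a row, which is why they can be passed straight to `map`. A plain-Python sketch of that pattern follows. `add_fields`, `apply_fields`, and `select_fields` are hypothetical names, and whether Datapad copies or mutates rows is an assumption here (this sketch copies).

```python
def add_fields(spec):
    # spec maps a new field name to a function of the whole row.
    def fn(row):
        out = dict(row)
        for name, compute in spec.items():
            out[name] = compute(row)
        return out
    return fn


def apply_fields(spec):
    # spec maps an existing field name to a function of that field's value.
    def fn(row):
        out = dict(row)
        for name, transform in spec.items():
            out[name] = transform(row[name])
        return out
    return fn


def select_fields(names):
    # Keep only the named fields, in the order given.
    def fn(row):
        return {name: row[name] for name in names}
    return fn


rows = [{'a': 1}, {'a': 2}, {'a': 3}]
added = list(map(add_fields({'b': lambda row: row['a'] + 3}), rows))
doubled = list(map(apply_fields({'a': lambda x: x * 2}), rows))
picked = select_fields(['a', 'c'])({'a': 1, 'b': 2, 'c': 3})
```

Note the difference in what the lambdas receive: the add-style helper passes the whole row, while the apply-style helper passes only the value of the named field.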


Download files

Source Distribution

datapad-0.6.1.tar.gz (8.6 kB)
