riko

A stream processing engine modeled after Yahoo! Pipes.

Introduction

riko is a pure Python library for analyzing and processing streams of structured data. riko has synchronous and asynchronous APIs, supports parallel execution, and is well suited for processing RSS feeds [1]. riko also supplies a command-line interface for executing flows, i.e., stream processors aka workflows.

With riko, you can

Read csv/xml/json/html files
Create text and data based flows via modular pipes
Parse, extract, and process RSS/Atom feeds
Create awesome mashups [2], APIs, and maps
Perform parallel processing via cpus/processors or threads
and much more…

Requirements

riko has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.

Optional Dependencies

Feature	Dependency	Installation
Async API	Twisted	pip install riko[async]
Accelerated xml parsing	lxml [3]	pip install riko[xml]
Accelerated feed parsing	speedparser [4]	pip install riko[xml]

Word Count

In this example, we use several pipes to count the words on a webpage.

>>> ### Create a SyncPipe flow ###
>>> #
>>> # `SyncPipe` is a convenience class that creates chainable flows
>>> # and allows for parallel processing.
>>> from riko.collections import SyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> #
>>> # Notes:
>>> #   1. the `detag` option will strip all html tags from the result
>>> #   2. fetch the text contained inside the 'body' tag of the hackernews
>>> #      homepage
>>> #   3. replace newlines with spaces and assign the result to 'content'
>>> #   4. tokenize the resulting text using whitespace as the delimiter
>>> #   5. count the number of times each token appears
>>> #   6. obtain the raw stream
>>> #   7. extract the first word and its count
>>> #   8. extract the second word and its count
>>> #   9. extract the third word and its count
>>> url = 'https://news.ycombinator.com/'
>>> fetch_conf = {
...     'url': url, 'start': '<body>', 'end': '</body>', 'detag': True}  # 1
>>>
>>> replace_conf = {
...     'rule': [
...         {'find': '\r\n', 'replace': ' '},
...         {'find': '\n', 'replace': ' '}]}
>>>
>>> flow = (
...     SyncPipe('fetchpage', conf=fetch_conf)                           # 2
...         .strreplace(conf=replace_conf, assign='content')             # 3
...         .tokenizer(conf={'delimiter': ' '}, emit=True)               # 4
...         .count(conf={'count_key': 'content'}))                       # 5
>>>
>>> stream = flow.output                                                 # 6
>>> next(stream)                                                         # 7
{"'sad": 1}
>>> next(stream)                                                         # 8
{'(': 28}
>>> next(stream)                                                         # 9
{'(1999)': 1}
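
To see what steps 3 to 5 of this flow compute without any network access, here is a rough plain-Python equivalent. It is only a sketch: the sample text is made up, `count_tokens` is a hypothetical helper that is not part of riko, and it drops empty tokens, which may differ from what riko's tokenizer does.

```python
from collections import Counter


def count_tokens(text, delimiter=' '):
    """Replace newlines with spaces, split on the delimiter, and tally tokens."""
    # Mirror the two `strreplace` rules: handle '\r\n' before '\n'.
    flattened = text.replace('\r\n', ' ').replace('\n', ' ')
    # Drop empty strings produced by consecutive delimiters (an assumption).
    tokens = [token for token in flattened.split(delimiter) if token]
    return Counter(tokens)


sample = 'riko counts words\r\nriko counts\nriko'
counts = count_tokens(sample)
print(counts['riko'], counts['counts'], counts['words'])
```

Each key of the resulting counter corresponds to one item that the `count` pipe emits when `count_key` is set.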

Motivation

Why I built riko

Yahoo! Pipes [5] was a user-friendly web application used to "aggregate, manipulate, and mashup content from around the web".

Wanting to create custom pipes, I came across pipe2py which translated a Yahoo! Pipe into python code. pipe2py suited my needs at the time but was unmaintained and lacked asynchronous or parallel processing.

riko addresses the shortcomings of pipe2py but removes support for importing Yahoo! Pipes JSON workflows. riko contains ~40 built-in modules, aka pipes, that allow you to programmatically perform most of the tasks Yahoo! Pipes allowed.

Why you should use riko

riko differs from other stream processing applications such as Huginn, Flink, Spark, and Storm [6] in several ways. Namely, it offers:

a small footprint (CPU and memory usage)
native RSS/Atom support
simple installation and usage
a pure python library with pypy support
builtin modular pipes to filter, sort, and modify streams

In exchange, riko makes the following tradeoffs:

not distributed (cannot run on a cluster of servers)
no GUI for creating flows
doesn’t continually monitor streams for new data
can’t react to specific events
iterator (pull) based so streams only support a single consumer [7]
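
The single-consumer tradeoff in the last item follows from how Python iterators behave, and can be seen without riko at all. The sketch below uses a hand-made stream; `itertools.tee` is one standard-library workaround, offered here as an assumption rather than something riko prescribes.

```python
from itertools import tee

stream = iter([{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}])

first_pass = list(stream)   # consumes every item
second_pass = list(stream)  # nothing is left for a second consumer

# One possible workaround: split the iterator before consuming it.
source = iter([{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}])
consumer_a, consumer_b = tee(source)
titles_a = [item['title'] for item in consumer_a]
titles_b = [item['title'] for item in consumer_b]

print(len(first_pass), len(second_pass))
print(titles_a == titles_b)
```

Note that `tee` buffers items until every copy has read them, so it trades memory for the ability to read a stream twice.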

The following table summarizes these observations:

library	Stream Type	Footprint	RSS	simple [8]	async	parallel	CEP [9]	distributed
riko	pull	small	√	√	√	√
pipe2py	pull	small	√	√
Huginn	push	med	√		[10]	√	√
Others	push	large	[11]	[12]	[13]	√	√	√

For more detailed information, please check out the FAQ.

Usage

riko is intended to be used directly as a Python library.

Fetching feeds

riko can fetch RSS feeds from both local and remote file paths via “source” pipes. Each “source” pipe returns a stream, i.e., an iterator of dictionaries, aka items.

>>> from riko.modules import fetch, fetchsitefeed
>>>
>>> ### Fetch an RSS feed ###
>>> stream = fetch.pipe(conf={'url': 'https://news.ycombinator.com/rss'})
>>>
>>> ### Fetch the first RSS feed found ###
>>> stream = fetchsitefeed.pipe(conf={'url': 'http://arstechnica.com/rss-feeds/'})
>>>
>>> ### View the fetched RSS feed(s) ###
>>> #
>>> # Note: regardless of how you fetch an RSS feed, it will have the same
>>> # structure
>>> item = next(stream)
>>> item.keys()
dict_keys(['title_detail', 'author.uri', 'tags', 'summary_detail', 'author_detail',
           'author.name', 'y:published', 'y:title', 'content', 'title', 'pubDate',
           'guidislink', 'id', 'summary', 'dc:creator', 'authors', 'published_parsed',
           'links', 'y:id', 'author', 'link', 'published'])

>>> item['title'], item['author'], item['id']
('Gravity doesn’t care about quantum spin',
 'Chris Lee',
 'http://arstechnica.com/?p=924009')

Please see the FAQ for a complete list of supported file types and protocols. Please see Fetching data and feeds for more examples.
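
To make the idea of "a stream is an iterator of dictionaries" concrete, the following sketch turns a small inline RSS document into such a stream using only the standard library. The feed text and the `parse_items` helper are made up for illustration; riko's own source pipes return many more keys, as shown above.

```python
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>https://example.com/1</link></item>
<item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""


def parse_items(xml_text):
    """Yield one dictionary per <item> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for element in root.iter('item'):
        # Map each child tag to its text, e.g. {'title': ..., 'link': ...}.
        yield {child.tag: child.text for child in element}


stream = parse_items(RSS)
item = next(stream)
print(item['title'], item['link'])
```

Because `parse_items` is a generator, items are produced only when a consumer asks for them, which matches the pull-based model described earlier.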

Synchronous processing

riko can modify streams via its ~40 built-in pipes.

>>> from riko.collections import SyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> fetch_conf = {'url': 'https://news.ycombinator.com/rss'}
>>> filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}
>>> xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'
>>> xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}
>>>
>>> ### Create a SyncPipe flow ###
>>> #
>>> # `SyncPipe` is a convenience class that creates chainable flows
>>> # and allows for parallel processing.
>>> #
>>> # The following flow will:
>>> #   1. fetch the hackernews RSS feed
>>> #   2. filter for items with '.com' in the link
>>> #   3. sort the items ascending by title
>>> #   4. fetch the first comment from each item
>>> #   5. flatten the result into one raw stream
>>> #   6. extract the first item's content
>>> #
>>> # Note: sorting is not lazy so take caution when using this pipe
>>>
>>> flow = (
...     SyncPipe('fetch', conf=fetch_conf)               # 1
...         .filter(conf={'rule': filter_rule})          # 2
...         .sort(conf={'rule': {'sort_key': 'title'}})  # 3
...         .xpathfetchpage(conf=xpath_conf))            # 4
>>>
>>> stream = flow.output                                 # 5
>>> next(stream)['content']                              # 6
'Open Artificial Pancreas home:'

Please see alternate workflow creation for an alternative (function based) method for creating a stream. Please see pipes for a complete list of available pipes.
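
The note that sorting is not lazy can be illustrated with plain Python. In this sketch, `contains_filter` and `sort_by` are hypothetical stand-ins for the filter and sort steps, and the items are invented; the point is only that a filter can yield items one at a time while a sort has to read the whole stream first.

```python
def contains_filter(stream, rule):
    """Lazily yield items whose `field` contains `value` (hypothetical helper)."""
    for item in stream:
        if rule['value'] in item.get(rule['field'], ''):
            yield item


def sort_by(stream, key):
    """Sorting must read the whole stream before yielding the first item."""
    return iter(sorted(stream, key=lambda item: item[key]))


items = [
    {'title': 'b', 'link': 'https://example.com/b'},
    {'title': 'c', 'link': 'https://example.org/c'},
    {'title': 'a', 'link': 'https://example.com/a'},
]
rule = {'field': 'link', 'op': 'contains', 'value': '.com'}
stream = sort_by(contains_filter(iter(items), rule), 'title')
print(next(stream)['title'])
```

On a very large or unbounded stream, the `sorted` call is where memory use and latency would grow, hence the caution in the flow above.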

Parallel processing

An example using riko’s parallel API to spawn a ThreadPool [14]:

>>> from riko.collections import SyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> fetch_conf = {'url': 'https://news.ycombinator.com/rss'}
>>> filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}
>>> xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'
>>> xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}
>>>
>>> ### Create a parallel SyncPipe flow ###
>>> #
>>> # The following flow will:
>>> #   1. fetch the hackernews RSS feed
>>> #   2. filter for items with '.com' in the article link
>>> #   3. fetch the first comment from all items in parallel (using 4 workers)
>>> #   4. flatten the result into one raw stream
>>> #   5. extract the first item's content
>>> #
>>> # Note: no point in sorting after the filter since parallel fetching doesn't guarantee
>>> # order
>>> flow = (
...     SyncPipe('fetch', conf=fetch_conf, parallel=True, workers=4)  # 1
...         .filter(conf={'rule': filter_rule})                       # 2
...         .xpathfetchpage(conf=xpath_conf))                         # 3
>>>
>>> stream = flow.output                                              # 4
>>> next(stream)['content']                                           # 5
'He uses the following example for when to throw your own errors:'
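
The remark that parallel fetching does not guarantee order can be reproduced with the standard library's thread pool. This is a sketch under stated assumptions: `fetch_comment` is a made-up stand-in that does no network access, and riko may wire up its workers differently.

```python
from multiprocessing.pool import ThreadPool


def fetch_comment(item):
    """Stand-in for a slow per-item fetch (hypothetical; no network access)."""
    return {'content': 'comment for ' + item['title']}


items = [{'title': 'riko pt. %d' % n} for n in range(1, 6)]

with ThreadPool(4) as pool:
    # `imap_unordered` yields results as workers finish, so order may vary.
    results = list(pool.imap_unordered(fetch_comment, items))

print(sorted(result['content'] for result in results))
```

Sorting the results before printing is what makes the output stable here; without it, the order could change from run to run.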

Asynchronous processing

To enable asynchronous processing, you must install the async module.

pip install riko[async]

An example using riko’s asynchronous API.

>>> from riko.bado import coroutine, react
>>> from riko.collections import AsyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> fetch_conf = {'url': 'https://news.ycombinator.com/rss'}
>>> filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}
>>> xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'
>>> xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}
>>>
>>> ### Create an AsyncPipe flow ###
>>> #
>>> # The following flow will:
>>> #   1. fetch the hackernews RSS feed
>>> #   2. filter for items with '.com' in the article link
>>> #   3. asynchronously fetch the first comment from each item (using 4 connections)
>>> #   4. flatten the result into one raw stream
>>> #   5. extract the first item's content
>>> #
>>> # Note: no point in sorting after the filter since async fetching doesn't guarantee
>>> # order
>>> @coroutine
... def run(reactor):
...     stream = yield (
...         AsyncPipe('fetch', conf=fetch_conf, connections=4)  # 1
...             .filter(conf={'rule': filter_rule})             # 2
...             .xpathfetchpage(conf=xpath_conf)                # 3
...             .output)                                        # 4
...
...     print(next(stream)['content'])                          # 5
>>>
>>> try:
...     react(run)
... except SystemExit:
...     pass
Here's how iteration works ():
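
For readers unfamiliar with Twisted, the same idea of a bounded number of concurrent fetches can be sketched with the standard library's `asyncio`. This deliberately swaps in a different technique from the one riko uses, and `fetch_comment` and `run` are hypothetical names that do no real network I/O.

```python
import asyncio


async def fetch_comment(item, limit):
    """Stand-in for an asynchronous per-item fetch (hypothetical)."""
    async with limit:  # at most `connections` fetches run at once
        await asyncio.sleep(0)  # placeholder for real network I/O
        return {'content': 'comment for ' + item['title']}


async def run(items, connections=4):
    limit = asyncio.Semaphore(connections)
    tasks = [fetch_comment(item, limit) for item in items]
    # `as_completed` yields results in completion order, not input order.
    return [await task for task in asyncio.as_completed(tasks)]


items = [{'title': 'riko pt. %d' % n} for n in range(1, 4)]
results = asyncio.run(run(items))
print(sorted(result['content'] for result in results))
```

As with the parallel example, results arrive in completion order, which is why sorting after an asynchronous fetch step is the only way to get a stable order.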

Cookbook

Please see the cookbook or ipython notebook for more examples.

Installation

(You are using a virtualenv, right?)

At the command line, install riko using either pip (recommended)

pip install riko

or easy_install

easy_install riko

Please see the installation doc for more details.

Design Principles

The primary data structures in riko are the item and stream. An item is just a Python dictionary, and a stream is an iterator of items. You can create a stream manually with something as simple as [{'content': 'hello world'}]. You manipulate streams in riko via pipes. A pipe is simply a function that accepts either a stream or item, and returns a stream. Pipes are composable: you can use the output of one pipe as the input to another pipe.
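
The composability described above can be sketched with ordinary generator functions. The two functions below are hypothetical and are not riko pipes; they only show the shape of "stream in, stream out" and how one output feeds the next input.

```python
def uppercase_title(stream):
    """A pipe in the loose sense used above: stream in, stream out."""
    for item in stream:
        yield {**item, 'title': item['title'].upper()}


def add_length(stream):
    """A second pipe that adds a 'length' field to each item."""
    for item in stream:
        yield {**item, 'length': len(item['title'])}


stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
# Composition: the output of one pipe is the input of the next.
composed = add_length(uppercase_title(stream))
print(next(composed))
```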

riko pipes come in two flavors: operators and processors. Operators operate on an entire stream at once and are unable to handle individual items. Example operators include count, pipefilter, and reverse.

>>> from riko.modules.reverse import pipe
>>>
>>> stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
>>> next(pipe(stream))
{'title': 'riko pt. 2'}

processors process individual items and can be parallelized across threads or processes. Example processors include fetchsitefeed, hash, pipeitembuilder, and piperegex.

>>> from riko.modules.hash import pipe
>>>
>>> item = {'title': 'riko pt. 1'}
>>> stream = pipe(item, field='title')
>>> next(stream)
{'title': 'riko pt. 1', 'hash': 2853617420}

Some processors, e.g., pipetokenizer, return multiple results.

>>> from riko.modules.tokenizer import pipe
>>>
>>> item = {'title': 'riko pt. 1'}
>>> tokenizer_conf = {'delimiter': ' '}
>>> stream = pipe(item, conf=tokenizer_conf, field='title')
>>> next(stream)
{'tokenizer': [{'content': 'riko'},
   {'content': 'pt.'},
   {'content': '1'}],
 'title': 'riko pt. 1'}

>>> # In this case, if we just want the result, we can `emit` it instead
>>> stream = pipe(item, conf=tokenizer_conf, field='title', emit=True)
>>> next(stream)
{'content': 'riko'}

operators are split into sub-types of aggregators and composers. aggregators, e.g., count, combine all items of an input stream into a new stream with a single item; while composers, e.g., filter, create a new stream containing some or all items of an input stream.

>>> from riko.modules.count import pipe
>>>
>>> stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
>>> next(pipe(stream))
{'count': 2}

In case you are confused by the “Word Count” example up top, count can return multiple items if you pass in the count_key config option.

>>> counted = pipe(stream, conf={'count_key': 'title'})
>>> next(counted)
{'riko pt. 1': 1}
>>> next(counted)
{'riko pt. 2': 1}
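
The difference between aggregators and composers, and the effect of a count key, can be sketched in plain Python. All three functions below are hypothetical illustrations rather than riko's implementations.

```python
from collections import Counter


def count_all(stream):
    """Aggregator-style: collapse the whole stream into a single item."""
    yield {'count': sum(1 for _ in stream)}


def count_by(stream, key):
    """Like `count_key`: emit one item per distinct value of `key`."""
    tally = Counter(item[key] for item in stream)
    for value, total in tally.items():
        yield {value: total}


def keep_if(stream, predicate):
    """Composer-style: pass through only the items that match."""
    return (item for item in stream if predicate(item))


stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
print(list(count_all(stream)))
print(list(count_by(stream, 'title')))
print(list(keep_if(stream, lambda item: item['title'].endswith('2'))))
```

In each case the function needs the stream as a whole rather than one item at a time, which is what the text means by an operator.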

Processors are split into sub-types of source and transformer. Sources, e.g., itembuilder, can create a stream, while transformers, e.g., hash, can only transform items in a stream.

>>> from riko.modules.itembuilder import pipe
>>>
>>> attrs = {'key': 'title', 'value': 'riko pt. 1'}
>>> next(pipe(conf={'attrs': attrs}))
{'title': 'riko pt. 1'}

The following table summarizes these observations:

type	sub-type	input	output	parallelizable?	creates streams?
operator	aggregator	stream	stream [15]
operator	composer	stream	stream
processor	source	item	stream	√	√
processor	transformer	item	stream	√

If you are unsure of the type of pipe you have, check its metadata.

>>> from riko.modules import fetchpage, count
>>>
>>> fetchpage.async_pipe.__dict__
{'type': 'processor', 'name': 'fetchpage', 'sub_type': 'source'}
>>> count.pipe.__dict__
{'type': 'operator', 'name': 'count', 'sub_type': 'aggregator'}
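
Metadata like this shows up in `__dict__` because Python functions accept arbitrary attributes. The decorator below is a hypothetical sketch of one way to attach such attributes; how riko attaches them is not shown in this document.

```python
def describe(pipe_type, sub_type):
    """Hypothetical decorator that tags a function with pipe metadata."""
    def decorator(func):
        # Attributes set on a function are stored in its `__dict__`.
        func.type = pipe_type
        func.sub_type = sub_type
        func.name = func.__name__
        return func
    return decorator


@describe('operator', 'aggregator')
def count(stream):
    yield {'count': sum(1 for _ in stream)}


print(count.__dict__)
```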

The SyncPipe and AsyncPipe classes (among other things) perform this check for you to allow for convenient method chaining and transparent parallelization.

>>> from riko.collections import SyncPipe
>>>
>>> attrs = [
...     {'key': 'title', 'value': 'riko pt. 1'},
...     {'key': 'content', 'value': "Let's talk about riko!"}]
>>> flow = SyncPipe('itembuilder', conf={'attrs': attrs}).hash()
>>> flow.list[0]
{'title': 'riko pt. 1',
 'content': "Let's talk about riko!",
 'hash': 1346301218}
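
To give a feel for how a chainable wrapper could use pipe types to decide how to call each step, here is a minimal sketch. `Flow`, `registry`, `add_hash`, and `reverse` are all hypothetical names, and this is not riko's SyncPipe implementation.

```python
class Flow:
    """Hypothetical chainable wrapper; not riko's SyncPipe implementation."""

    def __init__(self, stream, registry):
        self.stream = stream
        self.registry = registry

    def __getattr__(self, name):
        # Only called for attributes not found normally, e.g. pipe names.
        try:
            func, pipe_type = self.registry[name]
        except KeyError:
            raise AttributeError(name)

        def step(**kwargs):
            if pipe_type == 'operator':
                # Operators receive the whole stream at once.
                result = func(self.stream, **kwargs)
            else:
                # Processors run per item; their outputs are flattened.
                result = (out for item in self.stream
                          for out in func(item, **kwargs))
            return Flow(result, self.registry)

        return step

    @property
    def list(self):
        return list(self.stream)


def add_hash(item):
    yield {**item, 'hash': len(item['title'])}  # toy stand-in for a hash


def reverse(stream):
    return reversed(list(stream))


registry = {'hash': (add_hash, 'processor'), 'reverse': (reverse, 'operator')}
flow = Flow([{'title': 'a'}, {'title': 'bb'}], registry).hash().reverse()
result = flow.list
print(result)
```

The per-item branch is the part that could be handed to a pool of workers, since each item is processed independently of the others.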

Please see the cookbook for advanced examples including how to wire in values from other pipes or accept user input.

Command-line Interface

riko provides a command, runpipe, to execute workflows. A workflow is simply a file containing a function named pipe that creates a flow and processes the resulting stream.

CLI Usage

usage: runpipe [pipeid]

description: Runs a riko pipe

positional arguments:
  pipeid       The pipe to run (default: reads from stdin).

optional arguments:
  -h, --help   show this help message and exit
  -a, --async  Load async pipe.
  -t, --test   Run in test mode (uses default inputs).

CLI Setup

flow.py

from __future__ import print_function
from riko.collections import SyncPipe

conf1 = {'attrs': [{'value': 'https://google.com', 'key': 'content'}]}
conf2 = {'rule': [{'find': 'com', 'replace': 'co.uk'}]}

def pipe(test=False):
    kwargs = {'conf': conf1, 'test': test}
    flow = SyncPipe('itembuilder', **kwargs).strreplace(conf=conf2)
    stream = flow.output

    for i in stream:
        print(i)

CLI Examples

Now to execute flow.py, type the command runpipe flow. You should then see the following output in your terminal:

https://google.co.uk

runpipe will also search the examples directory for workflows. Type runpipe demo and you should see the following output:

Deadline to clear up health law eligibility near 682
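
The convention that a workflow is a file exposing a function named pipe can be sketched with `importlib`. This is only an illustration of the convention: `run_workflow` and the `demo_flow` module are hypothetical, and runpipe's actual loading logic is not shown in this document.

```python
import importlib
import sys
import tempfile
from pathlib import Path

WORKFLOW = '''
def pipe(test=False):
    return [{'content': 'https://google.co.uk'}]
'''


def run_workflow(pipeid, directory, test=False):
    """Import `<directory>/<pipeid>.py` and call its `pipe` function."""
    sys.path.insert(0, str(directory))
    try:
        # Make sure the import system notices the freshly written file.
        importlib.invalidate_caches()
        module = importlib.import_module(pipeid)
    finally:
        sys.path.remove(str(directory))
    return module.pipe(test=test)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'demo_flow.py').write_text(WORKFLOW)
    result = run_workflow('demo_flow', tmp)

print(result)
```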

Scripts

riko comes with a built-in task manager, manage.

Setup

pip install riko[develop]

Examples

Run python linter and nose tests

manage lint
manage test

Contributing

Please mimic the coding style/conventions used in this repo. If you add new classes or functions, please add the appropriate doc blocks with examples. Also, make sure the python linter and nose tests pass.

Please see the contributing doc for more details.

Credits

Shoutout to pipe2py for heavily inspiring riko. riko started out as a fork of pipe2py, but has since diverged so much that little (if any) of the original code-base remains.

More Info

Project Structure

┌── benchmarks
│   ├── __init__.py
│   └── parallel.py
├── bin
│   └── run
├── data/*
├── docs
│   ├── AUTHORS.rst
│   ├── CHANGES.rst
│   ├── COOKBOOK.rst
│   ├── FAQ.rst
│   ├── INSTALLATION.rst
│   └── TODO.rst
├── examples/*
├── helpers/*
├── riko
│   ├── __init__.py
│   ├── lib
│   │   ├── __init__.py
│   │   ├── autorss.py
│   │   ├── collections.py
│   │   ├── dotdict.py
│   │   ├── log.py
│   │   ├── tags.py
│   │   └── py
│   ├── modules/*
│   └── twisted
│       ├── __init__.py
│       ├── collections.py
│       └── py
├── tests
│   ├── __init__.py
│   ├── standard.rc
│   └── test_examples.py
├── CONTRIBUTING.rst
├── dev-requirements.txt
├── LICENSE
├── Makefile
├── manage.py
├── MANIFEST.in
├── optional-requirements.txt
├── py2-requirements.txt
├── README.rst
├── requirements.txt
├── setup.cfg
├── setup.py
└── tox.ini

License

riko is distributed under the MIT License.

Release history

This version

0.67.0

Dec 28, 2021

0.66.0

Aug 14, 2020

0.65.0

Aug 14, 2020

0.64.3

Aug 14, 2020

0.64.2

Aug 14, 2020

0.64.1

Aug 14, 2020

0.64.0

Aug 12, 2020

0.63.0

Aug 12, 2020

0.62.2

Jul 30, 2020

0.62.1

Jul 30, 2020

0.62.0

Jul 29, 2020

0.61.4

Jul 29, 2020

0.61.2

Jul 7, 2020

0.61.1

Feb 2, 2020

0.60.4

Sep 13, 2018

0.60.3

Sep 12, 2018

0.60.2

Aug 18, 2018

0.60.1

Aug 18, 2018

0.60.0

May 23, 2018

0.59.1

May 18, 2018

0.59.0

May 18, 2018

0.58.0

May 18, 2018

0.57.0

Aug 31, 2017

0.56.3

Aug 18, 2017

0.56.2

Aug 17, 2017

0.56.1

Aug 17, 2017

0.56.0

Aug 17, 2017

0.55.0

Aug 17, 2017

0.54.1

Aug 17, 2017

0.54.0

Aug 16, 2017

0.53.0

Aug 16, 2017

0.52.3

Aug 12, 2017

0.52.2

Aug 11, 2017

0.52.1

Aug 9, 2017

0.51.0

May 1, 2017

0.50.0

Apr 12, 2017

0.49.2

Apr 12, 2017

0.47.0

Apr 4, 2017

0.46.1

Apr 4, 2017

0.46.0

Apr 4, 2017

0.45.1

Apr 4, 2017

0.45.0

Apr 1, 2017

0.44.0

Apr 1, 2017

0.43.1

Mar 24, 2017

0.43.0

Mar 24, 2017

0.42.0

Mar 24, 2017

0.41.0

Mar 18, 2017

0.40.1

Mar 16, 2017

0.39.0

Mar 11, 2017

0.38.0

Mar 10, 2017

0.37.0

Sep 29, 2016

0.36.0

Sep 29, 2016

0.35.3

Jul 26, 2016

0.35.1

Jul 22, 2016

0.35.0

Jul 19, 2016

0.33.0

Jul 1, 2016

0.32.1

Jun 16, 2016

0.30.0

Jun 15, 2016

0.29.0

Jun 6, 2016

Download files

Source Distribution

riko-0.67.0.tar.gz (645.5 kB)

Uploaded Dec 28, 2021 Source

Built Distribution

riko-0.67.0-py2.py3-none-any.whl (686.4 kB)

Uploaded Dec 28, 2021 Python 2, Python 3

File details

Details for the file riko-0.67.0.tar.gz.

File metadata

Download URL: riko-0.67.0.tar.gz
Upload date: Dec 28, 2021
Size: 645.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for riko-0.67.0.tar.gz
Algorithm	Hash digest
SHA256	`498b6706824588f2c9e05290a0da66c2487b439a117426f4fd18fe38477400a7`
MD5	`bc98e1409ba65fa26e1f86e7a64c7f88`
BLAKE2b-256	`a2228f51f14bcd3995fa4093aa094a995f4512e67ead18789df932f3f6f181fd`

File details

Details for the file riko-0.67.0-py2.py3-none-any.whl.

File metadata

Download URL: riko-0.67.0-py2.py3-none-any.whl
Upload date: Dec 28, 2021
Size: 686.4 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for riko-0.67.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`2a8642287038478149da22ac6498be08780232eb7ecce001e1e1831a1ee75891`
MD5	`e8664d5044c35854c65588afcfd90456`
BLAKE2b-256	`d6548b027f7925b39c5e5ab45287f9f44a8f7a99e4c0f6390844c50bee3e4d1a`
