Hydro

On-the-fly data manipulation framework

Hydro
=====

Hydro is a free and open source Data API computation and serving framework. It is designed mainly to help web/application servers and other data consumers extract data from different data streams, process it on the fly, and render it to different clients/applications based on criteria and statistics.

```

             |-------|
             |  DB1  |======
             |-------|     =
                 .         =                            Data API
                           =      |-----------|
                 .         =  >   | HYDRO -   |        |---------|
                           =      | Extract   |        | APP/Web |        |--------|
===ETL===>       .         =      | Transform | =====> | Server  | =====> | Client |
                           =  >   | Render    |        |---------|        |--------|
                 .         =      |-----------|
                           =
             |-------|     =
             |  DBn  |======
             |-------|

```

## Hydro makes it easy to:

1. Consolidate into **one service** the logic for processing different types of inbound streams from [speed and batch](http://lambda-architecture.net/) layers.
2. Optimize data retrieval by applying various optimization and transformation techniques at run time, such as:
    * Sampling.
    * Deciding on the data access path (pre-materialized / raw data).
    * Data stream operations: lookups, aggregations, computed columns, etc.
    * Resource allocation per user/query/client/QoS.
3. Create multi-level caching.
4. Reuse and share business logic across different consumers and data streams.

Hydro separates data/business logic from data extraction. This lets Hydro define different data structures to extract data from **but** apply the same processing logic once the data is fetched.
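As a rough illustration of this separation, the sketch below runs one processing function over rows coming from two different extraction paths. It is plain Python, not Hydro's API: `fetch_raw`, `fetch_materialized`, and `revenue_by_country` are hypothetical names, and the rows are made-up sample data.

```python
from collections import defaultdict


# Hypothetical extraction paths. Both return rows with the same fields,
# but one reads raw events and the other a pre-materialized summary.
def fetch_raw():
    return [
        {"country": "US", "revenue": 10.0},
        {"country": "US", "revenue": 5.0},
        {"country": "DE", "revenue": 7.0},
    ]


def fetch_materialized():
    return [
        {"country": "US", "revenue": 15.0},
        {"country": "DE", "revenue": 7.0},
    ]


# Shared processing logic: it does not care where the rows came from.
def revenue_by_country(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["revenue"]
    return dict(totals)


raw_result = revenue_by_country(fetch_raw())
materialized_result = revenue_by_country(fetch_materialized())
```

Both calls yield the same totals, so the choice of extraction path can change without touching the processing step.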

## Building blocks:

**Topology:**

A topology is a definition of processing logic:

* Defining the input data streams
* Operations on data streams
* Rendering the output

Example:

```
# Define the main stream
main_stream = self.query_engine.get('geo_widget_stream', params)

# Define the lookup stream
lookup_stream = self.query_engine.get('geo_lookup_stream', params, cache_ttl=1)

# Combine the two streams on the user_id field, which exists in both streams
combined = self.transformers.combine(main_stream, lookup_stream, left_on=['user_id'], right_on=['user_id'])

# Aggregate by 'country', summing 'revenue' and 'spend'
aggregated = self.transformers.aggregate(combined, group_by=['country'], operators={'revenue': 'sum', 'spend': 'sum'})

# Create a computed column for ROI
aggregated['ROI'] = aggregated.revenue/(aggregated.spend+aggregated.revenue)

# Return the data set
return aggregated
```
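To make the combine and aggregate steps concrete, here is a minimal, self-contained sketch in plain Python. It is not Hydro's implementation: the `combine` and `aggregate` functions below are stand-ins written for illustration, the join is assumed to be an inner join (the example above does not say which join type Hydro uses), only the `'sum'` operator is handled, and the rows are invented sample data.

```python
from collections import defaultdict


def combine(left, right, left_on, right_on):
    """Inner-join two lists of dict rows on the given key fields (assumed join type)."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in right_on)].append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in left_on), []):
            joined.append({**match, **row})
    return joined


def aggregate(rows, group_by, operators):
    """Group rows and apply one operator per column; only 'sum' is supported here."""
    groups = defaultdict(lambda: defaultdict(float))
    for row in rows:
        key = tuple(row[k] for k in group_by)
        for column, op in operators.items():
            if op != "sum":
                raise ValueError("unsupported operator: " + op)
            groups[key][column] += row[column]
    return [
        {**dict(zip(group_by, key)), **dict(totals)}
        for key, totals in groups.items()
    ]


# Invented sample data standing in for the two streams.
main_stream = [
    {"user_id": 1, "revenue": 30.0, "spend": 10.0},
    {"user_id": 2, "revenue": 20.0, "spend": 20.0},
    {"user_id": 3, "revenue": 10.0, "spend": 30.0},
]
lookup_stream = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "US"},
    {"user_id": 3, "country": "DE"},
]

combined = combine(main_stream, lookup_stream, left_on=["user_id"], right_on=["user_id"])
aggregated = aggregate(combined, group_by=["country"], operators={"revenue": "sum", "spend": "sum"})
for row in aggregated:
    # Same computed column as in the topology example above.
    row["ROI"] = row["revenue"] / (row["spend"] + row["revenue"])
```

With this sample data, the US group sums to 50.0 revenue and 30.0 spend, and the DE group to 10.0 revenue and 30.0 spend.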

**Query Engine**

The Query Engine is responsible for connecting to a data source and extracting data from it.
It uses the Optimizer to determine the access path, data structures, and optimization logic needed to tap the stream.

**Optimizer**

The Optimizer is responsible for applying optimization techniques so that a stream is fetched in the most efficient way, based on criteria and statistics. The Optimizer returns a plan for the Query Engine to follow.

Example:

```
# Create a plan object
plan = PlanObject(params, source_id, conf)
# Define the data source and its type
plan.data_source = 'vertica-dash'
plan.source_type = Configurator.VERTICA

# Time difference based on the input params
time_diff = (plan.TO_DATE - plan.FROM_DATE).total_seconds()

# If the time range is bigger than 125 days and the application type is dashboard, abort,
# since the data needs to be fetched quickly
if time_diff > Configurator.SECONDS_IN_DAY*125 and params['APP_TYPE'].to_string() == 'Dashboard':
    raise HydroException('Time range is too big')

# Else, if average records per day is bigger than 1000 or the client is convertro, run the sampling logic
elif plan.AVG_RECORDS_PER_DAY > 1000 or params['CLIENT_ID'].to_string() == 'convertro':
    plan.template_file = 'device_grid_widget_sampling.sql'
    plan.sampling = True
    self.logger.debug('Sampling for the query has been turned on')

# Else, run the default logic
else:
    plan.template_file = 'device_grid_widget.sql'

# Return the plan object to the query engine
return plan

```
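The same decision rules can be tried in isolation. The sketch below restates them as a standalone function so they can be run and tested without Hydro installed. The names `Plan`, `choose_plan`, and `MAX_DASHBOARD_DAYS` are hypothetical, and `ValueError` stands in for `HydroException`; none of this is Hydro's own API.

```python
from dataclasses import dataclass
from datetime import datetime

SECONDS_IN_DAY = 24 * 60 * 60
MAX_DASHBOARD_DAYS = 125  # threshold taken from the example above


@dataclass
class Plan:
    """Hypothetical, reduced stand-in for a plan object."""
    template_file: str
    sampling: bool = False


def choose_plan(from_date, to_date, app_type, avg_records_per_day, client_id):
    time_diff = (to_date - from_date).total_seconds()
    # Dashboards must respond quickly, so very long ranges are rejected.
    if time_diff > SECONDS_IN_DAY * MAX_DASHBOARD_DAYS and app_type == "Dashboard":
        raise ValueError("Time range is too big")
    # Heavy streams (or this specific client) are served from the sampling query.
    if avg_records_per_day > 1000 or client_id == "convertro":
        return Plan("device_grid_widget_sampling.sql", sampling=True)
    # Everything else uses the regular query.
    return Plan("device_grid_widget.sql")


# "acme" is a made-up client id used only for this example.
small = choose_plan(datetime(2016, 1, 1), datetime(2016, 1, 8), "Dashboard", 200, "acme")
large = choose_plan(datetime(2016, 1, 1), datetime(2016, 1, 8), "Dashboard", 5000, "acme")
```

Keeping the rules in a side-effect-free function like this is one way to unit test plan selection separately from the data source.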

**Caching**

Hydro uses stream-based and topology-based caching to boost performance when the same stream/topology and parameters were fetched before. Streams and topologies can be shared across topologies; in fact, a topology can itself serve as a stream for another topology.
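A minimal sketch of the idea, assuming a cache keyed by stream name plus parameters with a time-to-live per entry. `StreamCache`, `get_or_fetch`, and `ttl_seconds` are hypothetical names; how Hydro actually builds cache keys, and the unit of its `cache_ttl` argument, are not stated here.

```python
import json
import time


class StreamCache:
    """Hypothetical TTL cache keyed by stream name and parameters."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    @staticmethod
    def _key(name, params):
        # Sorting keys makes equal parameter dicts produce equal cache keys.
        return name + ":" + json.dumps(params, sort_keys=True)

    def get_or_fetch(self, name, params, fetch, ttl_seconds):
        key = self._key(name, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: serve from cache
        value = fetch(params)  # missing or expired: fetch again
        self._entries[key] = (now + ttl_seconds, value)
        return value


# Usage with a fake clock so expiry can be shown without sleeping.
calls = []


def fetch(params):
    calls.append(params)
    return [{"country": "US"}]


now = [0.0]
cache = StreamCache(clock=lambda: now[0])
cache.get_or_fetch("geo_lookup_stream", {"client": "a"}, fetch, ttl_seconds=60)
cache.get_or_fetch("geo_lookup_stream", {"client": "a"}, fetch, ttl_seconds=60)  # cache hit
now[0] = 61.0
cache.get_or_fetch("geo_lookup_stream", {"client": "a"}, fetch, ttl_seconds=60)  # expired, refetched
```

Because the key includes the parameters, two requests for the same stream with different parameters are cached independently.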

## How to use:
Using Hydro usually involves the following steps:

1. Installation with `pip install hydro`
2. Generating a topology with `hydro_cli scaffold [dir_name] [TopologyName]`
3. Editing the generated files and filling in the logic and queries.
4. Invoking Hydro locally or remotely.

## Contributing
We are accepting pull requests.

To set up a development environment, clone this project and run `pip install -r requirements.txt`.
We strongly recommend using virtualenv so that the dependencies do not pollute the system's Python installation.

## Release history

* 0.1.7 (this version): Mar 22, 2016
* 0.1.6: Oct 12, 2015
* 0.1.5: Aug 31, 2015
* 0.1.4: Aug 9, 2015
* 0.1.3: Jul 27, 2015
* 0.1.2: Jul 21, 2015
* 0.1.1: Apr 19, 2015
* 0.1.0: Feb 24, 2015

