
Hydro
=====

Hydro is a free and open-source Data API computation and serving framework. It is designed mainly to help web/application servers and other data consumers extract data from different data streams, process it on the fly, and render it to different clients/applications based on criteria and statistics.

```
|-------|
|  DB1  |======
|-------|     =
    .         =           Data API
    .         =       |-----------|
    .         = >     |   HYDRO   |        |---------|
    .         =       |  Extract  |        | APP/Web |        |--------|
===ETL===>    .       | Transform | =====> | Server  | =====> | Client |
    .         = >     |  Render   |        |---------|        |--------|
    .         =       |-----------|
    .         =
|-------|     =
|  DBn  |======
|-------|
```

## Hydro makes it easy to:

1. Consolidate into **one service** the logic for processing different types of inbound streams from [speed and batch](http://lambda-architecture.net/) layers.
2. Optimize data retrieval by applying various optimization and transformation techniques at run time, such as:
    * Sampling.
    * Choosing the data access path (pre-materialized / raw data).
    * Data stream operations: lookups, aggregations, computed columns, etc.
    * Resource allocation per user/query/client/QoS.
3. Create multi-level caching.
4. Reuse and share business logic across different consumers and data streams.
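
As an illustration of the run-time sampling idea in point 2, here is a minimal sketch in plain Python (all names are hypothetical, not Hydro's API): estimate an aggregate from a random fraction of the rows and scale the result back up.

```python
import random

def sampled_sum(rows, key, fraction=0.1, seed=42):
    """Estimate sum(row[key]) from a random sample, scaled by 1/fraction."""
    rng = random.Random(seed)
    sample = [r for r in rows if rng.random() < fraction]
    if not sample:
        return 0.0
    return sum(r[key] for r in sample) / fraction

rows = [{'revenue': 10.0} for _ in range(10_000)]
exact = sum(r['revenue'] for r in rows)
estimate = sampled_sum(rows, 'revenue', fraction=0.1)
print(exact, round(estimate))  # the estimate lands close to the exact sum
```

The payoff is that only a fraction of the stream is scanned, which is the kind of trade-off the optimizer can decide to make when a query spans too much data for interactive use.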

Hydro is built so that data/business logic is separated from data extraction: Hydro can define different data structures to extract data from, **but** apply the same processing logic once the data is fetched.
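
A minimal sketch of that separation in plain Python (all names here are hypothetical stand-ins, not Hydro classes): two extractors yield rows in the same shape, and a single shared processing function works on either.

```python
def extract_from_db(_conn):
    # Stand-in for a SQL-backed stream.
    return [{'country': 'US', 'revenue': 5}, {'country': 'DE', 'revenue': 3}]

def extract_from_file(_path):
    # Stand-in for a file-backed stream with the same row shape.
    return [{'country': 'US', 'revenue': 2}]

def total_revenue_by_country(rows):
    # Shared business logic: agnostic to where the rows came from.
    totals = {}
    for row in rows:
        totals[row['country']] = totals.get(row['country'], 0) + row['revenue']
    return totals

print(total_revenue_by_country(extract_from_db(None)))    # {'US': 5, 'DE': 3}
print(total_revenue_by_country(extract_from_file(None)))  # {'US': 2}
```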


## Building blocks:

**Topology:**

A topology is a definition of the processing logic:

* Defining the input data streams
* Operations on data streams
* Rendering the output

Example:

```
# Define the main stream
main_stream = self.query_engine.get('geo_widget_stream', params)

# Define the lookup stream
lookup_stream = self.query_engine.get('geo_lookup_stream', params, cache_ttl=1)

# Combine the two streams on the user_id field of each
combined = self.transformers.combine(main_stream, lookup_stream, left_on=['user_id'], right_on=['user_id'])

# Aggregate by 'country', summing 'revenue' and 'spend'
aggregated = self.transformers.aggregate(combined, group_by=['country'], operators={'revenue': 'sum', 'spend': 'sum'})

# Create a computed column holding the ROI
aggregated['ROI'] = aggregated.revenue/(aggregated.spend+aggregated.revenue)

# Return the data set
return aggregated
```
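
The `combine` and `aggregate` transformer calls above behave like a relational join and a group-by. A rough stand-alone equivalent using pandas (pandas is an assumption here for illustration, and the sample data is invented):

```python
import pandas as pd

main_stream = pd.DataFrame({'user_id': [1, 2, 3],
                            'revenue': [10.0, 20.0, 30.0],
                            'spend':   [5.0, 5.0, 10.0]})
lookup_stream = pd.DataFrame({'user_id': [1, 2, 3],
                              'country': ['US', 'US', 'DE']})

# combine: join the two streams on user_id
combined = main_stream.merge(lookup_stream, left_on=['user_id'], right_on=['user_id'])

# aggregate: group by country, summing revenue and spend
aggregated = combined.groupby('country', as_index=False).agg({'revenue': 'sum', 'spend': 'sum'})

# computed column: ROI, as in the topology above
aggregated['ROI'] = aggregated.revenue / (aggregated.spend + aggregated.revenue)

print(aggregated)
```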

**Query Engine**

The query engine is responsible for connecting to a data source and extracting data from it.
It uses the Optimizer to determine the access path, data structures, and optimization logic needed to tap the stream.

**Optimizer**

The optimizer is responsible for applying optimization techniques so that a stream is fetched in the most efficient way, based on criteria and statistics. The optimizer returns a plan for the query engine to follow.

Example:

```
# Create a plan object
plan = PlanObject(params, source_id, conf)

# Define the data source and type
plan.data_source = 'vertica-dash'
plan.source_type = Configurator.VERTICA

# Time diff based on input params
time_diff = (plan.TO_DATE - plan.FROM_DATE).total_seconds()

# If the time range is longer than 125 days and the application type is
# dashboard, abort, since dashboard data needs to be fetched quickly
if time_diff > Configurator.SECONDS_IN_DAY * 125 and params['APP_TYPE'].to_string() == 'Dashboard':
    raise HydroException('Time range is too big')

# Otherwise, if the average number of records per day exceeds 1000 or the
# client is convertro, run the sampling logic
elif plan.AVG_RECORDS_PER_DAY > 1000 or params['CLIENT_ID'].to_string() == 'convertro':
    plan.template_file = 'device_grid_widget_sampling.sql'
    plan.sampling = True
    self.logger.debug('Sampling for the query has been turned on')

# Otherwise, run the default logic
else:
    plan.template_file = 'device_grid_widget.sql'

# Return the plan object to the query engine
return plan
```
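To make that decision flow concrete, here is a self-contained sketch with simplified stand-ins for `PlanObject`, `Configurator`, and `HydroException` (these stubs are invented for illustration; the real Hydro classes differ):

```python
from datetime import datetime

class HydroException(Exception):
    pass

class Configurator:
    SECONDS_IN_DAY = 86400

class PlanObject:
    def __init__(self, from_date, to_date, avg_records_per_day):
        self.FROM_DATE = from_date
        self.TO_DATE = to_date
        self.AVG_RECORDS_PER_DAY = avg_records_per_day
        self.sampling = False
        self.template_file = None

def build_plan(plan, app_type, client_id):
    time_diff = (plan.TO_DATE - plan.FROM_DATE).total_seconds()
    # Abort long dashboard queries; they must return quickly.
    if time_diff > Configurator.SECONDS_IN_DAY * 125 and app_type == 'Dashboard':
        raise HydroException('Time range is too big')
    # High-volume streams (or specific clients) get the sampling template.
    elif plan.AVG_RECORDS_PER_DAY > 1000 or client_id == 'convertro':
        plan.template_file = 'device_grid_widget_sampling.sql'
        plan.sampling = True
    # Everything else gets the default template.
    else:
        plan.template_file = 'device_grid_widget.sql'
    return plan

plan = build_plan(PlanObject(datetime(2024, 1, 1), datetime(2024, 1, 31), 5000),
                  app_type='Dashboard', client_id='acme')
print(plan.template_file, plan.sampling)  # sampling branch taken
```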

**Caching**

Hydro uses stream- and topology-based caching to boost performance when the same stream or topology has already been fetched with the same parameters. Streams and topologies can be shared across topologies; in fact, a topology can itself serve as a stream for another topology.
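
A cache like that can be keyed on the stream name plus a canonical form of its parameters, with a TTL per entry. A minimal hypothetical sketch (not Hydro's actual cache implementation):

```python
import time

class StreamCache:
    def __init__(self):
        self._store = {}

    def get(self, stream_name, params, fetch, ttl=60):
        # Key on the stream name and a canonical (sorted) view of the params.
        key = (stream_name, tuple(sorted(params.items())))
        hit = self._store.get(key)
        if hit is not None and time.time() - hit[0] < ttl:
            return hit[1]            # cache hit: skip the fetch
        value = fetch()              # cache miss: fetch and remember
        self._store[key] = (time.time(), value)
        return value

calls = []
def fetch():
    calls.append(1)
    return [{'country': 'US', 'revenue': 10}]

cache = StreamCache()
a = cache.get('geo_widget_stream', {'client': 'acme'}, fetch)
b = cache.get('geo_widget_stream', {'client': 'acme'}, fetch)
print(len(calls))  # the underlying fetch ran only once
```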


## How to use:
Using Hydro usually involves the following steps:

1. Install Hydro with `pip install hydro`.
2. Generate a topology with `hydro_cli scaffold [dir_name] [TopologyName]`.
3. Edit the generated files, filling in the logic and queries.
4. Invoke Hydro locally or remotely.


## Contributing
We are accepting pull requests.

To set up a development environment, clone this project and run `pip install -r requirements.txt`.
We strongly recommend using virtualenv to keep the dependencies from polluting the system's Python installation.
