Skip to main content

Event analytics. Very fast. Will eventually be merged into Quokka

Project description

Joey

Joey is an ultra-fast embedded Python library for complex pattern recognition on time series data. Its API is based on the Pattern Query Language, a new query language that closely resembles Elastic EQL sequence. Assuming you have stock prices in a Polars DataFrame with columns is_local_bottom, is_local_top, timestamp and close, it lets you define a pattern like this to find all ascending triangles patterns.

ascending_triangles_conditions = [('a', "a.is_local_bottom"), # first bottom
('b', """b.is_local_top and b.close > a.close * UPPER"""), # first top
('c', """c.is_local_bottom and c.close < b.close * LOWER and c.close > a.close * UPPER"""), # second bottom, must be higher than first bottom
('d', """d.is_local_top and d.close > c.close * UPPER and abs(d.close / b.close) < UPPER"""), # second top, must be similar to first top
('e', """e.is_local_bottom and e.close < d.close * LOWER and e.close > (c.close - a.close) / (c.timestamp - a.timestamp) * (e.timestamp - a.timestamp) + a.close"""), # third bottom, didn't break support
('f', """f.close > d.close * UPPER""") #breakout resistance
]

Existing systems like SQL Match Recognize lets you do something like this, but there is no library that supports this functionality inside your own program. Joey fills this gap. It abides by the header-only paradigm of C++ development -- you can just take the Python functions contained in this repo, nfa_cep, nfa_interval_cep and vector_interval_cep and use them in your own code. They depend on some utility functions in utils.py. You can also package it up to be a Python library at your own leisure.

API

The API is similar in spirit to SQL Match Recognize, Splunk transaction and Elastic EQL sequence. It is very simple. Let's say you have minutely OHLC data in a Polars DataFrame like this:

>>> data
shape: (96_666, 7)
┌───────────┬────────────┬────────────┬────────────┬────────────┬─────────────────┬──────────────┐
│ row_count ┆ min_close  ┆ max_close  ┆ timestamp  ┆ close      ┆ is_local_bottom ┆ is_local_top │
│ ---       ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---             ┆ ---          │
│ i64       ┆ f32        ┆ f32        ┆ u64        ┆ f32        ┆ bool            ┆ bool         │
╞═══════════╪════════════╪════════════╪════════════╪════════════╪═════════════════╪══════════════╡
│ 0         ┆ 314.25     ┆ 314.720001 ┆ 1609718400 ┆ 314.670013 ┆ false           ┆ false        │
│ 1         ┆ 313.850006 ┆ 314.720001 ┆ 1609718460 ┆ 314.720001 ┆ false           ┆ true         │
│ 2         ┆ 313.820007 ┆ 314.720001 ┆ 1609718520 ┆ 314.470001 ┆ false           ┆ false        │
│ 3         ┆ 313.649994 ┆ 314.720001 ┆ 1609718580 ┆ 314.26001  ┆ false           ┆ false        │
│ …         ┆ …          ┆ …          ┆ …          ┆ …          ┆ …               ┆ …            │

We could detect all ascending triangles that happen within 7200 seconds as follows:

nfa_cep(data, ascending_triangle_conditions, "timestamp", 7200, by = None, fix = "end")
  • data must be a Polars DataFrame.
  • ascending_triangle_conditions is the list of conditions listed above.
  • We then specify the timestamp column timestamp, which must be of integer type (Int32, Int64, UInt32, UInt64). If you have a Datetime column, you could convert it using the epoch time conversions in Polars. data must be presorted on this column.
  • 7200 denotes the time window the pattern must occur.
  • If your data contains multiple groups (e.g. stocks) and you want to find patterns that occur in each group, you can optionally provide the by argument. data must be then presorted by the timestamp column within each group.
  • fix gives two options. start means we will find at least one pattern for each starting row. end means we will find at least one pattern for every ending row. SQL Match Recognize typically adopts start while real feature engineering workloads typically would prefer end, since you want to know whether or not a pattern has occurred with the current row as the end.

A few things to note:

  1. vector_interval_cep and nfa_interval_cep have the exact same API by design.
  2. The conditions are specified by a list of tuples, which specify a list of events that must occur in sequence in the pattern. The first element of each tuple is a name of the event. The second element of the tuple is a SQL predicate following SQLite syntax. The tuple can only contain columns from current events and previous events. It must not contain columns from future events. You can also rewrite such dependencies by just changing the predicate of the future event. All columns must be qualified by the table name. Only the predicate of the first event can be None.

Examples

Check out the included cep.py for some analysis you can do on minutely data of one symbol, daily data of different symbols, and MBO data.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

pyjoey-0.2.4-py3-none-any.whl (36.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page