A Pythonic query language for time series data
Project description
TimeSeriesQL
A Pythonic query language for time series data
Table of Contents
About The Project
There are many time series databases and each have their own query language. Each platform takes time to invest in learning the structure and keywords of that language and often the skills learned don't translate to other platforms. The goal of this project is to create a time series specific library that can be used across many different time series databases as well as easy to learn because it uses Python syntax.
Built With
Getting Started
To get a local copy up and running follow these simple steps.
Prerequisites
The requirements are in the requirements.txt file.
Installation
pip
pip install timeseriesql
manual
- Clone the timeseriesql
git clone https:://github.com/mbeale/timeseriesql.git
- Install library
cd timeseriesql
python setup.py install
Usage
The way this project works is to provide a general framework for querying a time series with pluggable backends that communicate with specific time series databases. The queries are created using Python generators, a formatt familiar to Pythonistas.
data = Query(x for x in "metric.name" if x.some_label = "some_value").by("a_label")[start:end:resolution]
The return value is a TimeSeries
object that uses a Numpy array as backend. That object can have
ufuncs
and other numpy functions applied against it. More examples to come.
CSV Backend Usage
Often time series data is loaded from a CSV file. The backend expects the first column to be the time index in either a numerical timestamp or strings in ISO 8601 date or datetime format. The filters are applied to the headers of the CSV. If labels are not in the CSV and are supplied as part of the query, then filters will not be applied.
If any columns are empty or don't contain a numeric value, the value becomes a np.nan
.
Basic CSV Usage
from timeseriesql.backends import CSVBackend
data = CSVBackend(x for x in "path/to.csv")[:]
Basic CSV Filtering
For CSV files the labels are the column headers. If there are columns that are not needed, they can be filtered out.
from timeseriesql.backends import CSVBackend
data = CSVBackend(x for x in "path/to.csv" if x.label == "A")[:]
data = CSVBackend(x for x in "path/to.csv" if x.label != "B")[:]
data = CSVBackend(x for x in "path/to.csv" if x.label in ["B", "C", "G"])[:]
data = CSVBackend(x for x in "path/to.csv" if x.label not in ["B", "C", "G"])[:]
Set the Labels
from timeseriesql.backends import CSVBackend
data = CSVBackend(x for x in "path/to.csv").labels(
[
{"label": "one"},
{"label": "two"},
{"label": "three"},
{"label": "four"},
{"label": "five"},
{"label": "six"},
{"label": "seven"},
]
)[:]
AppOptics Backend Usage
Appoptics is a commercial time series database product. The backend converts a query into an API call.
The backend expects a APPOPTICS_TOKEN
environment variable to be set in order to authenticate to AppOptics.
AppOptics Query
from timeseriesql.backends import AOBackend
data = AOBackend(x for x in "metric.name")[3600:] #basic
data = AOBackend(x * 100 for x in "metric.name")[3600:] #binary operations (+, -, /, *)
data = AOBackend(x * 1.8 + 32 for x in "metric.name")[3600:] #multiple binary operations (°C to °F)
data = AOBackend(x.max for x in "metric.name")[3600:] #get max value
AppOptics Filtering
Currently only ==
is supported.
from timeseriesql.backends import AOBackend
data = AOBackend(x for x in "metric.name" if x.environment == 'production')[3600:]
AppOptics Grouping
from timeseriesql.backends import AOBackend
data = AOBackend(x for x in "metric.name").group('environment')[3600:]
data = AOBackend(x - y for x,y in AOBackend((x.max for x in "metric1"), (x.min for x in "metric2")).by('tag1'))[3600:]
AppOptics Time
from timeseriesql.backends import AOBackend
data = AOBackend(x for x in "metric.name")[:] #no start or end time (not recommended)
data = AOBackend(x for x in "metric.name")[3600:] #from now - 3600 seconds until now, resolution of 1
data = AOBackend(x for x in "metric.name")[3600:1800] #from now - 3600 seconds until now - 1800 seconds, resolution of 1
data = AOBackend(x for x in "metric.name")[3600::300] #from now - 3600 seconds until now resoultion of 300 seconds
TimeSeries Usage
The TimeSeries
object is allows for manipulation of the time series data after the it's been queried from the
backend. There are also helper functions to convert to a pandas DataFrame
and plot using matplotlib.
In the following examples, the variables starting with ts_
are assumed to be queried data from a backend.
TimeSeries Operations
# Basic mathematical operations (+, -, /, *)
ts_1 + 5 # will return a new series
ts_1 += 5 #will perform operation in place
ts_1 += ts_2 #add together two TimeSeries
TimeSeries Time Index
The time index is a array of floats but there is a built in method to convert the floats into np.datetime64
.
ts_1.time # array of floats
ts_1.time.dt #array of np.datetime64
TimeSeries Merging
TimeSeries
objects can be combined but the ending time indexes must be the same. This may require empty values
to be created where the indexes don't align.
new_t = ts_1.merge([ts_2, ts_3])
TimeSeries Grouping/Reducing
If there are multiple streams, they can be grouped and merged by the labels.
reduced = ts_1.group(["hostname", "name"]).add()
reduced = ts_1.group("env").mean()
reduced = ts_1.group("env").mean(axis=None) #setting the access to None will get the mean of the entire object
TimeSeries Special Indexing
import numpy as np
beg = np.datetime64('2019-02-25T03:00')
end = np.datetime64('2019-02-25T04:00')
ts_1[beg:end] # set a time range
ts_1[beg : np.timedelta64(3, "m")] # fetch from beginning + 3 minutes
ts_1[np.timedelta64(3, "m") :] #start from beginning + 3 minutes
ts_1[: np.timedelta64(3, "m")] #end at the end - 3 minutes
ts_1[{"hostname": "host2"}] # by labels
TimeSeries Rolling Windows
The rolling_window
method assumes that the data is filled and at a fixed resolution. Number of periods is
an integer and not a time range.
rolling_cum_sum = ts_1.rolling_window(12).add() #rolling cumsum
rolling_mean = ts_1.rolling_window(12).mean() # rolling mean
rolling = ts_1.rolling_window(12).median() #rolling median
TimeSeries Resample
The resample
method allows a smaller period to be aggregated into a larger period.
resampled = ts_1.resample(300).mean() #resamples to 5 minutes and takes the mean
TimeSeries to Pandas
The conversion returns 2 pandas DataFrames, one for the labels and the other for the data.
data, labels = ts_1.to_pandas()
TimeSeries Matplotlib
There is a helper function that will apply some sane defaults to a plotting function for a TimeSeries object.
ts_1.plot(legend=True)
plt.show()
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature
) - Commit your Changes (
git commit -m 'Add some AmazingFeature'
) - Push to the Branch (
git push origin feature/AmazingFeature
) - Open a Pull Request
License
Distributed under the MIT License. See LICENSE
for more information.
Contact
Michael Beale - michael.beale@gmail.com
Project Link: https://github.com/mbeale/timeseriesql
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for timeseriesql-0.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c0b1e6a9cbd74668675459961886824b6628729e64004ced8e76fc50d3dc0698 |
|
MD5 | 5fdfab03e639153fa3ec32d3c732416a |
|
BLAKE2b-256 | 743fd78d91176b5f6f1fdeebe45773ef1ec9826822b90c0dcb1ad14df914308c |