
dataprocessor extends the piping abilities of a pandas DataFrame with a DatetimeIndex by wrapping it in a DataProcessor class.

Project description

dataprocessor

Introduction

dataprocessor extends the piping abilities of a pandas DataFrame whose index is a DatetimeIndex[^1] by wrapping it in a DataProcessor class. To retrieve the result of a pipeline, access the data member of the DataProcessor instance.

Operations on the data frame are chained using the () and [] operators, which have different meanings. Arguments to () operate on the entire data frame, whereas arguments to [] summarize the data over distinct, non-overlapping blocks of time[^2].

The argument to the () operator can be either:

  • A function taking the data frame as its argument and returning a new data frame[^3]. The function can be any function, including:
    • lambdas
    • user defined functions
    • member functions[^4]
  • An instance of a class derived from FilterInterface which implements an apply function returning a subset on the index of the data frame[^5]

For the latter, the library provides a TimeFreqFilter with the necessary implementation details[^6] as a convenience. For the former, the functions typically extract, filter, summarize, or transform the data, but any functionality is allowed provided the function returns a pandas DataFrame[^7].
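The three kinds of argument to () can be illustrated with a small stand-in class. This is a minimal sketch of the chaining idea only, not the library's implementation: MiniProcessor and EveryNthFilter are hypothetical names introduced here, and the dispatch rules are assumptions based on the description above.

```python
import pandas as pd


class EveryNthFilter:
    """Hypothetical filter (not part of dataprocessor): keeps every n-th row."""

    def __init__(self, n):
        self.n = n

    def apply(self, frame):
        # A filter returns a subset of the frame's index.
        return frame.index[:: self.n]


class MiniProcessor:
    """Illustrative stand-in for the chaining idea; not the library's code."""

    def __init__(self, data):
        self.data = data

    def __call__(self, op, *args, **kwargs):
        if isinstance(op, str):
            # String: look up a DataFrame member function by name.
            result = getattr(self.data, op)(*args, **kwargs)
        elif hasattr(op, "apply"):
            # Filter object: keep only the rows whose index it returns.
            result = self.data.loc[op.apply(self.data)]
        else:
            # Plain callable: receives the frame as its first argument.
            result = op(self.data, *args, **kwargs)
        # Wrap the result again so that calls can be chained.
        return MiniProcessor(result)


idx = pd.date_range("2020-03-13 09:00", periods=6, freq="min")
frame = pd.DataFrame({"Px": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

out = (
    MiniProcessor(frame)
    (EveryNthFilter(2))                      # filter object
    ("rename", columns={"Px": "PRICE"})      # member function named by string
    (lambda x: x * 2)                        # plain function
    .data
)
print(out)
```

Each call returns a new wrapper, so the output of one step becomes the input of the next, and the final frame is read from the data member.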

The [] operator takes three arguments:

  • TimeFreqFilter indicating the interval length to summarize over
  • list of functions to apply within each interval on a column of the data frame
  • the column in the data frame to apply the functions to
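In plain pandas, a comparable summary can be sketched with resample and agg. The helper name summarize and its argument order are hypothetical, chosen to mirror the three arguments listed above; how the library computes its intervals internally is not shown here.

```python
import numpy as np
import pandas as pd


def summarize(frame, freq, funcs, column):
    """Hypothetical helper: apply each function to `column` within every `freq` block."""
    return frame[column].resample(freq).agg(funcs)


# Ten one-minute observations starting at 08:15, with values 0..9.
idx = pd.date_range("2020-03-15 08:15", periods=10, freq="min")
frame = pd.DataFrame({"Return": np.arange(10, dtype=float)}, index=idx)

# Two non-overlapping 5-minute blocks, one output column per function.
summary = summarize(frame, "5min", ["first", "max", "min", "last", "mean"], "Return")
print(summary)
```

The result has one row per interval and one column per function, which is the shape the later renaming example relies on.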

Installation

dataprocessor was built using Python 3.9. It is available as a package on PyPI and can be installed through pip:

pip install dataprocessor

Should you require an installation of pip, follow the instructions on the pip website.

Examples

The easiest way to understand the library is to dive in with a series of examples. First, let us set up an example DataFrame with the Faker library:

import pandas as pd
import numpy as np
from datetime import datetime
from functools import partial
from dataprocessor.feed_filter import TimeFreqFilter
from dataprocessor.constants import TimePeriod
from dataprocessor.data_processor import DataProcessor
from faker import Faker

fake = Faker()

num_obs = 20000
data = pd.DataFrame(np.random.randn(num_obs*2).reshape(num_obs,2), columns=["Return","Px"], index=[fake.date_time_between_dates(
    datetime_start=datetime(2020, 3, 13, 14, 58, 57), 
    datetime_end=datetime(2020, 3, 20, 14, 58, 57), 
    tzinfo=None)
    for x in range(num_obs)]).sort_index()
data.Px = np.abs(data.Px)+10.0
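If Faker is not installed, a frame with the same shape can be built with NumPy and pandas alone. This is an alternative sketch, not part of the original example: the name alt_data and the fixed seed are assumptions made here for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
num_obs = 20000
start = pd.Timestamp("2020-03-13 14:58:57")
end = pd.Timestamp("2020-03-20 14:58:57")

# Draw random offsets (in seconds) inside the one-week window.
offsets = rng.uniform(0, (end - start).total_seconds(), size=num_obs)
index = pd.DatetimeIndex(start + pd.to_timedelta(offsets, unit="s"))

alt_data = pd.DataFrame(
    rng.standard_normal((num_obs, 2)), columns=["Return", "Px"], index=index
).sort_index()
# Keep prices positive, as in the Faker-based setup.
alt_data["Px"] = alt_data["Px"].abs() + 10.0
```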

Assume we have a rolling mean function and wish to apply a 5-period rolling mean to the last observations right before every 15-minute interval:

def rolling_mean(x, col_name, n):
    return pd.DataFrame(x[col_name].shift(1).rolling(window=n).mean(), index=x.index)
z = DataProcessor(data)(TimeFreqFilter(TimePeriod.MINUTE, 15))(rolling_mean, col_name="Return", n=5).data
# pp(z.Return['2020-03-13 19:55:49.743080':'2020-03-15 13:00:00.866140'])
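For comparison, the same two steps can be approximated in plain pandas on a small deterministic frame. This sketch assumes that taking the last row of each 15-minute resample bin selects roughly the same observations as the TimeFreqFilter; the filter's exact boundary handling may differ.

```python
import numpy as np
import pandas as pd

# 120 one-minute observations with values 0..119.
idx = pd.date_range("2020-03-13 09:00", periods=120, freq="min")
frame = pd.DataFrame({"Return": np.arange(120, dtype=float)}, index=idx)

# Step 1: last observation inside each 15-minute block (empty blocks dropped).
sampled = frame.resample("15min").last().dropna()

# Step 2: mean of the 5 previous sampled values; shift(1) excludes the current one.
rolled = sampled["Return"].shift(1).rolling(window=5).mean()
print(rolled)
```

Because of the shift and the 5-period window, the first five sampled rows have no rolling mean and come out as NaN.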

Get the observations between 8:30 AM and 4:30 PM at every hour and rename a column:

z2 = DataProcessor(data)("between_time", '08:30', '16:30')(TimeFreqFilter(TimePeriod.HOUR, 1))(
    lambda x: x.rename(columns={"Return": "RETURN"})).data

On all business days, get the difference in price from 15:59 to 16:30:

z3 = DataProcessor(data)("between_time", '15:59', '16:30')(TimeFreqFilter(TimePeriod.BUSINESS_DAY))(
    lambda x: x.iloc[-1,x.columns.get_loc("Px")]-x.iloc[0,x.columns.get_loc("Px")])
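The same calculation can be sketched in plain pandas on a hand-made frame, which makes the expected result easy to check. This is an illustration under stated assumptions: business days are taken to be Monday through Friday with holidays ignored, and the variable names are introduced here.

```python
import pandas as pd

idx = pd.to_datetime([
    "2020-03-16 15:58", "2020-03-16 16:00", "2020-03-16 16:20", "2020-03-16 16:45",
    "2020-03-17 15:59", "2020-03-17 16:30",
    "2020-03-21 16:00", "2020-03-21 16:10",  # a Saturday
])
frame = pd.DataFrame(
    {"Px": [10.0, 10.5, 11.5, 12.0, 20.0, 19.0, 30.0, 31.0]}, index=idx
)

# Keep rows inside the time window (both ends inclusive by default).
window = frame.between_time("15:59", "16:30")
# Keep weekdays only: Monday=0 ... Friday=4.
window = window[window.index.dayofweek < 5]
# Per calendar day: last price in the window minus first price in the window.
diff = window.groupby(window.index.normalize())["Px"].agg(
    lambda s: s.iloc[-1] - s.iloc[0]
)
print(diff)
```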

As an illustration, here are some methods for filtering between two times:

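z2 = (
    DataProcessor(data)
    .between_time('08:30', '16:30')     # convenience member function
    ("between_time", '09:15', '15:30')  # DataFrame member function named by string
    # partial binds x and y to the two timestamps; z receives the data frame
    (partial(lambda x, y, z: z.loc[x:y], '2020-03-13 08:00', '2020-03-17 08:00'))
    .data
)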

Next, starting at 8:15 AM on 15 Mar 2020, we take summary data for 5 minute intervals consisting of first, max, min, last, median, mean and standard deviation of the Return column. We then rename the columns and keep the intervals with observations.

z2 = DataProcessor(data)[
        TimeFreqFilter(TimePeriod.MINUTE, 5, starting=datetime(2020, 3, 15, 8, 15, 0)),
        [DataProcessor.first, np.max, np.min, DataProcessor.last, np.median, np.mean, np.std], 
        "Return"](
    lambda x: x.rename(columns={'amax': 'HIGH', 'amin': 'LOW', 'mean': 'MEAN',
                                 'median': 'MEDIAN', 'first': 'OPEN', 
                                'last': 'CLOSE', 'std': 'STD'}))(
    lambda x: x[~np.isnan(x.MEAN)]).data

[^1]: Even though the library focuses on a DatetimeIndex, nothing stops users from applying the functionality to pandas DataFrames with other index types, or from providing their own filter classes, provided these implement an apply method.

[^2]: The convenience member function summarize_intervals is syntactic sugar for a call to [].

[^3]: The returned data frame then becomes the input to any subsequent operation in the chain.

[^4]: A member function of the pandas DataFrame can be called by passing its name as a string, which distinguishes it from other functions in the local or global namespace.

[^5]: The input frame of any subsequent operation in the chain will be the subset matching the returned indices.

[^6]: These include the starting index and whether the returned indices should be the ones right before the time intervals, right after them, or both.

[^7]: The DataProcessor provides convenience functions for some commonly used operations, but all are syntactic devices.

License

MIT

