DataProcessor extends the piping abilities for pandas DataFrame with a DateTimeIndex by wrapping it in a DataProcessor class
Project description
dataprocessor
Introduction
dataprocessor extends the piping abilities for a pandas DataFrame
when the index is a DateTimeIndex
[^1] by wrapping it in a DataProcessor
class. In order to retrieve the piping result, we access the data
member on the DataProcessor
instance.
Operations on the dataframe are chained using ()
and/or []
operators, each with a different meaning. Arguments to ()
will operate on the entire data frame, arguments to []
will seek to summarize the data by distinct, non-overlapping blocks of time[^2].
The argument to the ()
operator can be either:
- A function taking the data frame as its argument and returning a new data frame[^3]
The function can be any function including:
- lambdas
- user defined functions
- member functions[^4]
- An instance of a class derived from
FilterInterface
which implements anapply
function returning a subset on the index of the data frame[^5]
In the case of the latter, the library provides a TimeFreqFilter
with necessary implementation details[^6] as a convenience.
For the former, the nature of the functions is usually to extract, filter, summarize, transform, etc. but any functionality is allowed provided the function returns a pandas DataFrame
[^7].
The []
operator takes 3 arguments
TimeFreqFilter
indicating the interval length to summarize overlist
of functions to apply within each interval on a column of the data frame- the column in the data frame to apply the functions to
Installation
dataprocessor was built using python 3.9
. It is available as a package on pypi and can be installed through pip:
pip install dataprocessor
Should you require an installation of pip, follow the instructions on the pip website.
Examples
The easiest way to understand is to dive in with a series of examples. First let us set up an example DataFrame with the Faker library:
import pandas as pd
import numpy as np
from datetime import datetime
from functools import partial
from dataprocessor.feed_filter import TimeFreqFilter
from dataprocessor.constants import TimePeriod
from dataprocessor.data_processor import DataProcessor
from faker import Faker
fake = Faker()
num_obs = 20000
data = pd.DataFrame(np.random.randn(num_obs*2).reshape(num_obs,2), columns=["Return","Px"], index=[fake.date_time_between_dates(
datetime_start=datetime(2020, 3, 13, 14, 58, 57),
datetime_end=datetime(2020, 3, 20, 14, 58, 57),
tzinfo=None)
for x in range(num_obs)]).sort_index()
data.Px = np.abs(data.Px)+10.0
Assume we have a rolling mean function, but now we wish to apply a 5-period rolling mean to the last observation right before every 15 minute interval:
def rolling_mean(x, col_name, n):
return pd.DataFrame(x[col_name].shift(1).rolling(window=n).mean(), index=x.index)
z = DataProcessor(data)(TimeFreqFilter(TimePeriod.MINUTE, 15))(rolling_mean, col_name="Return", n=5).data
# pp(z.Return['2020-03-13 19:55:49.743080':'2020-03-15 13:00:00.866140'])
Get the observations between 8:30 AM and 4:30 PM at every hour and rename a column
z2 = DataProcessor(data)("between_time", '08:30', '16:30')(TimeFreqFilter(TimePeriod.HOUR, 1))(
lambda x: x.rename(columns={"Return": "RETURN"})).data
On all business days get the difference in price from 15:59 to 16:30.
z3 = DataProcessor(data)("between_time", '15:59', '16:30')(TimeFreqFilter(TimePeriod.BUSINESS_DAY))(
lambda x: x.iloc[-1,x.columns.get_loc("Px")]-x.iloc[0,x.columns.get_loc("Px")])
As an illustration, here are some methods for filtering between two times:
z2 = DataProcessor(data).
between_time('08:30', '16:30').
("between_time", '09:15', '15:30').
(partial(lambda x, y, z: z.loc[x:y], '2020-03-13 08:00', '2020-03-17 08:00')).data
Next, starting at 8:15 AM on 15 Mar 2020, we take summary data for 5 minute intervals consisting of first, max, min, last, median, mean and standard deviation of the Return column. We then rename the columns and keep the intervals with observations.
z2 = DataProcessor(data)[
TimeFreqFilter(TimePeriod.MINUTE, 5, starting=datetime(2020, 3, 15, 8, 15, 0)),
[DataProcessor.first, np.max, np.min, DataProcessor.last, np.median, np.mean, np.std],
"Return"](
lambda x: x.rename(columns={'amax': 'HIGH', 'amin': 'LOW', 'mean': 'MEAN',
'median': 'MEDIAN', 'first': 'OPEN',
'last': 'CLOSE', 'std': 'STD'}))(
lambda x: x[~np.isnan(x.MEAN)]).data
[^1]: Even though the library focuses on a DateTimeIndex, there is nothing stopping users from using the functionality on pandas DataFrames with different indices; including providing their own classes as filters provided they implement an apply
method.
[^2]: The convenience member function summarize_intervals
as syntactic sugar for a call to []
[^3]: The returned data frame will then be the input of any subsequent chaining
[^4]: A member function of the pandas DataFrame can be called by specifying it as a string in order to distinguish it from other functions in the local or global namespace
[^5]: The input frame of any subsequent chaining will be the subset matching the returned indices
[^6]: Such as starting index, whether the indices returned should be the ones right before/after the time intervals or both, etc.
[^7]: The DataProcessor
provides convenience functions for some commonly used operations, but all are syntactic devices.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dataprocessor-0.0.20.tar.gz
.
File metadata
- Download URL: dataprocessor-0.0.20.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fb7943387bba2a0d96ca21d1ced9b0734537cfcad98e6d37a97f80c94c77729d |
|
MD5 | 3bf750909fb431faa2c64f69939ddffd |
|
BLAKE2b-256 | ff927989803dd52e78e8d4f1cf3cada3a38b81039f05d4480a7fcd0736bc91e7 |
File details
Details for the file dataprocessor-0.0.20-py3-none-any.whl
.
File metadata
- Download URL: dataprocessor-0.0.20-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 03ff43320e03d44592f977da4fffd5528e4ffcb9bb13f50fa1e7d8e652693800 |
|
MD5 | dcffe450ca0b360635bc53fd4f1a18d3 |
|
BLAKE2b-256 | 8fb897cb42da6d32dbaf935a74642622d91c4e0ff4e067e62935741004b80368 |