A small library to preprocess diary/transaction data for time series models.

Project description

Time Series Cleaning Package

This repository contains a small Python package, timeseries_cleaner, designed to simplify the transformation of diary or transactional data into a format suitable for time-series forecasting models. The goal of the package is to encapsulate the repetitive data munging steps so you can focus on building and evaluating your models.

Features

  • Flexible column mapping – Configure which columns in your raw DataFrame represent dates, values and identifiers using a PreprocessConfig. This means you can reuse the same functions across different datasets without editing any code.
  • Automatic weekly aggregation – Raw events are aggregated by week, missing weeks are filled with zeros, and the resulting timeline is aligned across entities.
  • Sliding window generation – Fixed-length sequences of past observations (lags) are extracted along with the next value as the target. Windows that contain only zeros or have a missing target are automatically discarded.
  • Demographic merging – Seamlessly join static attributes (e.g. gender, age) to the generated sequences via a single helper.
  • Train/test splitting – Hold out the most recent weeks for evaluation, ensuring that each test window has sufficient history behind it.
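The weekly aggregation and sliding window steps described above can be sketched in plain pandas. This is a minimal illustration and not the package's own implementation: the function names weekly_totals and sliding_windows, the lag_N column names, and the choice of Monday as the start of the week are all assumptions made for the example.

```python
import pandas as pd


def weekly_totals(df, date_col="date", value_col="amount", id_col="id"):
    """Sum values per entity and week; weeks with no events become 0.

    Hypothetical helper, not part of timeseries_cleaner.
    """
    out = df[[id_col, date_col, value_col]].copy()
    # Map every date to the Monday that starts its week (assumed convention)
    out["week"] = pd.to_datetime(out[date_col]).dt.to_period("W").dt.start_time
    weekly = out.pivot_table(index="week", columns=id_col,
                             values=value_col, aggfunc="sum")
    # Reindex onto a gap-free weekly calendar shared by all entities
    calendar = pd.date_range(weekly.index.min(), weekly.index.max(), freq="W-MON")
    return weekly.reindex(calendar).fillna(0.0)


def sliding_windows(weekly, window=6):
    """Turn each entity's weekly series into (lags, target) rows.

    Hypothetical helper, not part of timeseries_cleaner.
    """
    rows = []
    for entity in weekly.columns:
        values = weekly[entity].to_numpy()
        for end in range(window, len(values)):
            lags = values[end - window:end]
            target = values[end]
            # Skip windows with no activity at all, or with a missing target
            if not lags.any() or pd.isna(target):
                continue
            row = {"id": entity, "week": weekly.index[end]}
            # lag_1 is the most recent observation before the target
            row.update({f"lag_{window - i}": lags[i] for i in range(window)})
            row["target"] = target
            rows.append(row)
    return pd.DataFrame(rows)


# Toy data: entity "a" has two events in the first week and none in the second
events = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "date": ["2023-01-02", "2023-01-04", "2023-01-16", "2023-01-09"],
    "amount": [10.0, 5.0, 7.0, 3.0],
})
weekly = weekly_totals(events)
windows = sliding_windows(weekly, window=2)
print(windows)
```

Pivoting to one column per entity is one simple way to align all entities on the same timeline; a long format with a MultiIndex would work equally well.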

Basic Usage

import pandas as pd
from timeseries_cleaner import load_data, preprocess_data, merge_demographics, train_test_split, PreprocessConfig

# Load your data (CSV or Excel). Column names will be normalised to lowercase
df = load_data("Income Report - Mon Apr 3 2023.xlsx", sheet_name="Income Reports")

# Select and rename the relevant columns from the full report
dt = df[[
    "respondent id", "gender", "age", "number of children",
    "marital status", "country of residence", "income report amount",
    "income report date created"
]].rename(columns={
    # "gender" and "age" already have the desired names
    "respondent id": "id",
    "number of children": "children",
    "marital status": "marital",
    "country of residence": "country",
    "income report amount": "amount",
    "income report date created": "date"
})

demographic_cols = ["id", "gender", "age", "children", "marital", "country"]
demographics = dt[demographic_cols].drop_duplicates("id")

config = PreprocessConfig(
    date_col="date",
    value_col="amount",
    id_col="id",
    window=6,
)

# Split into training and testing sets and process each
train, test, full = train_test_split(dt, config=config, weeks_back=3, demographics=demographics)

print(train.head())
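To make the hold-out idea concrete, here is a minimal sketch of splitting a table of windows by target week and then attaching demographics. It does not reproduce the package's train_test_split: the helper split_recent_weeks, the week column, and the reading of weeks_back as "the last N target weeks go to the test set" are assumptions for illustration.

```python
import pandas as pd


def split_recent_weeks(windows, weeks_back=3, week_col="week"):
    """Send rows whose target week is among the last `weeks_back` weeks to test.

    Hypothetical helper, not part of timeseries_cleaner.
    """
    cutoff = windows[week_col].max() - pd.Timedelta(weeks=weeks_back - 1)
    is_test = windows[week_col] >= cutoff
    return windows[~is_test], windows[is_test]


# Toy windows table: one row per (id, target week); lag columns omitted for brevity
weeks = pd.date_range("2023-01-02", periods=8, freq="W-MON")
windows = pd.DataFrame({"id": 1, "week": weeks, "target": range(8)})
demographics = pd.DataFrame({"id": [1], "gender": ["f"], "age": [34]})

train, test = split_recent_weeks(windows, weeks_back=3)

# Static attributes are joined after splitting, one row per id
train = train.merge(demographics, on="id", how="left")
test = test.merge(demographics, on="id", how="left")
print(len(train), len(test))
```

Because the windows are built on the full timeline before splitting, every test row still carries its complete lag history even though its target week is held out.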

Updating the Package

The package is intentionally small and easy to extend. You can add helper functions or modify existing ones by editing the modules inside timeseries_cleaner/. No special tooling is required: the package does not depend on any external libraries beyond pandas, which is installed by default in most data science environments.

If you wish to distribute or install this package into your own projects, consider adding a minimal setup.py or pyproject.toml. For the purposes of this exercise, the files have been arranged so that you can import directly from the local directory without installation.


Source Distributions

No source distribution files available for this release.

Built Distribution

lift_timeseries_cleaner-0.1.0-py3-none-any.whl (11.6 kB)

Uploaded Python 3

File details

Details for the file lift_timeseries_cleaner-0.1.0-py3-none-any.whl.

File hashes

Hashes for lift_timeseries_cleaner-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       3da51836d5170e03e3fe0a9e7217e9ab3941d1e560ed1b944fdadb2844278663
MD5          fd76103b5e42c47e3fa2ebc9d8d00499
BLAKE2b-256  046299ccbdade3f9a85394ceab5b57da0286185344a9511bc4effd8fbf662248
