A small library to preprocess diary/transaction data for time series models.

Project description

Time Series Cleaning Package

This repository contains a small Python package, timeseries_cleaner, that simplifies transforming diary or transactional data into a format suitable for time-series forecasting models. It encapsulates the repetitive data-munging steps so you can focus on building and evaluating your models.

Features

  • Flexible column mapping – Configure which columns in your raw DataFrame represent dates, values and identifiers using a PreprocessConfig. This means you can reuse the same functions across different datasets without editing any code.
  • Automatic weekly aggregation – Raw events are aggregated by week, missing weeks are filled with zeros and the resulting timeline is aligned across entities.
  • Sliding window generation – Fixed length sequences of past observations (lags) are extracted along with the next value as the target. Windows containing all zeros or missing targets are automatically discarded.
  • Demographic merging – Seamlessly join static attributes (e.g. gender, age) to the generated sequences via a single helper.
  • Train/test splitting – Hold out the most recent weeks for evaluation, ensuring that each test window has sufficient history behind it.
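To make the aggregation and windowing features concrete, here is a minimal sketch in plain pandas. It is not the package's implementation: the helper names weekly_totals and make_windows are hypothetical, weeks are assumed to start on Monday, and "all zeros" is interpreted here as "every lag in the window is zero".

```python
import pandas as pd


def weekly_totals(df, id_col="id", date_col="date", value_col="amount"):
    """Sum values per entity per week; weeks with no events become 0."""
    df = df.copy()
    # Map each timestamp to the Monday that starts its week.
    df["week"] = pd.to_datetime(df[date_col]).dt.to_period("W").dt.start_time
    totals = df.groupby([id_col, "week"])[value_col].sum().unstack(id_col)
    # Reindex onto a gap-free weekly timeline shared by every entity.
    all_weeks = pd.date_range(totals.index.min(), totals.index.max(), freq="7D")
    return totals.reindex(all_weeks).fillna(0.0)


def make_windows(series, window):
    """Return (lags, target) pairs, skipping all-zero lags and missing targets."""
    values = series.to_numpy(dtype=float)
    pairs = []
    for end in range(window, len(values)):
        lags, target = values[end - window:end], values[end]
        if (lags == 0).all() or pd.isna(target):
            continue
        pairs.append((lags.tolist(), float(target)))
    return pairs


# Hypothetical raw events for two entities.
raw = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "date": ["2023-01-02", "2023-01-04", "2023-01-23", "2023-01-10"],
    "amount": [10, 5, 7, 3],
})
weekly = weekly_totals(raw)
windows_a = make_windows(weekly["a"], window=2)
```

Here entity "a" has events in the first and fourth weeks only, so the two weeks in between are filled with zeros, and the window whose lags are both zero is dropped.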

Basic Usage

import pandas as pd
from timeseries_cleaner import load_data, preprocess_data, merge_demographics, train_test_split, PreprocessConfig

# Load your data (CSV or Excel). Column names will be normalised to lowercase.
df = load_data("Income Report - Mon Apr 3 2023.xlsx", sheet_name="Income Reports")

# Select and rename the relevant columns from the full report
dt = df[[
    "respondent id", "gender", "age", "number of children",
    "marital status", "country of residence", "income report amount",
    "income report date created"
]].rename(columns={
    "respondent id": "id",
    "gender": "gender",
    "age": "age",
    "number of children": "children",
    "marital status": "marital",
    "country of residence": "country",
    "income report amount": "amount",
    "income report date created": "date"
})

demographic_cols = ["id", "gender", "age", "children", "marital", "country"]
demographics = dt[demographic_cols].drop_duplicates("id")

config = PreprocessConfig(
    date_col="date",
    value_col="amount",
    id_col="id",
    window=6,
)

# Split into training and testing sets and process each
train, test, full = train_test_split(dt, config=config, weeks_back=3, demographics=demographics)

print(train.head())
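For intuition about what a hold-out of the most recent weeks with window=6 and weeks_back=3 could look like, the following sketch splits lag windows by the position of their target week and then joins static attributes. It is an illustration under assumptions, not the package's code: split_windows, the lag_N column names and the toy data are all hypothetical, and test windows are assumed to be allowed to draw their lags from the training period.

```python
import pandas as pd


def split_windows(weekly, window, weeks_back):
    """Build lag windows per entity; targets in the last weeks go to test.

    `weekly` has one row per week and one column per entity.
    """
    cutoff = len(weekly) - weeks_back  # position of the first held-out week
    lag_cols = [f"lag_{i}" for i in range(window, 0, -1)]  # lag_1 = most recent
    train_rows, test_rows = [], []
    for entity in weekly.columns:
        values = weekly[entity].tolist()
        for end in range(window, len(values)):
            row = dict(zip(lag_cols, values[end - window:end]))
            row.update({"id": entity, "target": values[end]})
            (test_rows if end >= cutoff else train_rows).append(row)
    return pd.DataFrame(train_rows), pd.DataFrame(test_rows)


# Hypothetical weekly totals (10 weeks) and static attributes.
weekly = pd.DataFrame({
    "a": [4, 0, 6, 2, 0, 5, 3, 1, 7, 2],
    "b": [1, 1, 0, 0, 2, 0, 0, 3, 0, 1],
})
demographics = pd.DataFrame({"id": ["a", "b"], "gender": ["f", "m"], "age": [34, 51]})

train, test = split_windows(weekly, window=6, weeks_back=3)
# A left join keeps every window, even for an entity without a demographics row.
train = train.merge(demographics, on="id", how="left")
test = test.merge(demographics, on="id", how="left")
```

With ten weeks of data, each entity contributes one training window and three test windows, and every test window still has six weeks of history behind it.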

Updating the Package

The package is intentionally small and easy to extend. You can add helper functions or modify existing ones by editing the modules inside timeseries_cleaner/. No special tooling is required: the package does not depend on any external libraries beyond pandas, which is installed by default in most data science environments.

If you wish to distribute or install this package into your own projects, consider adding a minimal setup.py or pyproject.toml. For the purposes of this exercise the files have been arranged so that you can import directly from the local directory without installation.

Download files

Source Distribution

lift_timeseries_cleaner-0.1.1.tar.gz (11.1 kB)

Uploaded Source

Built Distribution

lift_timeseries_cleaner-0.1.1-py3-none-any.whl (11.3 kB)

Uploaded Python 3

File details

Details for the file lift_timeseries_cleaner-0.1.1.tar.gz.

File metadata

  • Download URL: lift_timeseries_cleaner-0.1.1.tar.gz
  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for lift_timeseries_cleaner-0.1.1.tar.gz
Algorithm     Hash digest
SHA256        de66ee719d12b4af3ca6dd56350cffbdc9f9a72b1523a1c65cee18464a2cb178
MD5           71caa2d22eb71e6d901e92c29d4fef43
BLAKE2b-256   622308e5a6c25e1755c0cd61351b21df4ab6dcb479d538461bb4c4d279f272ad

File details

Details for the file lift_timeseries_cleaner-0.1.1-py3-none-any.whl.

File hashes

Hashes for lift_timeseries_cleaner-0.1.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        ab7dfb7ccd71435b1cf3dce1e837dc6eed2d7c93b39ec2a48b181384d59a8a35
MD5           23c31c347c3527561d6e8d1a54a6f9a9
BLAKE2b-256   6bf100104b6001479326a918993d0651445ed79904f339df7cb75ca48ba394e4
