TSDC - Time Series Dataset Creator
A powerful and intuitive Python library for creating time series datasets ready for machine learning models like LSTM, GRU, and Transformers. No more manual data preprocessing - just load your data and start training!
Why TSDC?
When working with time series models (especially LSTMs), you almost always need to:
- Convert raw data into sliding window sequences
- Split data temporally (not randomly!)
- Normalize/scale features properly
- Handle multivariate inputs with single target outputs
- Create proper shapes for neural networks
TSDC automates all of this in just a few lines of code.
Table of Contents
- Installation
- Quick Start
- Core Concepts
- API Reference
- Examples
- Advanced Usage
- Development
- Contributing
Installation
From source (recommended for development)
git clone https://github.com/DeepPythonist/tsdc.git
cd tsdc
pip install -e .
For additional features
pip install -e ".[examples]"
This includes yfinance for financial data loading and matplotlib for visualization.
Quick Start
Basic Example: Single Variable
import numpy as np
from tsdc import TimeSeriesDataset
bitcoin_prices = np.random.randn(1000) * 1000 + 40000
dataset = TimeSeriesDataset(
    data=bitcoin_prices,
    lookback=60,
    horizon=1
)
dataset.prepare()
X_train, y_train = dataset.get_train()
X_val, y_val = dataset.get_val()
X_test, y_test = dataset.get_test()
print(f"X_train shape: {X_train.shape}") # (samples, 60, 1)
print(f"y_train shape: {y_train.shape}") # (samples, 1)
Multivariate Example with Target Column
import pandas as pd
from tsdc import TimeSeriesDataset
data = pd.DataFrame({
    'temperature': [...],
    'humidity': [...],
    'pressure': [...]
})
dataset = TimeSeriesDataset(
    data=data,
    lookback=24,
    horizon=6,
    target_column='temperature',
    scaler_type='minmax'
)
dataset.prepare()
X_train, y_train = dataset.get_train()
Core Concepts
1. Lookback and Horizon
- lookback: Number of past timesteps to use as input
- horizon: Number of future timesteps to predict
lookback=60, horizon=1 # Use 60 past points to predict next 1 point
lookback=24, horizon=12 # Use 24 hours to predict next 12 hours
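The windowing idea behind these two parameters can be sketched in plain numpy (this illustrates the concept, not TSDC's internal implementation):

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Concept sketch: each sample pairs `lookback` past points
    with the next `horizon` points as the target."""
    X, y = [], []
    for start in range(len(series) - lookback - horizon + 1):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback:start + lookback + horizon])
    return np.array(X), np.array(y)

series = np.arange(100.0)
X, y = make_windows(series, lookback=60, horizon=1)
print(X.shape, y.shape)  # (40, 60) (40, 1)
```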
2. Stride
Control how windows overlap:
stride=1 # Maximum overlap, windows shift by 1 timestep
stride=5 # Less overlap, windows shift by 5 timesteps
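Under the usual sliding-window arithmetic (assumed here; TSDC's exact count may differ at the edges), a series of length n yields floor((n - lookback - horizon) / stride) + 1 windows:

```python
def n_windows(n, lookback, horizon, stride):
    # Standard sliding-window count: larger stride -> fewer, less
    # overlapping windows.
    return max(0, (n - lookback - horizon) // stride + 1)

print(n_windows(1000, lookback=60, horizon=1, stride=1))  # 940
print(n_windows(1000, lookback=60, horizon=1, stride=5))  # 188
```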
3. Train/Val/Test Splits
IMPORTANT: TSDC uses temporal (sequential) splitting, NOT random splitting!
Time series splitting preserves temporal order to prevent data leakage:
TimeSeriesDataset(
    data=data,
    train_split=0.7,   # First 70% for training (oldest data)
    val_split=0.15,    # Next 15% for validation (middle data)
    test_split=0.15    # Last 15% for testing (newest data)
)
# Train → Val → Test (sequential, no shuffling)
# This prevents training on future data and testing on past data!
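The splitting logic amounts to slicing by position and never shuffling; a plain-numpy sketch of the idea (not TSDC's internals):

```python
import numpy as np

def temporal_split(X, y, train=0.7, val=0.15):
    # Sequential split: every validation/test sample is later in time
    # than every training sample, so no future data leaks into training.
    n = len(X)
    i = int(n * train)
    j = i + int(n * val)
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])

X = np.arange(100).reshape(100, 1)
y = np.arange(100)
train, val, test = temporal_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```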
4. Scaling Options
scaler_type='minmax' # Scale to [0, 1]
scaler_type='standard' # Zero mean, unit variance
scaler_type='robust' # Robust to outliers
scaler_type='none' # No scaling
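Whatever scaler you pick, the standard discipline is to fit its statistics on the training split only and reuse them elsewhere; a plain-numpy sketch of 'minmax' under that assumption (TSDC's internal scaler handling may differ):

```python
import numpy as np

data = np.arange(100, dtype=float).reshape(-1, 1)
train, test = data[:70], data[70:]

# Fit 'minmax' statistics on the training portion only...
lo, hi = train.min(axis=0), train.max(axis=0)
train_scaled = (train - lo) / (hi - lo)
# ...then reuse them on test data to avoid leakage; later, larger
# values legitimately land outside [0, 1].
test_scaled = (test - lo) / (hi - lo)

print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
print(test_scaled.max() > 1.0)                 # True
```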
API Reference
TimeSeriesDataset
Main class for dataset creation.
TimeSeriesDataset(
    data: Union[np.ndarray, pd.DataFrame, pd.Series, str],
    lookback: int = 10,
    horizon: int = 1,
    stride: int = 1,
    target_column: Optional[Union[int, str]] = None,
    scaler_type: str = "minmax",
    train_split: float = 0.7,
    val_split: float = 0.15,
    test_split: float = 0.15
)
Methods:
- prepare(preprocess=True): Prepare the dataset
- get_train(): Returns (X_train, y_train)
- get_val(): Returns (X_val, y_val)
- get_test(): Returns (X_test, y_test)
- get_all(): Returns a dictionary with all splits
- get_info(): Get dataset information
- inverse_transform_predictions(predictions): Convert scaled predictions back to the original scale
Sequencer
Low-level API for creating sequences.
from tsdc import Sequencer
sequencer = Sequencer(lookback=10, horizon=5, stride=1)
X, y = sequencer.create_sequences(data)
Preprocessor
Standalone preprocessing utilities.
from tsdc import Preprocessor
preprocessor = Preprocessor(
    scaler_type='minmax',
    handle_missing='forward_fill',
    remove_outliers=True,
    outlier_threshold=3.0
)
scaled_data = preprocessor.fit_transform(data)
original_data = preprocessor.inverse_transform(scaled_data)
FinancialLoader
Load financial data from Yahoo Finance.
from tsdc.loaders import FinancialLoader
loader = FinancialLoader()
btc_data = loader.load(
    symbol="BTC-USD",
    start_date="2023-01-01",
    end_date="2024-01-01",
    source="yahoo"
)
btc_data = loader.add_technical_indicators(
    sma_periods=[20, 50],
    ema_periods=[12, 26],
    rsi_period=14,
    macd=True
)
Examples
Example 1: Bitcoin Price Prediction with LSTM
import numpy as np
from tsdc import TimeSeriesDataset
from tsdc.loaders import FinancialLoader
loader = FinancialLoader()
btc_data = loader.load(symbol="BTC-USD", start_date="2022-01-01")
dataset = TimeSeriesDataset(
    data=btc_data[['Close', 'Volume']],
    lookback=60,
    horizon=1,
    target_column='Close',
    scaler_type='minmax'
)
dataset.prepare()
X_train, y_train = dataset.get_train()
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
model = Sequential([
    LSTM(100, return_sequences=True, input_shape=(60, 2)),
    Dropout(0.2),
    LSTM(50, return_sequences=False),
    Dropout(0.2),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=32)
Example 2: Walk-Forward Validation
import numpy as np
from tsdc import Sequencer
from tsdc.utils.splitters import walk_forward_validation
data = np.random.randn(1000, 3)
sequencer = Sequencer(lookback=20, horizon=1)
X, y = sequencer.create_sequences(data)
# Assumes a `model` with fit/evaluate, e.g. the LSTM from Example 1
for X_train, y_train, X_test, y_test in walk_forward_validation(X, y, n_splits=5):
    model.fit(X_train, y_train)
    score = model.evaluate(X_test, y_test)
    print(f"Test Score: {score}")
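The splits this loop iterates over follow the walk-forward pattern; a plain-numpy sketch of that pattern (the library helper's exact fold sizes may differ):

```python
import numpy as np

def walk_forward(X, y, n_splits):
    """Concept sketch of walk-forward validation: each fold trains on
    everything seen so far and tests on the next contiguous chunk."""
    fold = len(X) // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield (X[:k * fold], y[:k * fold],
               X[k * fold:(k + 1) * fold], y[k * fold:(k + 1) * fold])

X = np.arange(120).reshape(-1, 1)
y = np.arange(120)
for X_tr, y_tr, X_te, y_te in walk_forward(X, y, n_splits=5):
    print(len(X_tr), len(X_te))  # train grows 20..100, test stays 20
```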
Example 3: Loading from CSV
from tsdc import TimeSeriesDataset
dataset = TimeSeriesDataset(
    data="path/to/data.csv",
    lookback=30,
    horizon=7,
    target_column="sales"
)
dataset.prepare()
Example 4: Custom Preprocessing
from tsdc import Preprocessor
preprocessor = Preprocessor(
    scaler_type='robust',
    handle_missing='interpolate',
    remove_outliers=True,
    outlier_threshold=2.5
)
cleaned_data = preprocessor.fit_transform(raw_data)
Advanced Usage
Multi-step Forecasting
Predict multiple timesteps ahead:
dataset = TimeSeriesDataset(
    data=data,
    lookback=48,
    horizon=24,
    target_column='price'
)
dataset.prepare()
X_train, y_train = dataset.get_train()
Custom Splits with Indices
from tsdc.utils.splitters import expanding_window_split
for X_train, y_train, X_test, y_test in expanding_window_split(
    X, y,
    initial_train_size=100,
    test_size=20,
    step=10
):
    pass
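A plain-numpy sketch of the expanding-window pattern, with parameter semantics assumed from the names above rather than taken from the library's source:

```python
import numpy as np

def expanding_window(X, y, initial_train_size, test_size, step):
    # The training window starts at `initial_train_size` and grows by
    # `step` each fold; each fold tests on the next `test_size` samples.
    start = initial_train_size
    while start + test_size <= len(X):
        yield (X[:start], y[:start],
               X[start:start + test_size], y[start:start + test_size])
        start += step

X = np.arange(160).reshape(-1, 1)
y = np.arange(160)
folds = list(expanding_window(X, y, initial_train_size=100, test_size=20, step=10))
print([len(f[0]) for f in folds])  # [100, 110, 120, 130, 140]
```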
Inverse Transform Predictions
predictions = model.predict(X_test)
original_scale = dataset.inverse_transform_predictions(predictions)
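What the inverse transform undoes, for the 'minmax' case, is just the fitted linear rescaling; a plain-numpy sketch with made-up statistics (dataset.inverse_transform_predictions wraps the scaler actually fitted on your data):

```python
import numpy as np

# Hypothetical 'minmax' statistics learned from the training data
lo, hi = 40000.0, 60000.0

scaled_preds = np.array([0.25, 0.5, 0.75])   # model outputs in [0, 1]
original = scaled_preds * (hi - lo) + lo     # back to the original units
print(original)  # [45000. 50000. 55000.]
```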
Project Structure
tsdc/
├── tsdc/
│   ├── __init__.py
│   ├── core/
│   │   ├── dataset.py          # Main TimeSeriesDataset class
│   │   ├── sequencer.py        # Sliding window operations
│   │   └── preprocessor.py     # Data preprocessing
│   ├── loaders/
│   │   ├── base.py             # Base loader class
│   │   └── financial.py        # Financial data loaders
│   └── utils/
│       ├── validators.py       # Input validation
│       └── splitters.py        # Time series splitting
├── examples/
│   ├── basic_usage.py          # Basic examples
│   ├── lstm_bitcoin.py         # Bitcoin prediction
│   └── quick_start.py          # Quick start guide
├── tests/
│   └── test_core.py            # Unit tests
├── setup.py
├── requirements.txt
└── README.md
Development
Running Tests
pytest tests/ -v
Running Examples
python examples/basic_usage.py
python examples/lstm_bitcoin.py
Code Style
This project follows PEP 8 guidelines. Format with black and lint with flake8:
black tsdc/
flake8 tsdc/
Features
- ✅ Easy sequence creation for LSTM/GRU/Transformer models
- ✅ Built-in preprocessing and normalization
- ✅ Proper train/validation/test splitting for time series
- ✅ Support for univariate and multivariate data
- ✅ Target column selection for multivariate inputs
- ✅ Financial data loaders with technical indicators
- ✅ Walk-forward and expanding window validation
- ✅ Flexible sliding window operations
- ✅ Missing value handling
- ✅ Outlier detection and removal
- ✅ Inverse transform for predictions
- ✅ Multiple scaling methods
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use TSDC in your research, please cite:
@software{tsdc2024,
  title={TSDC: Time Series Dataset Creator},
  author={DeepPythonist},
  year={2024},
  url={https://github.com/DeepPythonist/tsdc}
}
Support
For issues and questions:
- Open an issue on GitHub Issues
- Check the examples/ directory for usage examples
Roadmap
- Add more data loaders (crypto, weather, etc.)
- Add data augmentation techniques
- Support for irregular time series
- Integration with PyTorch DataLoader
- Built-in visualization tools
- Automated hyperparameter tuning for lookback/horizon
Made with ❤️ for the ML community