A flexible, configuration-driven data pipeline for asset-pricing research.
paper-data: Data Ingestion & Preprocessing for Asset Pricing Research 📊
paper-data is a core component of the P.A.P.E.R (Platform for Asset Pricing Experimentation and Research) monorepo. It provides a robust, flexible, and configuration-driven pipeline for ingesting raw financial and economic data, performing essential wrangling operations, and exporting clean, processed datasets ready for modeling and portfolio construction.
Built with Polars for high performance and memory efficiency, paper-data streamlines the often complex and time-consuming process of data preparation in quantitative finance.
✨ Features
- Modular Data Connectors: Seamlessly ingest data from various sources:
  - 📁 Local Files: Load data from local CSV files (`CSVLoader`).
  - 📝 Google Sheets: Download and cache public Google Sheets (`GoogleSheetConnector`).
  - 🔒 WRDS: Execute SQL queries on Wharton Research Data Services and cache results locally (`WRDSConnector`).
- Comprehensive Wrangling Operations: Apply common data transformations declaratively via a YAML configuration:
  - Monthly Imputation: Fill missing numeric values with cross-sectional medians and categorical values with modes (see the sketch after this list).
  - Min-Max Scaling: Normalize features to a specified range (e.g., [-1, 1]) on a monthly cross-sectional basis.
  - Dummy Variable Generation: Create one-hot encoded (dummy) columns from a categorical feature (e.g., industry codes).
  - Dataset Merging: Combine different datasets (e.g., firm-level with macro-level data) using various join types.
  - Lagging/Leading: Create lagged or lead versions of columns for time-series analysis, with support for panel data grouping.
  - Interaction Terms: Generate interaction features between different sets of columns (e.g., firm characteristics and macro indicators).
- Configuration-Driven Pipeline: Define your entire data pipeline (ingestion, wrangling, export) in a human-readable YAML file, promoting reproducibility and ease of experimentation.
- Performance-Optimized: Leverages the speed and efficiency of the Polars DataFrame library for all data manipulation tasks, including support for lazy (out-of-core) execution for memory-intensive operations.
- Flexible Export: Export processed data to the efficient Parquet format, with optional partitioning by year for easy downstream consumption by the modeling pipeline.
- Integrated Logging: Detailed logs are written to a file, providing transparency and debugging capabilities without cluttering the console.
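For a concrete sense of what these declarative operations do, here is a minimal Polars sketch of monthly cross-sectional median imputation on a toy panel. It only illustrates the semantics, not paper-data's internal implementation, and the column names are made up for the example.

```python
import polars as pl

# Toy panel: two firms over two months, with one missing marketcap value.
df = pl.DataFrame(
    {
        "date": ["2024-01-31", "2024-01-31", "2024-02-29", "2024-02-29"],
        "permco": [1001, 1002, 1001, 1002],
        "marketcap": [250.0, None, 260.0, 115.0],
    }
).with_columns(pl.col("date").str.to_date("%Y-%m-%d"))

# Monthly cross-sectional median imputation: each missing value is filled with
# the median of the same column across all firms in the same month.
imputed = df.with_columns(
    pl.col("marketcap").fill_null(pl.col("marketcap").median().over("date"))
)
print(imputed)
```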
🚀 Installation
paper-data is designed to be part of the larger PAPER monorepo. You can install it as an optional dependency of paper-asset-pricing or as a standalone package.
Recommended (as part of paper-asset-pricing):
This method ensures paper-data is available to the main paper CLI orchestrator.
pip install "paper-asset-pricing[data]"
Standalone Installation:
Use this if you only need paper-data and its core functionality in a different project.
pip install paper-data
From Source (for development within the monorepo):
Navigate to the root of your PAPER monorepo and install paper-data in editable mode.
pip install -e ./paper-data
📖 Usage Example: Synthetic Data Pipeline
This example demonstrates how to use paper-data to process synthetic firm-level and macro-economic data.
1. Project Setup & Data Generation
First, ensure you have initialized a project using paper init ThesisExample. For this example, we'll assume your project directory ThesisExample/ is at the root of the monorepo.
Navigate to the paper-data/examples/synthetic_data directory and generate the raw CSV files:
# Assuming you are in the monorepo root
cd paper-data/examples/synthetic_data
# Generate synthetic firm and macro data
python firm_synthetic.py
python macro_synthetic.py
This will create firm_synthetic.csv and macro_synthetic.csv.
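Optionally, you can sanity-check the generated files with Polars before wiring up the configuration. A quick inspection, assuming you are still in paper-data/examples/synthetic_data (the columns shown are whatever the generator scripts produced):

```python
import polars as pl

# Peek at the raw synthetic data before it enters the pipeline.
firm = pl.read_csv("firm_synthetic.csv")
macro = pl.read_csv("macro_synthetic.csv")
print(firm.head())
print(macro.head())
```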
2. Data Configuration (data-config.yaml)
Create a data-config.yaml file in your project's configs directory (e.g., ThesisExample/configs/data-config.yaml). This file defines the entire data processing pipeline.
```yaml
# ThesisExample/configs/data-config.yaml
ingestion:
  - name: "firm_data_raw"
    path: "firm_synthetic.csv" # Path relative to ThesisExample/data/raw
    format: "csv"
    date_column: { "date": "%Y%m%d" }
    firm_id_column: "permco"
    to_lowercase_cols: true

  - name: "macro_data_raw"
    path: "macro_synthetic.csv" # Path relative to ThesisExample/data/raw
    format: "csv"
    date_column: { "date": "%Y%m%d" }
    to_lowercase_cols: true

wrangling_pipeline:
  - operation: "monthly_imputation"
    dataset: "firm_data_raw"
    numeric_columns: [ "volume", "marketcap" ]
    output_name: "firm_data_imputed"

  - operation: "merge"
    left_dataset: "firm_data_imputed"
    right_dataset: "macro_data_raw"
    on: [ "date" ]
    how: "left"
    output_name: "merged_data"

  - operation: "lag"
    dataset: "merged_data"
    periods: 1
    columns_to_lag:
      - method: "all_except"
        columns: [ "date", "permco", "return", "volume", "marketcap" ]
    drop_original_cols_after_lag: false
    restore_names: false
    drop_generated_nans: true
    output_name: "panel_with_lags"

  - operation: "create_macro_interactions"
    dataset: "panel_with_lags"
    macro_columns: [ "gdp_growth_lag_1", "cpi_lag_1", "unemployment_lag_1" ]
    firm_columns: [ "marketcap" ]
    drop_macro_columns: false
    output_name: "final_panel_data"

export:
  - dataset_name: "final_panel_data"
    output_filename_base: "processed_panel_data"
    format: "parquet"
    partition_by: "year" # 'year' or 'none' are supported
```
Important: Copy the generated CSV files into your project's raw data directory.
# From the monorepo root
cp paper-data/examples/synthetic_data/*.csv ThesisExample/data/raw/
3. Running the Data Pipeline
The intended way to run the pipeline is with the paper-asset-pricing CLI from within your project directory.
# Navigate to your project directory from the monorepo root
cd ThesisExample
# Execute the data phase
paper execute data
4. Expected Output
Console Output:
The console output is minimal; it confirms that the phase completed and points you to the log file.
>>> Executing Data Phase <<<
Data phase completed successfully. Additional information in 'ThesisExample/logs.log'
ThesisExample/logs.log Content (Snippet):
The log file provides a detailed, step-by-step account of the pipeline's execution.
INFO - Starting Data Phase for project: ThesisExample
INFO - Using data configuration: /path/to/monorepo/ThesisExample/configs/data-config.yaml
INFO - Running data pipeline for project: /path/to/monorepo/ThesisExample
INFO - --- Ingesting Data ---
INFO - Dataset 'firm_data_raw' ingested. Shape: (125, 5)
INFO - Dataset 'macro_data_raw' ingested. Shape: (25, 4)
INFO - --- Wrangling Data ---
INFO - --- Wrangling Step 1: monthly_imputation ---
INFO - Input Dataset: 'firm_data_raw'
INFO - Numeric Columns: ['volume', 'marketcap']
INFO - Output Dataset: 'firm_data_imputed'
INFO - --- Wrangling Step 2: merge ---
INFO - Left Dataset: 'firm_data_imputed' (Shape: (125, 5))
INFO - Right Dataset: 'macro_data_raw' (Shape: (25, 4))
INFO - -> Merge complete. New dataset 'merged_data' shape: (125, 8)
INFO - --- Wrangling Step 3: lag ---
INFO - Input Dataset: 'merged_data'
INFO - Periods: 1
INFO - Columns to Lag: ['gdp_growth', 'cpi', 'unemployment']
INFO - -> Lag operation complete. New dataset 'panel_with_lags' shape: (120, 11)
INFO - --- Wrangling Step 4: create_macro_interactions ---
INFO - Input Dataset: 'panel_with_lags'
INFO - Macro Columns: ['gdp_growth_lag_1', 'cpi_lag_1', 'unemployment_lag_1']
INFO - Firm Columns: ['marketcap']
INFO - -> Eager macro-firm interaction creation complete. New dataset 'final_panel_data' shape: (120, 14)
INFO - --- Exporting Data ---
INFO - Found eager dataset 'final_panel_data' for export.
INFO - Exporting 'final_panel_data' by year to separate files:
INFO - Exported data for year 2024 to '.../ThesisExample/data/processed/processed_panel_data_2024.parquet'.
INFO - Exported data for year 2025 to '.../ThesisExample/data/processed/processed_panel_data_2025.parquet'.
INFO - Data pipeline completed successfully.
5. Processed Data Output
After successful execution, you will find the processed Parquet files in your project's data/processed directory:
ThesisExample/data/processed/
├── processed_panel_data_2024.parquet
└── processed_panel_data_2025.parquet
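Downstream code (for example, the modeling pipeline) can read these yearly partitions back with Polars. A minimal sketch, assuming the layout shown above and paths relative to the monorepo root:

```python
import polars as pl

# Eager read: load all yearly partitions into a single frame.
panel = pl.read_parquet("ThesisExample/data/processed/processed_panel_data_*.parquet")
print(panel.shape)

# Lazy alternative for larger panels: scan the files and defer execution.
lazy_panel = pl.scan_parquet("ThesisExample/data/processed/processed_panel_data_*.parquet")
print(lazy_panel.head().collect())
```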
⚙️ Configuration Reference
The data-config.yaml file is the heart of paper-data. Here's a breakdown of its main sections:
ingestion
A list of datasets to ingest. Each item defines a source:
- `name` (string, required): A unique identifier for the dataset within the pipeline.
- `format` (string, required): The ingestion format. Supports `"csv"`, `"google_sheet"`, `"wrds"`, `"google_drive"`.
  - For `csv`: `path` (string, required): Relative path to the raw data file (from `project_root/data/raw/`).
  - For `google_sheet` / `google_drive`: `url` (string, required): The full URL to the shareable resource.
  - For `wrds`: `query` (string, required): The SQL query to execute.
- `date_column` (object, required): Specifies the date column and its format, e.g., `{ "date": "%Y%m%d" }` (see the sketch after this list).
- `firm_id_column` (string, optional): The column name for the firm identifier (e.g., `"permco"`).
- `to_lowercase_cols` (boolean, optional, default: `false`): Whether to convert all column names to lowercase.
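To make the `date_column` format specifier concrete, this is roughly what parsing `{ "date": "%Y%m%d" }` amounts to in Polars (an illustration only, not paper-data's internal code):

```python
import polars as pl

# Raw CSVs often store dates as YYYYMMDD strings; the format string tells the
# loader how to turn them into a proper Date column.
df = pl.DataFrame({"date": ["20240131", "20240229"], "permco": [1001, 1002]})
df = df.with_columns(pl.col("date").str.to_date("%Y%m%d"))
print(df)
```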
wrangling_pipeline
A sequential list of operations to apply to your datasets.
operation: "monthly_imputation"dataset(string, required): The name of the dataset to apply imputation to.numeric_columns(list, optional): Columns to impute with monthly cross-sectional median.categorical_columns(list, optional): Columns to impute with monthly cross-sectional mode.output_name(string, required): The name for the resulting dataset.
operation: "scale_to_range"dataset(string, required): The name of the dataset to scale.range(object, required): Defines the targetminandmax. E.g.,{ min: -1, max: 1 }.cols_to_scale(list, required): Numeric columns to apply min-max scaling to.output_name(string, required): The name for the resulting dataset.
operation: "merge"left_dataset&right_dataset(string, required): Names of the datasets to merge.on(list, required): Columns to merge on.how(string, required): Join type ("left","inner", etc.).output_name(string, required): Name for the merged dataset.
operation: "lag"dataset(string, required): The dataset to use.periods(integer, required): Number of periods to shift.columns_to_lag(list, required): Defines which columns to lag. The only supportedmethodis"all_except".drop_original_cols_after_lag(boolean, optional): Iftrue, original columns are dropped.restore_names(boolean, optional): Iftrueanddrop_original_cols_after_lagistrue, renames lagged columns to their original names.drop_generated_nans(boolean, optional): Iftrue, drops rows with NaNs introduced by lagging.output_name(string, required): The name for the resulting dataset.
operation: "create_macro_interactions"dataset(string, required): The dataset to use.macro_columns&firm_columns(list, required): Lists of columns to interact.use_lazy_engine(boolean, optional): Iftrue, uses Polars' lazy API to reduce memory usage.output_name(string, required): The name for the resulting dataset.
export
A list of processed datasets to export.
- `dataset_name` (string, required): The name of the dataset to export.
- `output_filename_base` (string, required): The base name for the output file(s).
- `format` (string, required): Currently supports `"parquet"`.
- `partition_by` (string, optional): How to partition the output. Supports `"year"` or `"none"` (a sketch of year partitioning follows below).
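For intuition, partitioning by year amounts to splitting on the calendar year of the date column and writing one Parquet file per year. The helper below is hypothetical and only illustrates the idea, not paper-data's exporter:

```python
from pathlib import Path

import polars as pl


def export_by_year(df: pl.DataFrame, out_dir: Path, base: str) -> None:
    """Write one Parquet file per calendar year of the `date` column."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with_year = df.with_columns(pl.col("date").dt.year().alias("_year"))
    for yearly in with_year.partition_by("_year", maintain_order=True):
        year = yearly["_year"][0]
        yearly.drop("_year").write_parquet(out_dir / f"{base}_{year}.parquet")
```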
🤝 Contributing
We welcome contributions to paper-data! If you have suggestions for new data connectors, wrangling operations, or performance improvements, please feel free to open an issue or submit a pull request.
📄 License
paper-data is distributed under the MIT License. See the LICENSE file for more information.