Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset.
cat2cat
Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset
cat2cat provides a statistical solution for harmonising categorical variables whose encoding changes between survey waves or data releases. If you work with longitudinal data where classification schemes evolve (occupations, diseases, industries, products, or fields of education), this package helps produce valid cross-temporal analyses.
The Problem
Real-world classifications change. When one classification replaces another, a single old code may map to multiple new codes (and vice versa). Naive responses are unsatisfactory: separate analyses by period limit comparability, manual recoding is hard to reproduce, and ignoring coding changes can bias results.
The Solution
cat2cat maps a categorical variable using a transition table between two time points. The transition table should list candidate categories for each code in the period being harmonised. When one observed code corresponds to several target categories, cat2cat replicates the observation across candidates and assigns probability weights using frequencies or ML-based predictions.
The method follows a replication-and-weighting procedure that:
- Replicates each observation onto all candidate categories from the mapping table for a chosen direction.
- Assigns probability weights that sum to 1 per original observation.
- Preserves distributional properties of non-mapped variables for valid downstream analysis.
NOTE: If you have a fully linked panel where each subject appears in both periods and target-period categories are directly available, probabilistic harmonisation may be unnecessary.
cat2cat() is most useful when direct linking is incomplete, for example in repeated cross-sections, rotating panels, or samples with entrants and leavers.
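The replication-and-weighting idea described above can be sketched in plain pandas. This is an illustration only, not the package's implementation: the tiny transition table, the column names (`old`, `new`, `id`, `code`, `weight`) and the fallback to equal weights when no candidate is observed are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical transition table: old code "A" splits into "A1" and "A2".
trans = pd.DataFrame({"old": ["A", "A", "B"], "new": ["A1", "A2", "B1"]})

# Hypothetical period observed under the old coding.
old_period = pd.DataFrame({"id": [1, 2, 3], "code": ["A", "A", "B"]})

# Hypothetical period observed under the new coding, used only for frequencies.
new_period = pd.DataFrame({"code": ["A1", "A1", "A1", "A2", "B1"]})

# Step 1: replicate each old observation onto every candidate new code.
replicated = old_period.merge(trans, left_on="code", right_on="old", how="left")

# Step 2: attach how often each candidate occurs in the new period.
freq = new_period["code"].value_counts().rename("freq")
replicated = replicated.join(freq, on="new")
replicated["freq"] = replicated["freq"].fillna(0)

# Step 3: normalise within each original observation so weights sum to 1.
# If no candidate is observed at all, fall back to equal weights (assumption).
totals = replicated.groupby("id")["freq"].transform("sum")
n_candidates = replicated.groupby("id")["new"].transform("count")
replicated["weight"] = (replicated["freq"] / totals).where(
    totals > 0, 1 / n_candidates
)

print(replicated[["id", "code", "new", "weight"]])
```

Each original observation with an ambiguous code appears once per candidate, and its weights sum to 1, so weighted totals still count every subject exactly once.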
Value Added of cat2cat
cat2cat separates true structural change from coding-system change.
After harmonisation, you can:
- Track trends within groups across waves.
- Compare subgroup dynamics on one consistent coding scheme.
- Estimate models with group effects/interactions.
- Run sensitivity checks across weighting assumptions.
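As a rough sketch of the first two points, weighted group sizes and weighted group means can be computed from a harmonised table with ordinary pandas. The frame below is made up, and the column names `group` and `weight` are placeholders for this example rather than the names cat2cat actually emits.

```python
import pandas as pd

# Hypothetical harmonised data: one row per (observation, candidate category).
harmonised = pd.DataFrame(
    {
        "year": [2008, 2008, 2008, 2010, 2010],
        "group": ["A1", "A2", "B1", "A1", "B1"],
        "weight": [0.75, 0.25, 1.0, 1.0, 1.0],
        "salary": [100.0, 100.0, 80.0, 110.0, 90.0],
    }
)

# Weighted group sizes per wave: sum the weights instead of counting rows.
counts = (
    harmonised.groupby(["year", "group"])["weight"].sum().unstack(fill_value=0.0)
)

# Weighted mean salary per wave and group.
weighted = harmonised.assign(ws=harmonised["salary"] * harmonised["weight"])
sums = weighted.groupby(["year", "group"])[["ws", "weight"]].sum()
mean_salary = sums["ws"] / sums["weight"]

print(counts)
print(mean_salary)
```

Summing weights rather than counting rows keeps replicated observations from inflating group sizes.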
Direction
You can harmonise in either direction:
- Forward mapping (old -> new)
- Backward mapping (new -> old)
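To make the two directions concrete, the same transition table can be read from old codes to new codes or from new codes to old codes. The sketch below uses a made-up table with assumed column names `old` and `new`; it only shows how the candidate sets differ and does not reproduce how the package labels or applies each direction.

```python
import pandas as pd

# Hypothetical transition table: "A" splits in two, "B" and "C" merge into "BC".
trans = pd.DataFrame(
    {"old": ["A", "A", "B", "C"], "new": ["A1", "A2", "BC", "BC"]}
)

# Reading the table from old codes to new codes: candidates for each old code.
old_to_new = trans.groupby("old")["new"].apply(list).to_dict()

# Reading the table from new codes to old codes: candidates for each new code.
new_to_old = trans.groupby("new")["old"].apply(list).to_dict()

print(old_to_new)
print(new_to_old)
```

A split is ambiguous when read one way and a merge is ambiguous when read the other way, which is why the chosen direction determines which observations need to be replicated.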
Key Features
| Feature | Benefit |
|---|---|
| Probability weights | Naive, frequency, and ML-based weights |
| ML validation | cat2cat_ml_run() reports accuracy, Brier score, and mean P(true class) |
| Multi-period chaining | Harmonise 3+ waves iteratively |
| Regression support | summary_c2c() adjusts inference for replicated-data workflows |
| Aggregated workflows | Harmonisation tools for grouped data use-cases |
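For readers unfamiliar with the three validation metrics named in the table, here is a small NumPy sketch of how they are commonly defined. The probabilities and labels are invented, and the multiclass Brier score shown (squared error summed over classes, averaged over observations) is one common convention; whether cat2cat_ml_run() uses exactly this definition is an assumption.

```python
import numpy as np

# Hypothetical sorted class labels and predicted probabilities (rows sum to 1).
classes = np.array(["A1", "A2", "B1"])
proba = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.4, 0.5, 0.1],
        [0.1, 0.1, 0.8],
        [0.6, 0.3, 0.1],
    ]
)
y_true = np.array(["A1", "A2", "B1", "A2"])

# Column index of the true class for each row (classes must be sorted).
true_idx = np.searchsorted(classes, y_true)
onehot = np.eye(len(classes))[true_idx]

# Accuracy: share of rows where the most probable class is the true one.
accuracy = float((proba.argmax(axis=1) == true_idx).mean())

# Multiclass Brier score: mean squared distance to the one-hot truth.
brier = float(((proba - onehot) ** 2).sum(axis=1).mean())

# Mean probability assigned to the true class.
mean_p_true = float(proba[np.arange(len(y_true)), true_idx].mean())

print(accuracy, brier, mean_p_true)
```

Accuracy only looks at the top prediction, while the Brier score and mean P(true class) reward well-calibrated probabilities, which matters here because the probabilities themselves become the weights.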
References
- Method: Nasinski, Majchrowska & Broniatowska (2020)
- Software: Nasinski & Gajowniczek (2023)
Ecosystem
| Resource | Details |
|---|---|
| R Package | CRAN, production-ready |
| Python Package | PyPI |
| Documentation | API and guides |
Documentation
- Get Started - core concepts and a two-period workflow.
- Choosing Weights and Validating ML - weight strategy and ML validation.
- Advanced Workflows - multi-period and advanced usage patterns.
Installation
pip install cat2cat
Quick Start
from pandas import concat
from sklearn.ensemble import RandomForestClassifier

from cat2cat import cat2cat, cat2cat_ml_run
from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
from cat2cat.datasets import load_occup, load_trans

# Load the example occupational data and the transition table.
occup = load_occup()
trans = load_trans()

# Split the data into the two periods to be harmonised.
old = occup.loc[occup.year == 2008, :].copy()
new = occup.loc[occup.year == 2010, :].copy()

# Describe the data and the mapping, then run the harmonisation.
data = cat2cat_data(old=old, new=new, cat_var_old="code", cat_var_new="code", time_var="year")
mappings = cat2cat_mappings(trans=trans, direction="backward")
c2c = cat2cat(data=data, mappings=mappings)

# Stack both periods into one harmonised dataset.
harmonised = concat([c2c["old"], c2c["new"]])

# Add a string version of education to use as a feature.
new["edu_group"] = new["edu"].astype(str)
old["edu_group"] = old["edu"].astype(str)

# Configure the ML step: target variable, features, and candidate models.
ml = cat2cat_ml(
    data=new,
    cat_var="code",
    features=["salary", "age", "edu_group"],
    models=[RandomForestClassifier(n_estimators=50, random_state=1234)],
    on_fail="freq",
    fail_warn=True,
)

# Validate the ML models and print the diagnostics.
diagnostics = cat2cat_ml_run(mappings=mappings, ml=ml)
print(diagnostics)
Citation
If you use cat2cat in your research, please cite:
Nasinski M, Gajowniczek K (2023). "cat2cat: Handling an Inconsistently Coded
Categorical Variable in a Longitudinal Dataset." SoftwareX, 24, 101525.
doi:10.1016/j.softx.2023.101525
@article{nasinski2023cat2cat,
  title={cat2cat: Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset},
  author={Nasinski, Maciej and Gajowniczek, Krzysztof},
  journal={SoftwareX},
  volume={24},
  pages={101525},
  year={2023},
  doi={10.1016/j.softx.2023.101525}
}
Contributing
Interested in contributing? Check the contributing guidelines and code of conduct.
License
cat2cat is licensed under Apache License 2.0. See LICENSE.