
Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset.


cat2cat


Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset

cat2cat provides a statistical solution for harmonising categorical variables whose encoding changes between survey waves or data releases. If you work with longitudinal data where classification schemes evolve (occupations, diseases, industries, products, or fields of education), this package helps produce valid cross-temporal analyses.

The Problem

Real-world classifications change. When one classification replaces another, a single old code may map to multiple new codes (and vice versa). Naive responses are unsatisfactory: separate analyses by period limit comparability, manual recoding is hard to reproduce, and ignoring coding changes can bias results.

The Solution

cat2cat maps a categorical variable using a transition table between two time points. The transition table should list candidate categories for each code in the period being harmonised. When one observed code corresponds to several target categories, cat2cat replicates the observation across candidates and assigns probability weights using frequencies or ML-based predictions.

The method follows a replication-and-weighting procedure that:

  1. Replicates each observation onto all candidate categories from the mapping table for a chosen direction.
  2. Assigns probability weights that sum to 1 per original observation.
  3. Preserves distributional properties of non-mapped variables for valid downstream analysis.
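The three steps can be illustrated with plain pandas. This is a minimal sketch on toy data, not the package's internal implementation: the codes, the column names (`id`, `old`, `new`, `wei_naive`, `wei_freq`), and the target-period counts are all assumptions made for the illustration.

```python
import pandas as pd

# Toy transition table (hypothetical codes): old code "A" splits into "A1" and "A2".
trans = pd.DataFrame({"old": ["A", "A", "B"], "new": ["A1", "A2", "B1"]})

# Observations coded in the old scheme that need to be expressed in the new one.
obs = pd.DataFrame(
    {"id": [1, 2, 3], "old": ["A", "A", "B"], "salary": [10.0, 20.0, 30.0]}
)

# Counts of the target categories in the period that already uses them (made up).
target_counts = pd.Series({"A1": 30, "A2": 10, "B1": 5})

# Step 1: replicate each observation onto every candidate category.
rep = obs.merge(trans, on="old", how="left")

# Step 2a: naive weights split each observation equally among its candidates.
rep["wei_naive"] = 1.0 / rep.groupby("id")["new"].transform("size")

# Step 2b: frequency weights are proportional to the target-period counts.
rep["n"] = rep["new"].map(target_counts)
rep["wei_freq"] = rep["n"] / rep.groupby("id")["n"].transform("sum")

# Step 3: other variables (here salary) are copied unchanged onto every replicate,
# so weighted totals match the totals of the original observations.
print(rep[["id", "old", "new", "salary", "wei_naive", "wei_freq"]])
```

Because the weights sum to 1 within each original observation, the weighted sum of `salary` over the replicated rows equals the sum over the three original rows.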

NOTE: If you have a fully linked panel where each subject appears in both periods and target-period categories are directly available, probabilistic harmonisation may be unnecessary. cat2cat() is most useful when direct linking is incomplete (for example, repeated cross-sections, rotating panels, or entrants/leavers).

Value Added of cat2cat

cat2cat separates true structural change from coding-system change.

After harmonisation, you can:

  • Track trends within groups across waves.
  • Compare subgroup dynamics on one consistent coding scheme.
  • Estimate models with group effects/interactions.
  • Run sensitivity checks across weighting assumptions.
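As a rough illustration of the first two points, replicated rows can be aggregated by using the probability weights as case weights. The sketch below uses a hand-made data frame; the column names (`id`, `year`, `group`, `salary`, `weight`) are hypothetical and are not the names produced by the package.

```python
import pandas as pd

# Hypothetical harmonised data: observation 1 was replicated onto two groups.
harmonised = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 4],
        "year": [2008, 2008, 2008, 2010, 2010],
        "group": ["A1", "A2", "A1", "A1", "A2"],
        "salary": [10.0, 10.0, 20.0, 30.0, 40.0],
        "weight": [0.75, 0.25, 1.0, 1.0, 1.0],
    }
)

# Weighted head count and weighted mean salary per year and group.
harmonised["w_salary"] = harmonised["salary"] * harmonised["weight"]
agg = harmonised.groupby(["year", "group"])[["weight", "w_salary"]].sum()
agg["mean_salary"] = agg["w_salary"] / agg["weight"]
trend = agg.rename(columns={"weight": "n_weighted"})[["n_weighted", "mean_salary"]]
print(trend)
```

The weighted counts add up to the number of original observations (four here), so group shares remain comparable across years. Repeating the aggregation with a different weight column is one simple form of sensitivity check.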

Direction

You can harmonise in either direction:

Forward Mapping (Old -> New)

[Figure: forward mapping]

Backward Mapping (New -> Old)

[Figure: backward mapping]
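A single transition table can be read in both directions. The sketch below, using made-up codes and the hypothetical column names `old` and `new`, builds the candidate lookup for each reading; it only illustrates the idea and does not show how the package resolves `direction` internally.

```python
import pandas as pd

# Toy transition table (hypothetical codes): each row links an old code to a new code.
trans = pd.DataFrame(
    {
        "old": ["1111", "1111", "1112", "1113"],
        "new": ["111101", "111102", "111201", "111201"],
    }
)

# Old -> New: candidate codes in the new scheme for each old code.
old_to_new = trans.groupby("old")["new"].agg(list).to_dict()

# New -> Old: candidate codes in the old scheme for each new code.
new_to_old = trans.groupby("new")["old"].agg(list).to_dict()

print(old_to_new)
print(new_to_old)
```

Here old code `1111` has two candidates in the new scheme, while new code `111201` has two candidates in the old scheme, so replication is needed whichever way the table is read.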

Key Features

| Feature | Benefit |
| --- | --- |
| Probability weights | Naive, frequency, and ML-based weights |
| ML validation | cat2cat_ml_run() reports accuracy, Brier, and mean P(true class) |
| Multi-period chaining | Harmonise 3+ waves iteratively |
| Regression support | summary_c2c() adjusts inference for replicated-data workflows |
| Aggregated workflows | Harmonisation tools for grouped data use-cases |
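For orientation, the three validation metrics named above can be computed by hand from predicted class probabilities. This is a sketch using common textbook definitions on made-up numbers; the exact definitions used by cat2cat_ml_run() may differ (for example in how the Brier score is normalised).

```python
import numpy as np

# Hypothetical predicted class probabilities for 4 observations and 3 classes.
proba = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.6, 0.3],
        [0.3, 0.3, 0.4],
        [0.5, 0.4, 0.1],
    ]
)
y_true = np.array([0, 1, 2, 1])  # index of the true class per observation

# Accuracy: share of observations whose most probable class is the true one.
accuracy = float(np.mean(proba.argmax(axis=1) == y_true))

# Multiclass Brier score: mean squared distance between the predicted
# probabilities and the one-hot encoding of the true class (lower is better).
one_hot = np.eye(proba.shape[1])[y_true]
brier = float(np.mean(np.sum((proba - one_hot) ** 2, axis=1)))

# Mean probability assigned to the true class (higher is better).
mean_p_true = float(np.mean(proba[np.arange(len(y_true)), y_true]))

print(accuracy, brier, mean_p_true)
```

Accuracy only looks at the top prediction, whereas the Brier score and the mean P(true class) also reflect how well calibrated the probabilities are, which matters when the probabilities themselves are used as weights.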

References

Ecosystem

| Resource | Details |
| --- | --- |
| R Package | CRAN, production-ready |
| Python Package | PyPI |
| Documentation | API and guides |

Documentation

Installation

pip install cat2cat

Quick Start

from pandas import concat
from cat2cat import cat2cat, cat2cat_ml_run
from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
from cat2cat.datasets import load_occup, load_trans
from sklearn.ensemble import RandomForestClassifier

# Load the bundled example data: occupational survey and transition table
occup = load_occup()
trans = load_trans()

# Split the survey into the two periods to be harmonised
old = occup.loc[occup.year == 2008, :].copy()
new = occup.loc[occup.year == 2010, :].copy()

# Describe the data and the mapping between the two coding schemes
data = cat2cat_data(old=old, new=new, cat_var_old="code", cat_var_new="code", time_var="year")
mappings = cat2cat_mappings(trans=trans, direction="backward")

# Harmonise, then stack both periods into one dataset
c2c = cat2cat(data=data, mappings=mappings)
harmonised = concat([c2c["old"], c2c["new"]])

# Cast education to string so it can serve as a feature column
new["edu_group"] = new["edu"].astype(str)
old["edu_group"] = old["edu"].astype(str)
# Configure the ML models used to derive ML-based weights
ml = cat2cat_ml(
    data=new,
    cat_var="code",
    features=["salary", "age", "edu_group"],
    models=[RandomForestClassifier(n_estimators=50, random_state=1234)],
    on_fail="freq",
    fail_warn=True,
)

# Run the ML validation and print the diagnostics
diagnostics = cat2cat_ml_run(mappings=mappings, ml=ml)
print(diagnostics)

Citation

If you use cat2cat in your research, please cite:

Nasinski M, Gajowniczek K (2023). "cat2cat: Handling an Inconsistently Coded
Categorical Variable in a Longitudinal Dataset." SoftwareX, 24, 101525.
doi:10.1016/j.softx.2023.101525

@article{nasinski2023cat2cat,
  title={cat2cat: Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset},
  author={Nasinski, Maciej and Gajowniczek, Krzysztof},
  journal={SoftwareX},
  volume={24},
  pages={101525},
  year={2023},
  doi={10.1016/j.softx.2023.101525}
}

Contributing

Interested in contributing? Check the contributing guidelines and code of conduct.

License

cat2cat is licensed under the Apache License 2.0. See LICENSE.
