
Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset.


cat2cat



The cat2cat procedure maps a categorical variable between two time points according to a mapping (transition) table. The mapping (transition) table should have at least one candidate for each category in the period targeted for the update. The main rule is to replicate an observation if it could be assigned to several categories, and then to use simple frequencies or statistical methods to approximate the probability of it being assigned to each of them.

This algorithm was invented and implemented in the paper by Nasinski, Majchrowska and Broniatowska (2020).

For more details, please read the paper by Nasinski and Gajowniczek (2023).
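The replication rule described above can be illustrated with a minimal pure-Python sketch. It is not the package's implementation: the toy codes, the `replicate` helper, and the equal-shares fallback for unobserved candidates are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy inputs, not taken from the cat2cat datasets.
# One old code may have several candidate new codes.
to_new = {"A": ["A1", "A2"], "B": ["B1"]}

# Codes observed in the new period, used as the base for frequencies.
codes_new = ["A1", "A1", "A1", "A2", "B1", "B1"]
freqs = Counter(codes_new)

# Observations from the old period that need new codes.
old_obs = [{"id": 1, "code": "A"}, {"id": 2, "code": "B"}]


def replicate(obs, mapping, freqs):
    """Replicate one observation across its candidate categories."""
    candidates = mapping[obs["code"]]
    counts = [freqs.get(c, 0) for c in candidates]
    total = sum(counts)
    rows = []
    for cand, count in zip(candidates, counts):
        # Assumption: use equal shares when no candidate was observed.
        weight = count / total if total > 0 else 1 / len(candidates)
        rows.append({**obs, "new_code": cand, "weight": weight})
    return rows


replicated = [row for obs in old_obs for row in replicate(obs, to_new, freqs)]
for row in replicated:
    print(row)
```

The observation with code "A" is replicated twice, with weights proportional to how often each candidate appears in the new period, while the observation with code "B" keeps a single row with weight one.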

Installation

$ pip install cat2cat

Usage

For more examples and descriptions, please visit the example notebook.

Load example data

# cat2cat datasets
from cat2cat.datasets import load_trans, load_occup
trans = load_trans()
occup = load_occup()

Low-level functions

from cat2cat.mappings import get_mappings, get_freqs, cat_apply_freq

# convert the mapping table to two association lists
mappings = get_mappings(trans)
# get the frequencies of the variable levels
codes_new = occup.code[occup.year == 2010].values
freqs = get_freqs(codes_new)
# apply the frequencies to the (one) association list
mapp_new_p = cat_apply_freq(mappings["to_new"], freqs)

# mappings for a specific category
mappings["to_new"]['3481']
# probability mappings for a specific category
mapp_new_p['3481']
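To make the idea of "two association lists" more concrete, here is a small sketch that builds them from a toy transition table. The `build_association_lists` helper and the example codes are hypothetical, and the `to_old`/`to_new` key names only mirror the usage shown above; this is not how `get_mappings` is necessarily implemented.

```python
from collections import defaultdict

# Hypothetical transition table as (old_code, new_code) pairs.
trans_pairs = [
    ("1111", "111101"),
    ("1111", "111102"),
    ("1112", "111102"),
]


def build_association_lists(pairs):
    """Build lookups in both directions from (old, new) code pairs."""
    to_new = defaultdict(list)  # old code -> candidate new codes
    to_old = defaultdict(list)  # new code -> candidate old codes
    for old, new in pairs:
        # Skip duplicates so each candidate is listed once.
        if new not in to_new[old]:
            to_new[old].append(new)
        if old not in to_old[new]:
            to_old[new].append(old)
    return {"to_old": dict(to_old), "to_new": dict(to_new)}


toy_mappings = build_association_lists(trans_pairs)
print(toy_mappings["to_new"]["1111"])
print(toy_mappings["to_old"]["111102"])
```

Keeping both directions available means the same transition table can serve a backward or a forward update without being rebuilt.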

cat2cat function

from cat2cat import cat2cat
from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml

from pandas import concat

# split the panel by the time variable
# here only two periods
o_old = occup.loc[occup.year == 2008, :].copy()
o_new = occup.loc[occup.year == 2010, :].copy()

# dataclasses, core arguments for the cat2cat function
data = cat2cat_data(
    old=o_old,
    new=o_new,
    cat_var_old="code",
    cat_var_new="code",
    time_var="year",
)
mappings = cat2cat_mappings(trans=trans, direction="backward")

# apply the cat2cat procedure
c2c = cat2cat(data = data, mappings = mappings)
# pandas.concat used to bind per period datasets
data_final = concat([c2c["old"], c2c["new"]])
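Because observations may be replicated, later analysis typically has to use the weights rather than raw row counts. The sketch below shows this on a toy frame; the column names `index_c2c`, `g_new_c2c` and `wei_freq_c2c` are assumptions about the output layout and should be checked against the columns actually present in `data_final`.

```python
import pandas as pd

# Toy stand-in for one replicated period; the column names are assumptions.
toy = pd.DataFrame(
    {
        "index_c2c": [0, 0, 1],  # identifier of the original observation
        "g_new_c2c": ["A1", "A2", "B1"],  # unified category
        "wei_freq_c2c": [0.75, 0.25, 1.0],  # frequency-based weight
    }
)

# The weights of the replicas of each original observation should sum to one.
weight_sums = toy.groupby("index_c2c")["wei_freq_c2c"].sum()

# Weighted counts per unified category, instead of raw row counts.
weighted_counts = toy.groupby("g_new_c2c")["wei_freq_c2c"].sum()

print(weight_sums)
print(weighted_counts)
```

Summing the weights rather than counting rows keeps the total equal to the number of original observations, here two.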

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

cat2cat was created by Maciej Nasinski. It is licensed under the terms of the MIT license.

Credits

cat2cat was created with cookiecutter and the py-pkgs-cookiecutter template.
