Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset.
Project description
cat2cat
Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset
The cat2cat procedure maps a categorical variable between two time points according to a mapping (transition) table. The mapping (transition) table should provide at least one candidate for each category in the period targeted for the update. The main rule is to replicate an observation if it could be assigned to several categories, and then to use simple frequencies or statistical methods to approximate the probability of it being assigned to each of them.
This algorithm was invented and implemented in the paper by Nasinski, Majchrowska and Broniatowska (2020).
For more details, please read the paper by Nasinski and Gajowniczek (2023).
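The replication rule described above can be illustrated with a minimal plain-Python sketch. It is not the package's implementation: the toy codes, the `probs` mapping, and the `new_code` and `weight` field names are all hypothetical and chosen only for illustration.

```python
# Hypothetical probability mapping: old code -> {candidate new code: probability}.
# In practice such probabilities would come from frequencies or a statistical model.
probs = {"A": {"X": 0.75, "Y": 0.25}, "B": {"Z": 1.0}}

# Hypothetical observations from the period that is being updated.
old_obs = [
    {"id": 1, "code": "A"},
    {"id": 2, "code": "A"},
    {"id": 3, "code": "B"},
]


def replicate(obs, probs):
    """Return one copy of the observation per candidate category, with a weight."""
    return [
        {**obs, "new_code": candidate, "weight": p}
        for candidate, p in probs[obs["code"]].items()
    ]


replicated = [row for obs in old_obs for row in replicate(obs, probs)]

# The weights of each observation's copies sum to 1, so the total weight
# equals the number of original observations.
total_weight = sum(row["weight"] for row in replicated)
```

Because every original observation contributes a total weight of one, weighted statistics computed on the replicated rows stay on the scale of the original sample.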
Installation
$ pip install cat2cat
Usage
For more examples and descriptions, please visit the example notebook.
Load example data
# cat2cat datasets
from cat2cat.datasets import load_trans, load_occup
trans = load_trans()
occup = load_occup()
Low-level functions
from cat2cat.mappings import get_mappings, get_freqs, cat_apply_freq
# convert the mapping table to two association lists
mappings = get_mappings(trans)
# get the frequencies of the variable's levels
codes_new = occup.code[occup.year == 2010].values
freqs = get_freqs(codes_new)
# apply the frequencies to the (one) association list
mapp_new_p = cat_apply_freq(mappings["to_new"], freqs)
# mappings for a specific category
mappings["to_new"]['3481']
# probability mappings for a specific category
mapp_new_p['3481']
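To make the three steps above more concrete, here is a hedged, standard-library sketch of the same idea: turn a transition table into two association lists, count category frequencies, and convert the counts into probabilities per candidate. The helpers `build_mappings` and `apply_freq` are hypothetical stand-ins, not the package's functions, and the equal-weight fallback for unobserved candidates is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical transition table as (old_code, new_code) pairs.
trans_pairs = [("1111", "A1"), ("1111", "A2"), ("2222", "A2"), ("3333", "B1")]


def build_mappings(pairs):
    """Build two association lists: old -> new candidates and new -> old candidates."""
    to_new, to_old = {}, {}
    for old, new in pairs:
        if new not in to_new.setdefault(old, []):
            to_new[old].append(new)
        if old not in to_old.setdefault(new, []):
            to_old[new].append(old)
    return {"to_old": to_old, "to_new": to_new}


def apply_freq(assoc, freqs):
    """Turn candidate counts into probabilities for each key of an association list."""
    result = {}
    for key, candidates in assoc.items():
        counts = [freqs.get(c, 0) for c in candidates]
        total = sum(counts)
        if total == 0:
            # Assumption: equal probabilities when no candidate was observed.
            result[key] = [1 / len(candidates)] * len(candidates)
        else:
            result[key] = [c / total for c in counts]
    return result


# Hypothetical codes observed in the base (new) period.
observed_new_codes = ["A1", "A2", "A2", "A2", "B1"]
toy_freqs = Counter(observed_new_codes)

toy_mappings = build_mappings(trans_pairs)
toy_probs = apply_freq(toy_mappings["to_new"], toy_freqs)
```

In this toy data, the old code "1111" has two candidates, and the more frequent candidate in the base period receives the larger probability.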
cat2cat function
from cat2cat import cat2cat
from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
from pandas import concat
# split the panel by the time variable
# here only two periods
o_old = occup.loc[occup.year == 2008, :].copy()
o_new = occup.loc[occup.year == 2010, :].copy()
# dataclasses, core arguments for the cat2cat function
data = cat2cat_data(
    old = o_old,
    new = o_new,
    cat_var_old = "code",
    cat_var_new = "code",
    time_var = "year"
)
mappings = cat2cat_mappings(trans = trans, direction = "backward")
# apply the cat2cat procedure
c2c = cat2cat(data = data, mappings = mappings)
# pandas.concat used to bind per period datasets
data_final = concat([c2c["old"], c2c["new"]])
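Once the periods are bound together, replicated rows are typically analysed with their weights rather than as plain rows. The sketch below shows weighted counts and weighted means on a toy pandas frame; the column names `unified_code`, `weight` and `salary` are hypothetical, so the actual column names in the cat2cat output should be checked before adapting it.

```python
import pandas as pd

# Toy stand-in for a replicated dataset; all column names are hypothetical.
df = pd.DataFrame(
    {
        "unified_code": ["X", "Y", "X", "Z"],
        "weight": [0.75, 0.25, 1.0, 1.0],
        "salary": [1000.0, 1000.0, 2000.0, 3000.0],
    }
)

# Weighted number of observations per unified category.
weighted_counts = df.groupby("unified_code")["weight"].sum()

# Weighted mean of a numeric variable per unified category.
weighted_sums = (df["salary"] * df["weight"]).groupby(df["unified_code"]).sum()
weighted_means = weighted_sums / weighted_counts
```

Using the weights keeps replicated observations from being counted more than once in aggregate statistics.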
Contributing
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
cat2cat was created by Maciej Nasinski. It is licensed under the terms of the MIT license.
Credits
cat2cat was created with cookiecutter and the py-pkgs-cookiecutter template.
File details
Details for the file cat2cat-0.1.6.tar.gz.
File metadata
- Download URL: cat2cat-0.1.6.tar.gz
- Upload date:
- Size: 2.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4f6e394434ad3242f45a8338fad91c9f2dfb8064c4f589dff3d228184ad3aa50
MD5 | 55beb590f45ca00df79364b4f5245849
BLAKE2b-256 | 3419c2dd628001b628bba0d043ca75a8992ec4033335d45d9af1d25ea8f7bb18
File details
Details for the file cat2cat-0.1.6-py3-none-any.whl.
File metadata
- Download URL: cat2cat-0.1.6-py3-none-any.whl
- Upload date:
- Size: 2.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4bff1e794304ff83c0ab6266d29685e5cda4311adf6e33014e8352e50a7bc67b
MD5 | b1f6de7e0a8ed939bd3e42754eb31dff
BLAKE2b-256 | fb764497606e9bf0c1390c882e633293bb1e5b965e5fa61bfec5c8c43324d657