Probabilistic type inference
1 Introduction
This repository provides the source code of a Python package for ptype and its extension ptype-cat.
1.1 ptype
ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.
Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.
ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
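The contrast between accept/reject matching and probabilistic scoring can be illustrated with a small, self-contained sketch. This is not ptype's implementation or API: the two-state machines, the `make_pfsm`, `string_prob` and `column_posterior` helpers, the alphabets, the stopping probability and the anomaly weight `eps` are all assumptions chosen for illustration. Each toy machine assigns a probability to a string, and a column's type is scored by combining the per-value probabilities while leaving a small amount of mass for anomalous values.

```python
import math
import string


def make_pfsm(alphabet, stop_prob=0.5):
    """Toy two-state PFSM: emit one or more symbols from `alphabet`, then stop."""
    n = len(alphabet)
    return {
        "alphabet": set(alphabet),
        "first": 1.0 / n,                # start -> body, uniform over symbols
        "again": (1.0 - stop_prob) / n,  # body -> body, uniform over symbols
        "stop": stop_prob,               # probability of halting in the body state
    }


def string_prob(pfsm, value):
    """Probability that the machine generates exactly `value` (0.0 if it cannot)."""
    if not value or any(ch not in pfsm["alphabet"] for ch in value):
        return 0.0
    return pfsm["first"] * pfsm["again"] ** (len(value) - 1) * pfsm["stop"]


def column_posterior(values, machines, anomaly, eps=0.01):
    """Posterior over column types under a uniform prior (illustrative only).

    Each value is explained by the type machine with weight 1 - eps or by the
    broad anomaly machine with weight eps, so one odd value cannot rule a type
    out. Values are assumed non-empty and within the anomaly machine's alphabet.
    """
    log_scores = {
        name: sum(
            math.log((1 - eps) * string_prob(m, v) + eps * string_prob(anomaly, v))
            for v in values
        )
        for name, m in machines.items()
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical machines for two types, plus a catch-all anomaly machine.
MACHINES = {
    "integer": make_pfsm(string.digits),
    "string": make_pfsm(string.ascii_letters + string.digits),
}
ANOMALY = make_pfsm(string.printable)

column = ["1", "22", "x9", "305"]
posterior = column_posterior(column, MACHINES, ANOMALY)
# Values the integer machine cannot generate are candidates for "anomalous".
anomalies = [v for v in column if string_prob(MACHINES["integer"], v) == 0.0]
print(posterior)
print(anomalies)
```

In this toy column, a regular expression for integers would reject the column outright because of "x9", whereas the probabilistic score still favours the integer type and singles out "x9" as the value that type cannot explain.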
If you use this package, please cite the ptype paper, using the following BibTeX entry:
@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher K I and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume={34},
  number={3},
  pages={870--904},
  doi={10.1007/s10618-020-00680-1},
}
1.2 ptype-cat
A weakness of ptype is that it does not handle type inference well for non-Boolean categorical variables. For example, most existing methods, including ptype, treat columns such as “Class Name” and “Rating” as string and integer types respectively, rather than as categoricals. The user then has to convert the assigned types manually.
To (semi-)automate this manual task, we introduce ptype-cat, an extension of ptype that detects the general categorical type, including non-Boolean categorical variables. When ptype labels a column as integer or string, ptype-cat combines the output of ptype with additional features, such as the number of unique values in the column, and runs a logistic regression classifier to determine whether the column denotes a categorical variable.
Please see the ptype-cat paper for details. To cite it, use the following BibTeX entry:
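The classification step can be sketched as follows. This is an illustration of the general idea only, not the ptype-cat model: the feature set (two type probabilities plus the ratio of unique values), the helper names `column_features` and `categorical_probability`, and the weights and bias are all hypothetical and hand-picked rather than learned from data.

```python
import math


def column_features(values, p_integer, p_string):
    """Build a small feature vector for one column.

    `p_integer` and `p_string` stand in for type probabilities produced by a
    type-inference step; the last feature is the fraction of distinct values.
    """
    unique_ratio = len(set(values)) / len(values)
    return [p_integer, p_string, unique_ratio]


def categorical_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: a high share of unique values argues against "categorical".
WEIGHTS = [0.0, 0.0, -8.0]
BIAS = 2.0

ratings = ["1", "2", "3", "4", "5"] * 20   # few distinct values, repeated often
ids = [str(i) for i in range(100)]         # every value is distinct

p_ratings = categorical_probability(column_features(ratings, 0.99, 0.01), WEIGHTS, BIAS)
p_ids = categorical_probability(column_features(ids, 0.99, 0.01), WEIGHTS, BIAS)
print(p_ratings > 0.5, p_ids > 0.5)
```

With these made-up parameters, the repeated ratings column scores above 0.5 and the identifier-like column scores below it, even though a type-inference step would label both as integer.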
@inproceedings{ptype-cat,
  title={ptype-cat: Inferring the Type and Values of Categorical Variables},
  author={Ceritli, Taha and Williams, Christopher K I},
  booktitle={21st ECML-PKDD Automating Data Science Workshop},
  year={2021},
}
2 Install requirements
You can simply install ptype from PyPI:
pip install ptype
3 Usage
See the demo notebooks in the notebooks folder. You can also view them online via Binder.