
Probabilistic type inference


1 Introduction

This repository provides the source code of a Python package for ptype and its extension ptype-cat.

1.1 ptype

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation.png

In the right-hand figure, normal, missing and anomalous values are shown in green, yellow and red, respectively.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
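The idea of weighting competing type assignments can be illustrated with a toy sketch. It is not ptype's implementation: the one-state machines, the alphabets, the missing-value tokens, the mixture weights and the stop probability are all assumptions made up for illustration. Each value is scored under a mixture of a "normal" machine for the candidate type, a missing-value model and an anomaly model; the per-type column likelihoods are then normalised into a posterior.

```python
import math
import string

# Illustrative assumptions throughout: these tokens, weights and alphabets
# are invented for this sketch and are not ptype's actual parameters.
MISSING_TOKENS = ("", "NA", "NULL", "-")
WEIGHTS = {"normal": 0.90, "missing": 0.05, "anomaly": 0.05}
TYPE_ALPHABETS = {
    "integer": string.digits,
    "string": string.ascii_letters + string.digits + " ",
}


def machine_prob(value, alphabet, stop=0.2):
    """Probability that a one-state machine over `alphabet` emits `value`."""
    if not value or any(ch not in alphabet for ch in value):
        return 0.0
    emit = 1.0 / len(alphabet)
    # Emit each character, continue between characters, then stop.
    return (emit ** len(value)) * ((1.0 - stop) ** (len(value) - 1)) * stop


def components(value, type_name):
    """Weighted probability of `value` under each role, given a column type."""
    missing = 1.0 / len(MISSING_TOKENS) if value in MISSING_TOKENS else 0.0
    return {
        "normal": WEIGHTS["normal"] * machine_prob(value, TYPE_ALPHABETS[type_name]),
        "missing": WEIGHTS["missing"] * missing,
        "anomaly": WEIGHTS["anomaly"] * machine_prob(value, string.printable),
    }


def column_posterior(values):
    """Posterior over column types, assuming a uniform prior over types."""
    log_lik = {}
    for type_name in TYPE_ALPHABETS:
        total = 0.0
        for value in values:
            p = sum(components(value, type_name).values())
            total += math.log(max(p, 1e-300))  # floor avoids log(0)
        log_lik[type_name] = total
    top = max(log_lik.values())
    unnorm = {t: math.exp(ll - top) for t, ll in log_lik.items()}
    norm = sum(unnorm.values())
    return {t: u / norm for t, u in unnorm.items()}


def label_values(values, type_name):
    """Most probable role (normal, missing or anomaly) for each value."""
    labels = []
    for value in values:
        comps = components(value, type_name)
        labels.append(max(comps, key=comps.get))
    return labels


column = ["12", "7", "NA", "305", "x1"]
posterior = column_posterior(column)
best_type = max(posterior, key=posterior.get)
roles = label_values(column, best_type)
print(posterior, best_type, roles)
```

Unlike a regular expression for integers, which would simply reject this column because of "NA" and "x1", the mixture still assigns the column to the integer type and flags those two entries conditional on that type.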

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher K I and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870--904},
  doi = {10.1007/s10618-020-00680-1},
}

1.2 ptype-cat

A weakness of ptype is that it handles type inference poorly for non-Boolean categorical variables. For example, most existing methods, including ptype, treat the “Class Name” and “Rating” columns in the example below as string and integer types respectively, rather than as categoricals. The user therefore has to convert the assigned types manually.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation-ptype-cat.png

The data on the left-hand side are sampled from a dataset about clothing.

To (semi-)automate this manual task, we introduce ptype-cat, an extension of ptype that detects the general categorical type, including non-Boolean categorical variables. When ptype labels a column with the integer or string type, ptype-cat combines the output of ptype with additional features, such as the number of unique values in the column, and runs a Logistic Regression classifier to determine whether the column denotes a categorical variable.
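As a rough sketch of this kind of decision rule, the snippet below scores a column with a logistic function over a few simple features. The feature names, the coefficients and the function names are hypothetical and hand-picked for illustration; the actual features and fitted weights are those described in the ptype-cat paper, not the ones shown here.

```python
import math

# Hypothetical, hand-picked coefficients for illustration only.
COEFFS = {"bias": 3.0, "unique_ratio": -6.0, "log_n_unique": -0.5, "p_string": 0.5}


def features(values, type_posterior):
    """Simple column features: uniqueness statistics plus a type posterior."""
    if not values:
        raise ValueError("column must contain at least one value")
    n_unique = len(set(values))
    return {
        "unique_ratio": n_unique / len(values),
        "log_n_unique": math.log(n_unique),
        "p_string": type_posterior.get("string", 0.0),
    }


def categorical_probability(values, type_posterior):
    """Logistic score that the column denotes a categorical variable."""
    feats = features(values, type_posterior)
    score = COEFFS["bias"] + sum(COEFFS[name] * x for name, x in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


# A ratings-like column (few distinct values) versus an identifier-like one.
ratings = ["5", "4", "5", "3", "4", "5", "5", "4", "3", "5"]
ids = [str(1000 + i) for i in range(10)]
as_integer = {"integer": 1.0, "string": 0.0}

p_ratings = categorical_probability(ratings, as_integer)
p_ids = categorical_probability(ids, as_integer)
print(round(p_ratings, 3), round(p_ids, 3))
```

With these made-up weights, a column with few distinct values relative to its length scores above 0.5, while a column of all-distinct identifiers scores well below it.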

Please see the ptype-cat paper for details. To cite it, use the following BibTeX entry:

@inproceedings{ptype-cat,
  title={ptype-cat: Inferring the Type and Values of Categorical Variables},
  author={Ceritli, Taha and Williams, Christopher K I},
  booktitle={21st ECML-PKDD Automating Data Science Workshop},
  year={2021},
}

2 Install requirements

You can install ptype from PyPI:

pip install ptype

3 Usage

See the demo notebooks in the notebooks folder. You can also view them online via Binder.
