Probabilistic type inference
Project description
1 Introduction
ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.
Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.
ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
If you use this package, please cite the ptype paper, using the following BibTeX entry:
@article{ceritli2020ptype, title={ptype: probabilistic type inference}, author={Ceritli, Taha and Williams, Christopher KI and Geddes, James}, journal={Data Mining and Knowledge Discovery}, year={2020}, volume = {34}, number = {3}, pages={870–-904}, doi = {10.1007/s10618-020-00680-1}, }
2 Install requirements
You can simply install ptype from PyPI:
pip install ptype
3 Usage
See demo notebooks in notebooks folder. View them online via Binder.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.