A library that executes SortingHat feature type inference on Pandas dataframes
SortingHatInf
SortingHatInf is a library that implements ML-based feature type inference, as described in the accompanying paper. Feature type inference is the task of predicting the feature type of each column in a given dataset.
Library for ML feature type inference: https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference.
Feature Types
SortingHat
numeric
categorical
datetime
sentence
url
embedded-number
list
not-generalizable
context-specific
Extended
Same as SortingHat except:
numeric is mapped to integer or floating
categorical is mapped to boolean if Boolean
ARFF
Nominal-specification (Categorical)
INTEGER
REAL (Float)
STRING
IGNORE (Not-Generalizable)
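To make the Extended refinement concrete, here is a minimal sketch of how a SortingHat label might be narrowed by inspecting a column's values. The helper `refine_type` is hypothetical and is not part of sortinghatinf. The exact label spellings and the rules used here (whole-valued numbers become integer, columns holding only True/False become boolean) are assumptions for illustration, and the library may decide differently.

```python
import math


def refine_type(sortinghat_type, values):
    """Hypothetical helper: narrow a SortingHat label to an Extended label."""
    # Ignore missing entries (None or NaN) when inspecting the values.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if sortinghat_type == "numeric":
        # Assumption: whole-valued columns count as integer, the rest as floating.
        if all(float(v).is_integer() for v in present):
            return "integer"
        return "floating"
    if sortinghat_type == "categorical":
        # Assumption: a non-empty column holding only True/False is Boolean.
        if present and all(isinstance(v, bool) for v in present):
            return "boolean"
        return "categorical"
    # All other SortingHat types pass through unchanged.
    return sortinghat_type


print(refine_type("numeric", [1, 2, None]))
print(refine_type("numeric", [1.5, 2.0]))
print(refine_type("categorical", [True, False]))
print(refine_type("url", ["https://example.org"]))
```

Only numeric and categorical are refined, matching the two exceptions listed above. Every other type keeps its SortingHat label.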
Example Usage with OpenML
Here, we run feature type inference on a dataset obtained from OpenML. Note: this can be done with any dataset loaded as a Pandas dataframe, but we use OpenML here as an example.
- First ensure pip, wheel, and setuptools are up-to-date.
python -m pip install --upgrade pip setuptools wheel
- Install the package using python-pip.
pip install sortinghatinf
- Import the library.
import sortinghatinf
- Install the OpenML python API.
pip install openml
- Import the OpenML python library.
import openml
- Load the 'Blood Transfusion Service Center' dataset from OpenML (dataset_id=31). Note: This requires an OpenML account, which you can set up by following this link.
data = openml.datasets.get_dataset(dataset_id=31)
X, _, _, _ = data.get_data() # Loaded as Pandas dataframe
- Infer the feature types for the data columns.
# Infer the SortingHat feature types.
infer_sh = sortinghatinf.get_sortinghat_types(X)
# Infer the extended feature types.
infer_ext = sortinghatinf.get_expanded_feature_types(X)
# Infer the ARFF feature types.
# The function `get_feature_types_as_arff()` also returns the SortingHat feature types.
infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(X)
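Once the types are inferred, it can help to see which columns received which label. The sketch below assumes that each inference function returns one label per column, in the same order as the dataframe's columns. That return format is not stated here, so check it against the library before relying on it. The helper `group_columns_by_type` is hypothetical, and the column names and labels in the example are stand-in values, not real library output.

```python
from collections import defaultdict


def group_columns_by_type(column_names, inferred_types):
    """Hypothetical helper: map each inferred type to the columns labeled with it."""
    if len(column_names) != len(inferred_types):
        raise ValueError("expected one inferred type per column")
    groups = defaultdict(list)
    for name, feature_type in zip(column_names, inferred_types):
        groups[feature_type].append(name)
    return dict(groups)


# Stand-in values for illustration only. With the library, you would pass
# list(X.columns) and infer_sh instead (assuming one label per column).
columns = ["age", "signup_date", "homepage", "income"]
types = ["numeric", "datetime", "url", "numeric"]
print(group_columns_by_type(columns, types))
```

The length check catches a mismatch early if the inferred labels turn out not to align one-to-one with the columns.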