A library that executes SortingHat feature type inference on Pandas dataframes
SortingHatInf
SortingHatInf is a library that implements the ML-based feature type inference described in the SortingHat paper. Feature type inference is the task of predicting the feature type of each column in a given dataset.
Library for ML feature type inference: https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference.
Feature Types
SortingHat
- Numeric
- Categorical
- Datetime
- Sentence
- URL
- Embedded Number
- List
- Not-Generalizable
- Context-Specific
Extended
Same as SortingHat except:
- Numeric: mapped to Integer or Floating
- Categorical: mapped to Boolean if Boolean
ARFF
- Integer
- Real (Float)
- Nominal-specification (Categorical)
- String
- Ignore (Not-Generalizable)
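The correspondence between the ARFF types and the other feature types can be sketched as a small lookup table. This is only an illustration, not the library's implementation: the pairs Floating to Real, Categorical to Nominal-specification, and Not-Generalizable to Ignore come from the ARFF list above, while the Integer entry, the `String` fallback for every other type, and the names `EXTENDED_TO_ARFF` and `to_arff_type` are assumptions made for this example.

```python
# Pairs stated in the ARFF list above.
EXTENDED_TO_ARFF = {
    "Floating": "Real",
    "Categorical": "Nominal-specification",
    "Not-Generalizable": "Ignore",
    # Assumption: an Integer column stays Integer in ARFF.
    "Integer": "Integer",
}


def to_arff_type(feature_type, default="String"):
    """Look up the ARFF type for a feature type.

    Assumption: any type without an explicit entry (for example
    Sentence or URL) falls back to `default`. The real library may
    treat such types differently.
    """
    return EXTENDED_TO_ARFF.get(feature_type, default)


arff_types = [to_arff_type(t) for t in ["Floating", "Categorical", "URL"]]
```

In practice, `sortinghatinf.get_feature_types_as_arff()` performs this conversion itself, so a table like this is useful mainly for reasoning about what the output means.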
Example Usage with OpenML
Here, we run feature type inference on a dataset obtained from OpenML. Note: this can be done with any dataset loaded as a Pandas dataframe, but we use OpenML here as an example.
- First, ensure pip, wheel, and setuptools are up to date.
python -m pip install --upgrade pip setuptools wheel
- Install the package using pip.
pip install sortinghatinf
- Import the library.
import sortinghatinf
- Install the OpenML Python API.
pip install openml
- Import the OpenML Python library.
import openml
- Load the 'Blood Transfusion Service Center' dataset from OpenML (dataset_id=31). Note: this requires an OpenML account, which you can set up by following this link.
data = openml.datasets.get_dataset(dataset_id=31)
X, _, _, _ = data.get_data() # Loaded as Pandas dataframe
- Infer the feature types for the data columns.
# Infer the SortingHat feature types.
infer_sh = sortinghatinf.get_sortinghat_types(X)
# Infer the extended feature types.
infer_ext = sortinghatinf.get_expanded_feature_types(X)
# Infer the ARFF feature types.
# The function `get_feature_types_as_arff()` also returns the SortingHat feature types.
infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(X)
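Once the types are inferred, it is often handy to see which columns received which type. The sketch below assumes that each inference function returns a list-like of type names with one entry per column, in the same order as `X.columns`; the text above does not confirm this, so check the actual return value first. The helper `group_columns_by_type` and the sample data are hypothetical and stand in for `list(X.columns)` and `infer_sh`.

```python
from collections import defaultdict


def group_columns_by_type(columns, inferred_types):
    """Group column names by their inferred feature type.

    Assumes `inferred_types` has one entry per column, in column order.
    """
    if len(columns) != len(inferred_types):
        raise ValueError("expected exactly one inferred type per column")
    groups = defaultdict(list)
    for name, feature_type in zip(columns, inferred_types):
        groups[feature_type].append(name)
    return dict(groups)


# Hypothetical stand-ins for list(X.columns) and infer_sh.
example_columns = ["age", "city", "signup_date", "zip_code"]
example_types = ["Numeric", "Categorical", "Datetime", "Categorical"]

grouped = group_columns_by_type(example_columns, example_types)
for feature_type, names in grouped.items():
    print(f"{feature_type}: {', '.join(names)}")
```

With real data, the call would look like `group_columns_by_type(list(X.columns), infer_sh)`, provided the assumption about the return value holds.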