A library that executes SortingHat feature type inference on Pandas dataframes

SortingHatInf

SortingHatInf is a library that implements ML-based feature type inference as seen in the paper here. Feature type inference is the task of predicting the feature types of the columns of a given dataset.

Feature Types

SortingHat

  • numeric
  • categorical
  • datetime
  • sentence
  • url
  • embedded-number
  • list
  • not-generalizable
  • context-specific

Expanded

Same as SortingHat except:

  • numeric mapped to integer or floating
  • categorical mapped to boolean if the column is Boolean
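
As a rough illustration of the mapping described above, the sketch below refines a SortingHat type using the column's values. It is not the library's implementation: the helper name `expand_type`, the rule for deciding integer versus floating, and the exact label strings are all assumptions made for this example.

```python
from typing import Iterable


def expand_type(sh_type: str, values: Iterable[object]) -> str:
    """Hypothetical helper: refine a SortingHat type into an expanded type."""
    present = [v for v in values if v is not None]  # ignore missing values

    if sh_type == "numeric":
        # Assumption: a column of whole Python ints counts as "integer";
        # anything else (including an empty column) counts as "floating".
        whole = bool(present) and all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        )
        return "integer" if whole else "floating"

    if sh_type == "categorical":
        # Assumption: a categorical column holding only True/False is Boolean.
        if present and all(isinstance(v, bool) for v in present):
            return "boolean"
        return "categorical"

    # Every other SortingHat type is left unchanged.
    return sh_type
```

For instance, `expand_type("numeric", [1, 2, 3])` yields `"integer"` under these rules, while `expand_type("url", [...])` passes through untouched.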

ARFF (loose)

  • Nominal-specification (Categorical)
  • INTEGER
  • REAL (Float)
  • STRING
  • IGNORE (Not-Generalizable)

API Documentation

get_sortinghat_types(df: pd.DataFrame) -> List[str] returns a list of the predicted SortingHat feature types on the columns of the specified Pandas dataframe
Ex. infer_sh = sortinghatinf.get_sortinghat_types(df)

> infer_sh  
> [  
>    'COL_TYPE_1',  
>    'COL_TYPE_2',  
>    ...  
> ]  
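
The returned list is easier to read when paired with the column names. The sketch below assumes the predictions come back in column order, one per column; the column names and predicted values are made-up placeholders, not real output from the library.

```python
# Hypothetical stand-ins for list(df.columns) and for the list returned by
# sortinghatinf.get_sortinghat_types(df); real values depend on your data.
columns = ["age", "signup_date", "homepage"]
infer_sh = ["numeric", "datetime", "url"]

# Assumption: there is one prediction per column, in column order.
if len(columns) != len(infer_sh):
    raise ValueError("expected one predicted type per column")

# Map each column name to its predicted feature type.
types_by_column = dict(zip(columns, infer_sh))
print(types_by_column)
```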

get_expanded_feature_types(df: pd.DataFrame) -> List[str] returns a list of the predicted feature types for the columns of the specified Pandas dataframe, with the SortingHat types mapped to the expanded types
Ex. infer_exp = sortinghatinf.get_expanded_feature_types(df)

> infer_exp  
> [    
>    'COL_TYPE_1',  
>    'COL_TYPE_2',  
>    ...   
> ]  

get_feature_types_as_arff(df: pd.DataFrame) -> Tuple[List[Tuple[str, Union[str, List[str]]]], List[str]] returns two values: the predicted SortingHat feature types mapped to the loose ARFF types, and the original predicted SortingHat feature types
Ex. infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(df)

> infer_arff  
> [  
>    ('COL_NAME_1', ['POSSIBLE_VALUE_1', 'POSSIBLE_VALUE_2', ...]), # NOMINAL  
>    ('COL_NAME_2', 'INTEGER'), # INTEGER  
>    ('COL_NAME_3', 'FLOAT'), # REAL  
>    ('COL_NAME_4', 'STRING'), # STRING  
>    ('COL_NAME_5', 'IGNORE'), # IGNORE  
>    ...  
> ]  

Note: Because ARFF expects a list of strings for categorical features, columns inferred to be categorical should be converted to string. This function reports such columns with an error.
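
To show how the `(name, type)` pairs relate to an ARFF file, here is a minimal sketch that renders them as ARFF header lines. The helper `arff_header` is hypothetical and not part of sortinghatinf. It assumes that `'FLOAT'` corresponds to ARFF's `REAL`, that `'IGNORE'` columns should be left out, and that attribute names and nominal values need no quoting.

```python
from typing import List, Tuple, Union

ArffAttribute = Tuple[str, Union[str, List[str]]]


def arff_header(relation: str, attributes: List[ArffAttribute]) -> str:
    """Hypothetical helper: render (name, type) pairs as ARFF header text."""
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attribute: ARFF lists the allowed values in braces.
            allowed = ",".join(kind)
            lines.append(f"@ATTRIBUTE {name} {{{allowed}}}")
        elif kind == "IGNORE":
            # Assumption: not-generalizable columns are dropped from the file.
            continue
        else:
            # Assumption: 'FLOAT' in the output stands for ARFF's REAL type.
            arff_type = "REAL" if kind == "FLOAT" else kind
            lines.append(f"@ATTRIBUTE {name} {arff_type}")
    return "\n".join(lines)


# Placeholder input shaped like infer_arff above.
print(arff_header("demo", [("color", ["red", "blue"]), ("count", "INTEGER")]))
```

Names or values containing spaces or commas would need quoting in a real ARFF file, which this sketch does not attempt.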

Example Usage with OpenML

Here, we run feature type inference on a dataset obtained from OpenML. Note: this can be done with any dataset loaded as a Pandas dataframe, but we use OpenML here as an example.

  1. First, ensure pip, wheel, and setuptools are up to date.
python -m pip install --upgrade pip setuptools wheel
  2. Install the package with pip.
pip install sortinghatinf
  3. Import the library.
import sortinghatinf
  4. Install the OpenML Python API.
pip install openml
  5. Import the OpenML Python library.
import openml
  6. Load the 'Blood Transfusion Service Center' dataset from OpenML (dataset_id=31). Note: This requires an OpenML account, which you can set up by following this link.
data = openml.datasets.get_dataset(dataset_id=31)
X, _, _, _ = data.get_data() # Loaded as Pandas dataframe
  7. Infer the feature types for the data columns.
# Infer the SortingHat feature types.
infer_sh = sortinghatinf.get_sortinghat_types(X)

# Infer the expanded feature types.
infer_exp = sortinghatinf.get_expanded_feature_types(X)

# Infer the ARFF feature types.
# The function `get_feature_types_as_arff()` also returns the SortingHat feature types.
infer_arff, infer_sh = sortinghatinf.get_feature_types_as_arff(X)
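
Once the types are inferred, a common next step is to group columns by type, for example to find the columns to drop or to encode. The sketch below is one way to do that with the standard library; `group_columns_by_type` is a hypothetical helper, the sample column names and types are placeholders, and it assumes the predictions are in column order.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_columns_by_type(
    columns: Sequence[str], types: Sequence[str]
) -> Dict[str, List[str]]:
    """Hypothetical helper: collect column names under each predicted type."""
    if len(columns) != len(types):
        raise ValueError("expected one predicted type per column")
    groups: Dict[str, List[str]] = defaultdict(list)
    for column, feature_type in zip(columns, types):
        groups[feature_type].append(column)
    return dict(groups)


# With the variables above this would be:
#   groups = group_columns_by_type(list(X.columns), infer_sh)
# Placeholder data is used here so the snippet runs on its own.
groups = group_columns_by_type(
    ["id", "amount", "balance", "notes"],
    ["not-generalizable", "numeric", "numeric", "sentence"],
)
print(groups.get("numeric", []))
```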

