
relational-datasets

A small library for loading and downloading relational datasets.

pip install relational-datasets


Beta Release

This API and the datasets hosted at https://github.com/srlearn/datasets/ are still experimental and may change.

Open enhancements and bugs are tracked in the GitHub issue trackers for this repository and for srlearn/datasets.

But here is a short-term roadmap:

  • Modes: srlearn/datasets: Issue 11
  • Converting propositional -> relational
    • Problem Settings
      • Binary Classification
      • Regression: y ∈ float
      • Multiclass Classification: when the target is an int in [0, 1, 2, ...]
    • Categorical datatype support in the X matrix
    • DataFrames: pandas

Use Case 1: Fetching Zipfiles

Running the fetch method downloads a version of a dataset to your local cache:

import relational_datasets

relational_datasets.fetch("toy_cancer")            # latest version
relational_datasets.fetch("toy_father", "v0.0.3")  # specific version
relational_datasets.fetch("cora")                  # latest version

Resulting in:

~/relational_datasets/
├── toy_cancer_v0.0.4.zip   <--- latest
├── toy_father_v0.0.3.zip   <--- specific version
└── cora_v0.0.4.zip         <--- latest
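
To see what is already cached, the downloaded archives can be listed with ordinary filesystem tools. A minimal sketch, assuming the default cache location shown above:

from pathlib import Path

# Assumes the default cache directory (~/relational_datasets/) shown above.
cache = Path.home() / "relational_datasets"

# Print which archives (and versions) are currently cached.
for archive in sorted(cache.glob("*.zip")):
    print(archive.name)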

Use Case 2: Loading Data

The load method returns train and test folds—each with pos, neg, and facts. Internally it uses fetch, so it will automatically download a dataset if it is not available.

For example: "Load fold-2 of webkb"

from relational_datasets import load

train, test = load("webkb", "v0.0.4", fold=2)

len(train.facts)
# 1344
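
Each fold is a RelationalDataset whose pos, neg, and facts attributes are lists of strings. Continuing from the snippet above, a quick way to inspect a fold (a sketch; exact counts depend on the dataset and fold):

# pos, neg, and facts are plain lists of strings, one clause each.
print(len(train.pos), len(train.neg), len(train.facts))
print(train.facts[0])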

Use Case 3: Working with Standard (Vector-based) Machine Learning Datasets

The relational_datasets.convert module has functions for converting vector-based datasets into relational/ILP-style datasets:

Binary Classification

When y is a vector of 0/1

from relational_datasets.convert import from_numpy
import numpy as np

data, modes = from_numpy(
  np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
  np.array([0, 0, 1]),
)

data, modes
(RelationalDataset(pos=['v4(id3).'], neg=['v4(id1).', 'v4(id2).'], facts=['v1(id1,0).', 'v1(id2,0).', 'v1(id3,1).', 'v2(id1,1).', 'v2(id2,1).', 'v2(id3,2).', 'v3(id1,1).', 'v3(id2,2).', 'v3(id3,2).']),
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).'])
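
Because the returned RelationalDataset and modes are plain lists of strings, they can be written straight to disk. Continuing from the snippet above, a minimal sketch (the directory and the train_pos.txt / train_neg.txt / train_facts.txt / train_bk.txt file names are illustrative, not required by the library):

from pathlib import Path

# Illustrative output location; choose whatever layout your learner expects.
out = Path("toy_train")
out.mkdir(exist_ok=True)

# Each attribute is a list of strings, one clause per line.
(out / "train_pos.txt").write_text("\n".join(data.pos) + "\n")
(out / "train_neg.txt").write_text("\n".join(data.neg) + "\n")
(out / "train_facts.txt").write_text("\n".join(data.facts) + "\n")
(out / "train_bk.txt").write_text("\n".join(modes) + "\n")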

Regression

When y is a vector of floats

from relational_datasets.convert import from_numpy
import numpy as np

data, modes = from_numpy(
  np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
  np.array([1.1, 0.9, 2.5]),
)

data, modes
(RelationalDataset(pos=['regressionExample(v4(id1),1.1).', 'regressionExample(v4(id2),0.9).', 'regressionExample(v4(id3),2.5).'], neg=[], facts=['v1(id1,0).', 'v1(id2,0).', 'v1(id3,1).', 'v2(id1,1).', 'v2(id2,1).', 'v2(id3,2).', 'v3(id1,1).', 'v3(id2,2).', 'v3(id3,2).']),
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).'])

Preprocessing scikit-learn's load_breast_cancer

load_breast_cancer is based on the Breast Cancer Wisconsin dataset.

Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# (1) Load
X, y = load_breast_cancer(return_X_y=True)

# (2) Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# (3) Discretize
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train)
X_test = disc.transform(X_test)
X_train = X_train.astype(int)
X_test = X_test.astype(int)

# (4) Convert
from relational_datasets.convert import from_numpy

train, modes = from_numpy(X_train, y_train)
test, _ = from_numpy(X_test, y_test)
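
As a quick sanity check (a sketch, not part of the library), every label should end up as either a positive or a negative example in the converted folds, and the modes describe the shared column schema:

# Each 0/1 label becomes either a negative or a positive example.
assert len(train.pos) + len(train.neg) == len(y_train)
assert len(test.pos) + len(test.neg) == len(y_test)

# The mode strings describe the columns; both folds share the same schema.
print(modes[:3])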

Install

From PyPi

pip install relational-datasets

From GitHub Source

git clone https://github.com/srlearn/relational-datasets.git
cd relational-datasets
pip install -e .

Contributions

This package was partially based on datasets from the Starling Lab Datasets Collection, which includes specific contributions by Harsha Kokel and Devendra Singh Dhami. Tushar Khot converted many of the datasets from the Alchemy 2 format to the ILP format, but that work predates version tracking. Some inspiration was drawn from the "RelationalDatasets" list collected by Jonas Schouterden.

