Exposes RDF datasets from SPARQL endpoints to machine learning models in convenient formats such as pandas dataframes


RDFframes

A Python library that enables data scientists to extract data from knowledge graphs encoded in RDF into familiar tabular formats using procedural Python abstractions. RDFframes provides an easy-to-use, efficient, and scalable API for users who are comfortable with the PyData (Python for Data) ecosystem but are not experts in SPARQL. The API calls are internally converted into optimized SPARQL queries, which are then executed on a local RDF engine or a remote SPARQL endpoint. The results are returned in a tabular format, such as a pandas dataframe.

Installation via pip

You can install the library directly via pip:

 $ pip install RDFframes

Getting started

First, create a KnowledgeGraph to specify any namespaces that will be used in the query and, optionally, the graph name and URI. For example:

from rdfframes.knowledge_graph import KnowledgeGraph

# Map each prefix used in the queries to its namespace URI.
graph = KnowledgeGraph(prefixes={
                               "swrc": "http://swrc.ontoware.org/ontology#",
                               "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                               "dc": "http://purl.org/dc/elements/1.1/",
                           })
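The same constructor accepts the optional graph name and URI mentioned above. A minimal sketch, assuming the graph_name and graph_uri keyword arguments used in the project's examples (check the signature of your installed version):

# Hypothetical example: scope generated queries to a named graph on the endpoint.
graph = KnowledgeGraph(graph_name='dblp',
                       graph_uri='http://dblp.l3s.de',
                       prefixes={'dc': 'http://purl.org/dc/elements/1.1/'})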

Then create a Dataset using one of our convenience functions. All the convenience functions are methods in the KnowledgeGraph class. For example, the following code retrieves all instances of the class swrc:InProceedings:

dataset = graph.entities(class_name='swrc:InProceedings',
                             new_dataset_name='papers',
                             entities_col_name='paper')

There are two types of datasets: ExpandableDataset and GroupedDataset. An ExpandableDataset represents a simple flat table, while a GroupedDataset is a table split into groups as a result of a group-by operation. The convenience functions on the KnowledgeGraph return an ExpandableDataset.

After instantiating a dataset, you can use the API to perform operations on it. For example, the following code retrieves all authors and titles of conference papers:

from rdfframes.dataset.rdfpredicate import RDFPredicate

# Each RDFPredicate pairs a predicate with the name of the column it fills.
dataset = dataset.expand(src_col_name='paper', predicate_list=[
        RDFPredicate('dc:title', 'title'),
        RDFPredicate('dc:creator', 'author'),
        RDFPredicate('swrc:series', 'conference')])

Using the group_by operation results in a GroupedDataset:

grouped_dataset = dataset.group_by(['author'])

Aggregation can be done on both an ExpandableDataset and a GroupedDataset. For example, the following code counts the number of papers per author and keeps only the authors with at least 20 papers:

# Assumed import path for AggregationData; check your installed version.
from rdfframes.dataset.aggregation_fn_data import AggregationData

grouped_dataset = grouped_dataset.count(aggregation_fn_data=[AggregationData('paper', 'papers_count')])\
        .filter(conditions_dict={'papers_count': ['>= 20']})
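
A dataset describes a query; nothing is sent to the engine or endpoint until it is executed. The sketch below shows one way to run the pipeline above and get a pandas dataframe back. It assumes the HttpClient API from the project's 0.9.x examples; the names used here (HttpClientDataFormat, endpoint_uri, return_format) are taken from those examples and may differ in other versions:

from rdfframes.client.http_client import HttpClient, HttpClientDataFormat

# Assumed client API: point it at a SPARQL endpoint and ask for pandas output.
output_format = HttpClientDataFormat.PANDAS_DF
client = HttpClient(endpoint_uri='http://localhost:8890/sparql/',  # hypothetical endpoint
                    return_format=output_format)

# Executing the dataset converts the accumulated API calls into one optimized
# SPARQL query, runs it on the endpoint, and returns the results as a dataframe.
df = grouped_dataset.execute(client, return_format=output_format)
print(df.head())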

Convenience functions to create an initial dataset

To create an initial Dataset, use one of the convenience functions. The API provides convenience functions that cover most machine learning and data analytics tasks, including the following (a combined usage sketch appears after the list):

KnowledgeGraph.classes_and_freq()

This function retrieves all the classes in the graph and the number of instances of each class. It returns a table of two columns: the first contains the name of the class and the second contains the frequency of the class.

KnowledgeGraph.features_and_freq(class_name)

Retrieves all the features of the instances of the class class_name and how many instances have each feature. This is critical for many machine learning tasks, since knowing how often each feature is observed helps decide which features to use.

KnowledgeGraph.entities(class_name)

Retrieves all the instances of the class class_name. This is the starting point for most machine learning models. The returned dataset contains one column holding the entities of the specified class and can be expanded to add features of the instances.

KnowledgeGraph.features(class_name)

Retrieves all the features of the class class_name. This function can be used to explore the dataset and learn what features are available in the data for a specific class.

KnowledgeGraph.entities_and_features(class_name, features)

Retrieves all instances of the class class_name together with the features of these instances specified in the list features.

KnowledgeGraph.num_entities(class_name)

Returns the number of instances of the class class_name in the dataset.

KnowledgeGraph.feature_domain_range(feature)

Retrieves the domain (subjects) and the range (objects) of the predicate feature occurring in the dataset.

KnowledgeGraph.describe_entity(entity)

Returns the class and features of the entity.
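
Putting a few of these together, a typical exploration session on the graph from the Getting Started section might look like the following sketch. Each call returns a dataset describing a query, to be executed as shown earlier; the feature list passed to entities_and_features is hypothetical:

# Survey the graph before committing to a feature set.
classes = graph.classes_and_freq()                     # class name + frequency
paper_features = graph.features('swrc:InProceedings')  # available paper features

# Fetch paper entities together with a chosen subset of their features
# (this feature list is hypothetical and for illustration only).
papers = graph.entities_and_features('swrc:InProceedings',
                                     ['dc:title', 'dc:creator'])

# Inspect the subjects and objects that dc:creator connects.
creator_usage = graph.feature_domain_range('dc:creator')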

