
Package to transform structured data into RDF graph form

Project description

quadipy


quadipy is a Python package that helps transform structured data into RDF graph format.

We built quadipy to enable developers to build a config-based ingestion pipeline into an RDF data store (think FiveTran or Stitch, but for RDF). quadipy won't explicitly handle connections to different systems, but it lets you configure the RDF data you want to create from any data source. quadipy leverages RDFLib to structure RDF data pythonically.

The goal of this project is to enable transforming any tabular data structure into graph-based RDF data. We go into depth here on how we have used this config-based system to build out our internal knowledge graph at Vouch! You can also check out our talk at KGC'22 on YouTube: Modeling the startup ecosystem using a config based knowledge graph.

An example below shows what we mean by translating some tabular data into RDF graph data.

Table to Graph

For a step-by-step walk-through with more examples, visit the tutorials.

Dev Setup

Run the following to set up your dev environment:

make setup

Usage

from quadipy import GraphFormatConfig

config = GraphFormatConfig.parse_file("path/to/config.json")
quads = [config.quadify(record) for record in records] # records is an Iterator[Dict]

quadipy can work with a variety of data sources, as long as each record sent to the quadify method is a Dict.
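
For example, if your records live in a CSV file, the rows produced by csv.DictReader can be passed straight to quadify. A minimal sketch, assuming a config at examples/simple.json and a hypothetical people.csv:

import csv

from quadipy import GraphFormatConfig

config = GraphFormatConfig.parse_file("examples/simple.json")

with open("people.csv", newline="") as f:  # people.csv is a hypothetical source file
    records = csv.DictReader(f)  # yields one Dict per row
    quads = [config.quadify(record) for record in records]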

Each Quad created has a .to_tuple() method that converts it to a tuple, making it easy to work with RDFLib Graphs:

from rdflib import URIRef, Graph
from quadipy import Quad

quad = Quad(subject=URIRef("Alice"), predicate=URIRef("knows"), obj=URIRef("Bob"), graph=None)
g = Graph()
g.add(quad.to_tuple())  # the tuple form plugs straight into an RDFLib Graph
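
Once quads have been added, the usual RDFLib APIs apply. Continuing the snippet above, the graph can be serialized (N-Triples is just one choice of output format):

print(g.serialize(format="nt"))  # emit the graph as N-Triples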

Setting up a GraphFormatConfig

The main value of quadipy comes from the GraphFormatConfig, which takes in a few parameters to configure the transformation of your data into RDF graph format. We provide configuration examples in the examples directory to help you get started. The full list of fields that can be configured is described below:

  • source_name (required): A string used to describe the source (e.g. "wikipedia" for data from Wikipedia).
  • primary_key (required): The key in your data whose value is used as the subject of each fact.
  • predicate_mapping (required): A mapping whose keys are column names in your data source and whose values are nested dicts. Each nested dict requires a predicate_uri key mapped to the RDF predicate in the target location, plus an optional obj_datatype key mapped to a custom datatype (currently literal, uriref, or date are supported). If obj_datatype isn't specified, it defaults to literal.
  • subject_namespace (optional): A string prepended to each quad's subject as a namespace, instead of using the value of the primary_key alone. For example, with primary_key=123 and subject_namespace=wikipedia the subject generated would NOT be URIRef("123") but URIRef("wikipedia/123").
  • graph_namespace (optional): Similar to subject_namespace, this assigns each fact to a named graph using the graph_namespace. This is useful for storing metadata about fact provenance in named graphs.
  • date_field (optional): The column in your dataset that the fact's "date" is pulled from. When specified, the named graph field of each fact is built from the date. For example, if date_field=created_at and created_at='2021-01-01' in the source data, the graph field will be URIRef("2021-01-01"). This can work in conjunction with graph_namespace.
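
Putting those fields together, a config file might look like the sketch below. The field values and predicate URIs are purely illustrative and are not taken from the examples directory:

{
    "source_name": "wikipedia",
    "primary_key": "page_id",
    "subject_namespace": "wikipedia",
    "graph_namespace": "wikipedia",
    "date_field": "updated_at",
    "predicate_mapping": {
        "name": {"predicate_uri": "http://xmlns.com/foaf/0.1/name", "obj_datatype": "literal"},
        "homepage": {"predicate_uri": "http://xmlns.com/foaf/0.1/homepage", "obj_datatype": "uriref"}
    }
}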

Validate Config files

To make sure your config files are valid, run the CLI with:

quadipy validate {path}

where path can be either a directory containing config files (e.g. examples) or a single config file (e.g. examples/simple.json).

This command uses pydantic validators to make sure each config file is valid JSON, that the required fields are present, and that the predicate_uris are valid URIs ("valid" as defined by RDFLib here).
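
Because the config model is built on pydantic, the same validation can also be triggered from code rather than via the CLI. A minimal sketch, where broken.json is a hypothetical, intentionally invalid config:

from pydantic import ValidationError

from quadipy import GraphFormatConfig

try:
    GraphFormatConfig.parse_file("broken.json")  # hypothetical config with missing or invalid fields
except ValidationError as err:
    print(err)  # pydantic reports which fields failed validation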



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quadipy-0.2.3.15.tar.gz (9.1 kB)

Built Distribution

quadipy-0.2.3.15-py3-none-any.whl (9.9 kB)

File details

Details for the file quadipy-0.2.3.15.tar.gz.

File metadata

  • Download URL: quadipy-0.2.3.15.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for quadipy-0.2.3.15.tar.gz:

  • SHA256: 5498830e8bc6ada0a9ca7d8d8dfc0056c4d8b76a39ba04365207bb207658515a
  • MD5: 6bddd59406e68c8ea6207d2436813184
  • BLAKE2b-256: ffe5824c4eb455fd66bd47e8af7656463947af1a5954aa4412bd8f03357a8e48



File details

Details for the file quadipy-0.2.3.15-py3-none-any.whl.

File metadata

  • Download URL: quadipy-0.2.3.15-py3-none-any.whl
  • Upload date:
  • Size: 9.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for quadipy-0.2.3.15-py3-none-any.whl:

  • SHA256: 41871c01aaf7854d655d20350582dcd2ad8ce75b903a6b17acfdeccd945b58d1
  • MD5: a69b58bddf9d0f2d159795436a37d7af
  • BLAKE2b-256: ddce8fae446edb9197242d6a0a46ff71cdc6bd03e187bf5f1331c078cf020b2a


