Skip to main content

No project description provided

Project description

rdfdf

pipeline status PyPI version

rdfdf - Functionality for rule-based pandas.DataFrame - rdflib.Graph conversion.

For representation of tabular data in RDF see Allemang, Hendler: Semantic Web for the Working Ontologist. 2011, 40ff.

This project is in an early stage of development and should be used with caution.

Requirements

  • python >= 3.10

Usage

For now rdfdf provides a DFGraphConverter class for rule-based pandas.DataFrame to rdflib.Graph conversion.

A GraphDFConverter class for rule-based rdflib.Graph to pandas.DataFrame conversion will be implemented shortly.

DFGraphConverter

Unlike rdfpandas which requires URIRefs as column headers (and otherwise just creates invalid RDF with e.g. literals as predicates), rdfdf computes URIRefs (or Literals for triple objects) based on rules.

DFGraphConverter iterates over a dataframe and constructs RDF triples by constructing a generator of subgraphs ('field graphs') and then merging all subgraphs with an rdflib.Graph component.

Subgraphs are generated by

  • for every row
    • for every rule in column_rules
      • looking up the column_rules key for the current row and calling the corresponding column_rules value.

column_rules values must be callables which are responsible for generating and returning a graph for merging. Note that rules actually don't need to return an instance of rdflib.Graph (e.g. if a rule just accesses __store__; see the state sharing example below), in which case the result is skipped in the generator.

column_rules callables are of arity 0 but have access to subject and object values from the table anaphorically as __subject__ and __object__ respectively (see examples below).

Note that also __store__ is available in rules; __store__ references the class level attribute store (a dictionary), this makes state sharing between rules and DFGraphConverter instances possible.

Parameters:

  • dataframe: A pandas.DataFrame to be converted.

  • subject_column: Selects a table column by name to be regarded as the column of triple subjects.

  • subject_rule: Optional; either a Callable[[str], URIRef] or an rdflib.Namespace which gets applied to every field of the subject_column; if supplied, __subject__ in the column_rules will be what subject_rule computes it to be; otherwise __subject__ will be just the raw field value of the current subject_column and must be handled manually in order to be become a valid triple subject (i.e. a URIRef).

  • column_rules: A mapping of column names to callables responsible for creating subgraphs ('field graphs'). As mentioned these callables have access to __subject__ and __object__ references.

  • graph: Optional; allows to set the internal rdflib.Graph component.

Examples:

See examples.

A slightly more involved example:

Example data:

"id";"full_title"
"rem";"Reference corpus Middle High German"

Desired output:

<https://rem.clscor.io/entity/corpus> crm:P1_is_identified_by <https://rem.clscor.io/entity/corpus/title/full> . 

<https://rem.clscor.io/entity/corpus/title/full> a crm:E41_Appellation ; 
    crm:P1i_identifies <https://rem.clscor.io/entity/corpus> ; 
    crm:P2_has_type <https://core.clscor.io/entity/type/title/full> ; 
    crm:190_has_symbolic_content "Reference corpus Middle High German" .

/examples/example_3.py:

import pandas as pd

from rdflib import URIRef, Graph, Namespace, Literal
from rdflib.namespace import RDF

from rdfdf import DFGraphConverter


CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")

table = [
    {
        "id": "rem",
        "full_title": "Reference corpus Middle High German"
    }
]

df = pd.DataFrame(data=table)


def full_title_rule():
    
    graph = Graph()
    subject_uri = URIRef(f"https://{__subject__}.clscor.io/entity/corpus/title/full")

    triples = [
        (
            subject_uri,
            RDF.type,
            CRM.E41_Appellation
        ),
        (
            subject_uri,
            CRM.P1_identifies,
            URIRef(f"https://{__subject__}.clscor.io/entity/corpus")
        ),
        # inverse
        (
            URIRef(f"https://{__subject__}.clscor.io/entity/corpus"),
            CRM.P1_is_identified_by,
            subject_uri
        ),
        (
            subject_uri,
            CRM.P2_has_type,
            URIRef("https://core.clscor.io/entity/type/title/full")
        ),
        (
            subject_uri,
            URIRef("http://www.cidoc-crm.org/cidoc-crm/190_has_symbolic_content"),
            Literal(__object__)
        ),
    ]

    for triple in triples:
        graph.add(triple)

    return graph

    
column_rules = {
    "full_title": full_title_rule
}

dfgraph = DFGraphConverter(
    dataframe=df,
    subject_column="id",
    column_rules=column_rules,
)

graph = dfgraph.to_graph()
print(graph.serialize(format="ttl"))

Note that the rule is rather verbose and could probably be expressed more concisely with Python ontology abstractions like pydantic-cidoc-crm.

[todo: express above rule more concisely]

State sharing example

As mentioned, state can be shared between rules via the __store__ binding and also between DFGraphConverter instances via the store class level attribute (which __store__ actually references).

The following example constructs an RDF literal from multiple table fields:

Test data:

"Name";"Address";"Place";"Country";"Age";"Hobby";"Favourite Colour" 
"John";"Dam 52";"Amsterdam";"The Netherlands";"32";"Fishing";"Blue"
"Jenny";"Leidseplein 2";"Amsterdam";"The Netherlands";"12";"Dancing";"Mauve"
"Jill";"52W Street 5";"Amsterdam";"United States of America";"28";"Carpentry";"Cyan"
"Jake";"12E Street 98";"Amsterdam";"United States of America";"42";"Ballet";"Purple"

/examples/example_4.py:

import pandas as pd
from rdfdf import DFGraphConverter
from rdflib import Namespace, Literal, Graph, URIRef
from rdflib.namespace import FOAF, RDF

example_ns = Namespace("http://example.org/")

def name_rule():
    graph = Graph()
    
    graph.add((__subject__, RDF.type, FOAF.Person)) \
         .add((__subject__, FOAF.name, Literal(__object__)))
    
    return graph

def age_rule():
    graph = Graph()
    graph.add((__subject__, example_ns.age, Literal(__object__)))

    return graph


def address_rule():
    __store__["address"] = __object__
    
def place_rule():
    __store__["place"] = __object__
    
def full_address_rule():
    _full_address = (
        f"{__store__['address']}, "
        f"{__store__['place']}, "
        f"{__object__}"
    )

    graph = Graph()
    graph.add((__subject__, example_ns.fullAddress, Literal(_full_address)))

    return graph
    

test_column_rules = {
    "Name": name_rule,
    "Age": age_rule,
    "Address": address_rule,
    "Place": place_rule,
    "Country": full_address_rule
}

df = pd.read_csv("../tests/test_data/test.csv", sep=";")

dfgraph = DFGraphConverter(
    dataframe=df,
    subject_column="Name",
    subject_rule=example_ns,
    column_rules=test_column_rules,
)

graph = dfgraph.to_graph()
print(graph.serialize(format="ttl"))

Output

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:Jake a foaf:Person ;
    ns1:age 42 ;
    ns1:fullAddress "12E Street 98, Amsterdam, United States of America" ;
    foaf:name "Jake" .

ns1:Jenny a foaf:Person ;
    ns1:age 12 ;
    ns1:fullAddress "Leidseplein 2, Amsterdam, The Netherlands" ;
    foaf:name "Jenny" .

ns1:Jill a foaf:Person ;
    ns1:age 28 ;
    ns1:fullAddress "52W Street 5, Amsterdam, United States of America" ;
    foaf:name "Jill" .

ns1:John a foaf:Person ;
    ns1:age 32 ;
    ns1:fullAddress "Dam 52, Amsterdam, The Netherlands" ;
    foaf:name "John" .

Note that although address_rule and place_rule do not return graph instances but merely set up __store__, they obviously still must be connected to a table field in the column_rules mapping.

GraphDFConverter

[todo]

Contribution

Please feel free to open issues or pull requests.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rdfdf-0.1.6.tar.gz (20.7 kB view hashes)

Uploaded Source

Built Distribution

rdfdf-0.1.6-py3-none-any.whl (22.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page