Primary sources
The general pattern for converting a non-RDF document is: as a format is parsed, certain “events” fire, each of which is an opportunity to generate RDF triples based on the event and the current location in the document.
Each event indicates what has just been parsed by sending a tag, content, and a dictionary of attributes (only tag is certain to have a value). The particular tags and attributes will be specific to the format and data. For instance, the event that fires for a cell in a CSV file in the column “day” with value “Monday” would send the tag “cell”, the content “Monday”, and the attribute dictionary:
{"id" : "day"}
Along with the tag, content, and attributes, the event sends its location, which for most formats is the list of “parent” events the current event is happening under. If the above event occurred while processing the 22nd row of a file called “some_file.csv”, the location might be (ignore for the moment the “uid” entries):
[
{"tag" : "table", "content" : "", "attributes" : {"id" : "some_file.csv"}, "uid" : "43k2"},
{"tag" : "row", "content" : "", "attributes" : {"id" : "22"}, "uid" : "213j"}
]
No matter the format (CSV, XML, etc.), events have the same structure, and in fact the event just described might be this JSON object:
{
"tag" : "cell",
"content" : "Monday",
"attributes" : {"id" : "day"},
"location" : [
{"tag" : "table", "content" : "", "attributes" : {"id" : "some_file.csv"}, "uid" : "43k2"},
{"tag" : "row", "content" : "", "attributes" : {"id" : "22"}, "uid" : "213j"}
],
"uid" : "t98f"
}
Again, the possible values for tag will depend on the format (HTML won’t ever have a “row” tag, but might have “div”, “body”, etc), as will the attributes dictionary.
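The event stream for the CSV case can be sketched in Python as follows. This is only an illustration of the structure described above, not OCHRE's parser: the function name csv_events is hypothetical, random UUIDs stand in for whatever identifier scheme OCHRE actually uses, and numbering rows from 1 is an assumption.

```python
import csv
import io
import uuid


def csv_events(text, name):
    """Yield one event dict per table, row, and cell of a CSV document.

    Hypothetical helper: mirrors the event structure described in the text.
    """
    table = {"tag": "table", "content": "", "attributes": {"id": name},
             "uid": uuid.uuid4().hex}  # stand-in uid, not OCHRE's scheme
    yield {**table, "location": []}
    reader = csv.DictReader(io.StringIO(text))
    for number, record in enumerate(reader, start=1):  # 1-based is assumed
        row = {"tag": "row", "content": "", "attributes": {"id": str(number)},
               "uid": uuid.uuid4().hex}
        yield {**row, "location": [table]}
        for column, value in record.items():
            yield {
                "tag": "cell",
                "content": value,
                "attributes": {"id": column},
                "location": [table, row],  # parents, outermost first
                "uid": uuid.uuid4().hex,
            }


events = list(csv_events("day,count\nMonday,3\n", "some_file.csv"))
```

Each cell event carries its enclosing table and row in "location", so a rule can see both what was parsed and where.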
The goal is to decide what RDF triples to generate when seeing an event. This involves specifying rules, each with two portions: 1) a match portion that can be tested against an event to determine whether the rule applies, and 2) a creation portion that describes the RDF triple(s) to create from the event. Here is an example of a match portion:
{
"tag" : ["cell"],
"attributes" : {"id" : ["day"]}
}
Note how it constrains the tag and the attribute “id” by giving a list of acceptable values.
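One plausible reading of this matching logic, sketched in Python: a rule matches when the event's tag is among the accepted tags and every constrained attribute has an accepted value. The function name matches is hypothetical, and treating unconstrained fields as always acceptable is an assumption.

```python
def matches(rule, event):
    """Return True if every constraint in the match portion accepts the event."""
    # A missing "tag" constraint is assumed to accept any tag.
    if "tag" in rule and event.get("tag") not in rule["tag"]:
        return False
    attributes = event.get("attributes", {})
    # Each constrained attribute must be present with an accepted value.
    for name, accepted in rule.get("attributes", {}).items():
        if attributes.get(name) not in accepted:
            return False
    return True


rule = {"tag": ["cell"], "attributes": {"id": ["day"]}}
day_cell = {"tag": "cell", "content": "Monday", "attributes": {"id": "day"}}
other_cell = {"tag": "cell", "content": "3", "attributes": {"id": "count"}}
```

Here matches(rule, day_cell) is true, while matches(rule, other_cell) is false because "count" is not in the accepted list for "id".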
Here is an example of a creation portion, with some placeholders for readability, that creates the two triples (S, P1, O1) and (S, P2, O2) when its rule matches:
{
"subject" : S,
"predicate_objects" : [
{
"predicate" : P1,
"object" : O1
},
{
"predicate" : P2,
"object" : O2
}
]
}
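The expansion of a creation portion into triples can be sketched in Python as below. The function name expand is hypothetical, and plain strings stand in for the S, P and O placeholders.

```python
def expand(creation):
    """Pair the shared subject with each predicate/object entry."""
    subject = creation["subject"]
    return [
        (subject, entry["predicate"], entry["object"])
        for entry in creation["predicate_objects"]
    ]


creation = {
    "subject": "S",
    "predicate_objects": [
        {"predicate": "P1", "object": "O1"},
        {"predicate": "P2", "object": "O2"},
    ],
}
result = expand(creation)
```

The subject is stated once and shared, so each entry in "predicate_objects" contributes exactly one triple.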
The placeholders are a bit more interesting: they tell OCHRE how to create an RDF node based on the event. Here is an example that creates an RDF integer literal node that doesn’t depend at all on the event:
{
"type" : "literal",
"datatype" : "integer",
"value" : "27"
}
Here is an example that also creates an integer literal node, but based on the event:
{
"type" : "literal",
"datatype" : "integer",
"value" : "{content}"
}
In the CSV example, if the rule were matching rows, this would correspond to the row number. This curly-braces interpolation can also be used to refer to attributes and locations in the event, and mixed arbitrarily with bare strings, allowing the extraction of fairly sophisticated patterns.
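A minimal sketch of this interpolation, assuming it behaves like Python's str.format over a few event fields. The function name build_node is hypothetical, and only the simple "tag" and "content" fields are handled here; attribute and location lookups are left out because their exact syntax in OCHRE is not reproduced by this sketch.

```python
def build_node(spec, event):
    """Copy a node spec, filling {placeholders} in its value from the event."""
    # Only simple fields are exposed in this sketch (an assumption).
    fields = {"tag": event["tag"], "content": event.get("content", "")}
    return {**spec, "value": spec["value"].format(**fields)}


event = {"tag": "cell", "content": "3", "attributes": {"id": "count"}}
constant = build_node(
    {"type": "literal", "datatype": "integer", "value": "27"}, event)
from_event = build_node(
    {"type": "literal", "datatype": "integer", "value": "{content}"}, event)
mixed = build_node(
    {"type": "literal", "datatype": "string", "value": "{tag}-{content}"}, event)
```

A value without braces passes through unchanged, which is why the constant and event-dependent cases can share one code path.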
Here is an example that creates a URI node, directly specifying the Wikidata entry for “photograph”:
{
"type" : "uri",
"value" : "wd:Q125191"
}
Importantly, most entities in a primary source will not have a clear corresponding entity in Wikidata (e.g. there may be a long list of photos, so the above example is useful for saying “this is an instance of a photo”, but not for referring to this or that specific photo). To handle this, every time an event occurs, OCHRE creates a unique identifier based on the event. This unique identifier is the “uid” seen in the full event example above, and can be interpolated as needed to derive unique URIs. For instance:
{
"type" : "uri",
"value" : "ochre:{uid}"
}
is an entity in the OCHRE namespace corresponding to the particular event being processed.
Finally, OCHRE keeps track of the sequential number of each tag value within one tier of the input, and this number can be interpolated with “index”. For example, if the input involves processing sentences, each of which is a sequence of words, the string “{index}” within a word-rule will give the current word’s number within its sentence, starting from 0.
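The numbering can be sketched in Python with one counter per tag. This is one interpretation of "within one tier" (events sharing the same parent), and the function name with_index is hypothetical.

```python
from collections import Counter


def with_index(siblings):
    """Attach to each event its 0-based position among same-tag siblings."""
    seen = Counter()
    indexed = []
    for event in siblings:
        indexed.append({**event, "index": seen[event["tag"]]})
        seen[event["tag"]] += 1
    return indexed


sentence = [
    {"tag": "word", "content": "the"},
    {"tag": "punct", "content": ","},
    {"tag": "word", "content": "cat"},
]
numbered = with_index(sentence)
```

Under this reading each tag is counted separately, so the second word gets index 1 even though a "punct" event sits between the two words.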
Materials
The mechanisms described above are for generating RDF. There is also the need to connect parts of RDF to materials: larger files that don’t belong directly in the RDF graph, such as JPGs, audio recordings, and long documents. To accomplish this, there is special information that can be added to an entry in a “predicate_objects” list:
{
"predicate" : P,
"object" : O,
"file" : "path/some_file_{attributes['name']}.jpg",
"file_type" : "image/jpeg"
}
When the pyochre.primary_sources script encounters a “file” like this, it looks for it on the local filesystem. If the file is found, the script creates a unique identifier I based on the file’s contents and adds an additional RDF triple linking I to the object of that predicate_objects entry (roughly, (O, hasMaterialId, I)), indicating “the entity O has an associated file identified with the id I”. Then, after OCHRE creates the RDF graph, it also uploads all such files in the appropriate fashion.
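A content-based identifier of this kind can be sketched in Python with a hash of the file's bytes. SHA-256 is an assumption (the text does not say which scheme pyochre uses), and the names material_id and material_triple are hypothetical.

```python
import hashlib
import os
import tempfile


def material_id(path):
    """Return a hex SHA-256 digest of the file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def material_triple(obj, path):
    """Return (obj, "hasMaterialId", id), or None if the file is missing."""
    if not os.path.isfile(path):
        return None
    return (obj, "hasMaterialId", material_id(path))


# Usage with a throwaway file standing in for a real material.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.jpg")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    found = material_triple("O", sample)
    missing = material_triple("O", os.path.join(folder, "absent.jpg"))
```

Deriving the identifier from the contents means the same file always maps to the same id, regardless of its name or location.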