
Cube for data input/output, import and export


Massive Store

The Massive Store is a CubicWeb (CW) store used to push massive amounts of data using pure SQL logic, thus bypassing CW checks. It is faster than other CW stores (it does not check the eid at each step, and it uses the COPY FROM method), but it is less safe (no data integrity checks), and it does not return an eid when the create_entity function is called.

WARNING: For now, this store can only be used with PostgreSQL, as it relies on the COPY FROM method and on specific PostgreSQL tables to get all the indexes.

It should be used as follows:
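
To give an idea of what relying on COPY FROM involves, the sketch below serializes rows held in memory into the tab-separated text format that PostgreSQL's COPY ... FROM STDIN accepts. It is an illustration only, not the store's actual code: `escape_copy_value`, `rows_to_copy_buffer` and the sample rows are hypothetical, and the escaping covers only the common cases (NULL, backslash, tab, newline, carriage return).

```python
import io


def escape_copy_value(value):
    """Render one value in PostgreSQL COPY text format."""
    if value is None:
        return "\\N"  # default NULL marker of the COPY text format
    text = str(value)
    # Escape the backslash first so later escapes are not escaped twice.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rows_to_copy_buffer(rows):
    """Serialize an iterable of tuples into a file-like COPY buffer."""
    buffer = io.StringIO()
    for row in rows:
        buffer.write("\t".join(escape_copy_value(v) for v in row))
        buffer.write("\n")
    buffer.seek(0)  # rewind so a consumer reads from the start
    return buffer


# Hypothetical (name, age) rows buffered in memory before a flush.
rows = [("Alice", 31), ("Bob\tJr", None)]
copy_buffer = rows_to_copy_buffer(rows)
print(copy_buffer.getvalue())
```

With a driver such as psycopg2, a buffer of this kind could then be handed to `cursor.copy_from`, which loads all rows in a single statement instead of one INSERT per row.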

# Initialize the store
store = MassiveObjectStore(session)
# Initialize the relation table
store.init_rtype_table('Person', 'lives', 'Location')

# Import logic
...
store.create_entity('Person', ...)
store.create_entity('Location', ...)

# Flush the data in memory to the SQL database
store.flush()

# Import logic
...
store.create_entity('Person', ...)
store.create_entity('Location', ...)
# person_iid and location_iid are unique iids that are data dependent (e.g. URI)
store.relate_by_iid(person_iid, 'lives', location_iid)
...

# Flush the data in memory to the SQL database
store.flush()

# Build the metadata
store.flush_meta_data()

# Convert the relation
store.convert_relations('Person', 'lives', 'Location')

# Clean the store / rebuild indexes
store.cleanup()

In this case, person_iid and location_iid (the subject and object iids of the relation) are unique identifiers (e.g. a URI, or an id from the imported database) that can be used to create relations after the entities have been imported.
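
Since create_entity does not return an eid, relations have to be recorded by iid and translated later. The in-memory sketch below illustrates that idea only: `ToyStore` is a hypothetical stand-in, not the MassiveObjectStore implementation (which works on PostgreSQL tables), and its method signatures are simplified for the example.

```python
import itertools


class ToyStore:
    """Hypothetical stand-in showing deferred, iid-based relations."""

    def __init__(self):
        self._eids = itertools.count(1)
        self.iid_to_eid = {}         # data-dependent iid -> internal eid
        self.pending_relations = []  # (subject iid, rtype, object iid)
        self.relations = []          # (subject eid, rtype, object eid)

    def create_entity(self, etype, iid, **attributes):
        # As with the Massive Store, no eid is returned to the caller.
        self.iid_to_eid[iid] = next(self._eids)

    def relate_by_iid(self, iid_subj, rtype, iid_obj):
        # Only iids are recorded; either end may not have been created yet.
        self.pending_relations.append((iid_subj, rtype, iid_obj))

    def convert_relations(self):
        # Once all entities are known, translate iids into eids.
        for iid_subj, rtype, iid_obj in self.pending_relations:
            self.relations.append(
                (self.iid_to_eid[iid_subj], rtype, self.iid_to_eid[iid_obj]))
        self.pending_relations = []


store = ToyStore()
# The relation is declared before either entity exists.
store.relate_by_iid("http://example.org/alice", "lives",
                    "http://example.org/paris")
store.create_entity("Person", "http://example.org/alice", name="Alice")
store.create_entity("Location", "http://example.org/paris", name="Paris")
store.convert_relations()
print(store.relations)
```

The point of the design is that import order stops mattering: relations can be pushed as soon as they are read, and the conversion to eids happens once at the end.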

RDF Store

The RDF Store is used to import RDF data into a CubicWeb database, based on a Yams <-> RDF schema conversion. The conversion rules are stored in an XY structure.

Building an XY structure

You have to create a file (usually called xy.py) in your cube, and import the dataio version of xy:

from cubes.dataio import xy

You have to register the different prefixes (common prefixes such as skos or foaf are already registered):

xy.register_prefix('diseasome', 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/')
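
A registered prefix lets the conversion rules refer to properties by a short name such as diseasome:name. The sketch below shows the general idea of prefix expansion with a plain dictionary. `PREFIXES`, `add_prefix`, `expand` and `shorten` are hypothetical helpers written for this example, not part of the dataio API; the skos and foaf namespaces are the standard published ones.

```python
# Hypothetical prefix table; dataio keeps its own registry.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def add_prefix(prefix, namespace):
    """Register a prefix in the local table."""
    PREFIXES[prefix] = namespace


def expand(qname):
    """Turn 'prefix:local' into a full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


def shorten(uri):
    """Turn a full URI into 'prefix:local' when a known namespace matches."""
    # Longest namespace first, so the most specific prefix wins.
    for prefix, namespace in sorted(PREFIXES.items(),
                                    key=lambda item: -len(item[1])):
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    return uri  # unknown namespace: leave the URI untouched


add_prefix("diseasome",
           "http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/")
print(expand("diseasome:name"))
print(shorten("http://xmlns.com/foaf/0.1/name"))
```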

By default, the entity type is based on the RDF property “rdf:type”, but you may change it using:

xy.register_rdf_etype_property('skos:inScheme')

It is also possible to give a specific callback to determine the entity type from the rdf properties:

def _rameau_etype_callback(rdf_properties):
    if 'skos:inScheme' in rdf_properties and 'skos:prefLabel' in rdf_properties:
        return 'Rameau'

xy.register_etype_callback(_rameau_etype_callback)

The URI is fetched from the “rdf:about” property, and can be normalized using a specific callback:

def normalize_uri(uri):
    if uri.endswith('.rdf'):
        return uri[:-4]
    return uri

xy.register_uri_conversion_callback(normalize_uri)
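
Because the normalization callback is a plain function, it can be checked on a few sample URIs before being registered. The function is repeated here so the snippet runs on its own; the URIs are made up for illustration.

```python
def normalize_uri(uri):
    if uri.endswith('.rdf'):
        return uri[:-4]
    return uri


samples = [
    "http://example.org/resource/42.rdf",
    "http://example.org/resource/42",
]
normalized = [normalize_uri(uri) for uri in samples]
# Both spellings collapse to the same identifier.
print(normalized)
# Applying the callback twice changes nothing more (it is idempotent here).
print(normalize_uri(normalized[0]) == normalized[0])
```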

Defining the conversion rules

Then, you may write the conversion rules:

  • xy.add_equivalence allows you to add a basic equivalence between entity types / attributes / relations and RDF properties. You may use “*” as a wildcard in the Yams part. E.g. for entity types:

    xy.add_equivalence('Gene', 'diseasome:genes')
    xy.add_equivalence('Disease', 'diseasome:diseases')

    E.g. for attributes:

    xy.add_equivalence('* name', 'diseasome:name')
    xy.add_equivalence('* label', 'rdfs:label')
    xy.add_equivalence('* label', 'diseasome:label')
    xy.add_equivalence('* class_degree', 'diseasome:classDegree')
    xy.add_equivalence('* size', 'diseasome:size')

    E.g. for relations:

    xy.add_equivalence('Disease close_match ExternalUri', 'diseasome:classes')
    xy.add_equivalence('Disease subtype_of Disease', 'diseasome:diseaseSubtypeOf')
    xy.add_equivalence('Disease associated_genes Gene', 'diseasome:associatedGene')
    xy.add_equivalence('Disease chromosomal_location ExternalUri', 'diseasome:chromosomalLocation')
    xy.add_equivalence('* sameas ExternalUri', 'owl:sameAs')
    xy.add_equivalence('Gene gene_id ExternalUri', 'diseasome:geneId')
    xy.add_equivalence('Gene bio2rdf_symbol ExternalUri', 'diseasome:bio2rdfSymbol')
  • A base URI can be given to automatically determine if a Resource should be considered as an external URI or an internal relation:

    xy.register_base_uri('http://www4.wiwiss.fu-berlin.de/diseasome/resource/')

    A more complex logic can be used by giving a specific callback:

    def externaluri_callback(uri):
        if uri.startswith('http://www4.wiwiss.fu-berlin.de/diseasome/resource/'):
            if uri.endswith('disease') or uri.endswith('gene'):
                return False
            return True
        return True
    
    xy.register_externaluri_callback(externaluri_callback)
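
The wildcard rules above can be read as pattern lookups. The sketch below shows one way such a lookup could work for attribute rules on import, going from an RDF property to a Yams attribute. It assumes that a rule naming an exact entity type takes priority over a “*” rule; that precedence, the names `RULES`, `add_rule` and `attribute_for`, and the 'Gene symbol' rule are all hypothetical and not taken from dataio.

```python
# RDF property -> list of (entity type pattern, attribute) rules.
RULES = {}


def add_rule(yams, rdf_property):
    """Record a rule written as '<etype or *> <attribute>'."""
    etype, attribute = yams.split()
    RULES.setdefault(rdf_property, []).append((etype, attribute))


def attribute_for(etype, rdf_property):
    """Return the Yams attribute fed by rdf_property for this entity type."""
    candidates = RULES.get(rdf_property, [])
    # Exact entity type first, then the '*' wildcard (assumed precedence).
    for pattern in (etype, "*"):
        for rule_etype, attribute in candidates:
            if rule_etype == pattern:
                return attribute
    return None


add_rule('* label', 'rdfs:label')
add_rule('* label', 'diseasome:label')
add_rule('* size', 'diseasome:size')
add_rule('Gene symbol', 'rdfs:label')  # hypothetical type-specific rule

print(attribute_for('Disease', 'rdfs:label'))
print(attribute_for('Gene', 'rdfs:label'))
```

Several RDF properties may feed the same attribute (both rdfs:label and diseasome:label map to label here), which is why the table is keyed by RDF property rather than by Yams name.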

Attribute values are built according to the Yams type, but you can use a specific callback to compute the correct values from the RDF properties:

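from datetime import datetime

def _convert_date(_object, datetime_format='%Y-%m-%d'):
    """ Convert an rdf value to a date """
    try:
        return datetime.strptime(_object.format(), datetime_format)
    except (ValueError, TypeError):
        # The value does not match the expected format
        return None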

xy.register_attribute_callback('Date', _convert_date)

or:

def format_isbn(rdf_properties):
    if 'bnf-onto:isbn' in rdf_properties:
        isbn = rdf_properties['bnf-onto:isbn'][0]
        isbn = [i for i in isbn if i in '0123456789']
        return int(''.join(isbn)) if isbn else None

xy.register_attribute_callback('Manifestation formatted_isbn', format_isbn)

Importing data

Data can then be imported using the “import-rdf” command of cubicweb-ctl:

cubicweb-ctl import-rdf <my-instance> <file-or-folder>

The default library used for reading the data is “rdflib”, but “librdf” can be selected with the “--lib” option.

It is also possible to force the RDF format with the “--rdf-format” option (the format is detected automatically, but detection may sometimes lead to errors).

Exporting data

The ‘rdf’ view can be called to create an RDF file from the result set. It is a modified version of the CubicWeb RDFView that takes into account the more complex conversion rules from the dataio cube. The format can also be forced (default is XML) using the “--format” option in the URL (xml, n3 or nt).

Examples

Examples of the dataio RDF import can be found in the nytimes and diseasome cubes.

Source Distribution

cubicweb-dataio-0.2.0.tar.gz (32.7 kB)
