Python functions for working with CSV on the Web datasets

csvw_functions

Python implementation of the CSV on the Web (CSVW) standards.

Contents

About | Installation | Issues, Questions? | Quick Start | API | Developer Notes

About

This is a Python package which implements the following W3C standards:

These standards together comprise the CSV on the Web (CSVW) standards.

Further information on CSVW is available from:

The package is written as pure Python and passes all the required tests in the CSVW Test Suite:

  • CSVW JSON tests: passes 270 / 270
  • CSVW RDF tests: passes 270 / 270
  • CSVW Validation tests: passes 282 / 282

Installation

Source code available on GitHub here: https://github.com/stevenkfirth/csvw_functions

Install from PyPI using the command: pip install csvw_functions

Issues, Questions?

The CSVW standards represent a complex set of operations and there are likely to be a number of situations where things don't work as expected. When this happens...

Raise an issue on GitHub: https://github.com/stevenkfirth/csvw_functions/issues

Email the author: Steven Firth, s.k.firth@lboro.ac.uk

Quick Start

Access embedded metadata from CSV file

Let's say we have a CSV file with the contents...

# countries.csv
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

...and we'd like to extract information from the column headers in the form of a CSVW metadata JSON object. We would use the get_embedded_metadata function:

>>> import csvw_functions
>>> embedded_metadata = csvw_functions.get_embedded_metadata(
        'countries.csv',
        relative_path=True  # this means that the `url` property will contain a relative file path
        )
>>> print(embedded_metadata)
{
    "@context": "http://www.w3.org/ns/csvw",
    "tableSchema": {
        "columns": [
            {
                "titles": {
                    "und": [
                        "country"
                    ]
                },
                "name": "country"
            },
            {
                "titles": {
                    "und": [
                        "country group"
                    ]
                },
                "name": "country%20group"
            },
            {
                "titles": {
                    "und": [
                        "name (en)"
                    ]
                },
                "name": "name%20%28en%29"
            },
            {
                "titles": {
                    "und": [
                        "name (fr)"
                    ]
                },
                "name": "name%20%28fr%29"
            },
            {
                "titles": {
                    "und": [
                        "name (de)"
                    ]
                },
                "name": "name%20%28de%29"
            },
            {
                "titles": {
                    "und": [
                        "latitude"
                    ]
                },
                "name": "latitude"
            },
            {
                "titles": {
                    "und": [
                        "longitude"
                    ]
                },
                "name": "longitude"
            }
        ]
    },
    "url": "countries.csv"
}

(This example is taken from Section 1.3 of the CSVW Primer: https://www.w3.org/TR/tabular-data-primer/#column-info. Note the differences here including the addition of the name property and the titles property given as a list of undefined ('und') language strings.)
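To make the relationship between the header row and the output above concrete, here is a minimal sketch using only the standard library. It is not the package's implementation: the helper name embedded_metadata_from_header is hypothetical, and it only handles the simple case of a single header row. It shows that each name value is the percent-encoded form of the column title.

```python
import csv
import io
from urllib.parse import quote


def embedded_metadata_from_header(csv_text, url):
    """Build a CSVW-style metadata dict from the first row of csv_text.

    Hypothetical helper for illustration only; csvw_functions does much more.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = [
        {
            # Titles are stored under the undefined ('und') language tag.
            "titles": {"und": [title]},
            # Percent-encode everything except unreserved characters.
            "name": quote(title, safe=""),
        }
        for title in header
    ]
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "tableSchema": {"columns": columns},
        "url": url,
    }


sample = '"country","country group","name (en)"\n"at","eu","Austria"\n'
metadata = embedded_metadata_from_header(sample, "countries.csv")
names = [column["name"] for column in metadata["tableSchema"]["columns"]]
print(names)
```

The space becomes %20 and the parentheses become %28 and %29, which matches the name values in the output above.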

Convert CSVW file to JSON-LD

Let's say we have a CSVW metadata JSON file which references the countries.csv file...

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

... and we'd like to convert this to a dictionary in the form of JSON-LD data. Here we would use the create_annotated_table_group and create_json_ld functions:

>>> import csvw_functions
>>> annotated_table_group_dict = csvw_functions.create_annotated_table_group(
        input_file_path_or_url = 'countries-metadata.json'
        )
>>> json_ld = csvw_functions.create_json_ld(
        annotated_table_group_dict,
        mode='minimal'
        )
>>> print(json_ld)
[
    {
        "country": "at",
        "country group": "eu",
        "name (en)": "Austria",
        "name (fr)": "Autriche",
        "name (de)": "\u00d6sterreich",
        "latitude": 47.6965545,
        "longitude": 13.34598005
    },
    {
        "country": "be",
        "country group": "eu",
        "name (en)": "Belgium",
        "name (fr)": "Belgique",
        "name (de)": "Belgien",
        "latitude": 50.501045,
        "longitude": 4.47667405
    },
    {
        "country": "bg",
        "country group": "eu",
        "name (en)": "Bulgaria",
        "name (fr)": "Bulgarie",
        "name (de)": "Bulgarien",
        "latitude": 42.72567375,
        "longitude": 25.4823218
    }
]

(This example is taken from Section 4.2 of the CSVW Primer: https://www.w3.org/TR/tabular-data-primer/#transformation-values. Note here that the 'Ö' character is written as its Unicode escape sequence, \u00d6.)
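The escaped character is a property of how JSON is serialised, not a change to the data. The sketch below uses only the standard json library and a hand-written record (an assumption standing in for one row of the output above) to show both forms.

```python
import json

# A hand-written stand-in for one row of the JSON-LD output.
record = {"name (de)": "Österreich"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the original characters in the output text.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings load back to the same Python value, so the choice only affects how the saved file looks.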

Convert CSVW file to RDF

Let's say we have the CSVW metadata JSON file and CSV file from the previous example, and we'd like to convert these to RDF data in Turtle notation. Now we would use the create_annotated_table_group and create_rdf functions:

>>> import csvw_functions
>>> from rdflib import Graph
>>> annotated_table_group_dict = csvw_functions.create_annotated_table_group(
        input_file_path_or_url = 'countries-metadata.json'
        )
>>> rdf_ntriples = csvw_functions.create_rdf( 
        annotated_table_group_dict,
        mode = 'minimal',
        local_path_replacement_url='http://example.org'  # use in place of the local file path.
        )
>>> rdf_ttl = Graph().parse(data = rdf_ntriples, format='ntriples').serialize(format = "ttl")
>>> print(rdf_ttl)  
@prefix ns1: <http://example.org/countries.csv#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] ns1:country "bg"^^xsd:string ;
    ns1:country%20group "eu"^^xsd:string ;
    ns1:latitude 4.272567e+01 ;
    ns1:longitude 2.548232e+01 ;
    ns1:name%20%28de%29 "Bulgarien"@de ;
    ns1:name%20%28en%29 "Bulgaria"@en ;
    ns1:name%20%28fr%29 "Bulgarie"@fr .

[] ns1:country "be"^^xsd:string ;
    ns1:country%20group "eu"^^xsd:string ;
    ns1:latitude 5.050104e+01 ;
    ns1:longitude 4.476674e+00 ;
    ns1:name%20%28de%29 "Belgien"@de ;
    ns1:name%20%28en%29 "Belgium"@en ;
    ns1:name%20%28fr%29 "Belgique"@fr .

[] ns1:country "at"^^xsd:string ;
    ns1:country%20group "eu"^^xsd:string ;
    ns1:latitude 4.769655e+01 ;
    ns1:longitude 1.334598e+01 ;
    ns1:name%20%28de%29 "Österreich"@de ;
    ns1:name%20%28en%29 "Austria"@en ;
    ns1:name%20%28fr%29 "Autriche"@fr .

(This example is taken from Section 4.2 of the CSVW Primer: https://www.w3.org/TR/tabular-data-primer/#transformation-values. Note here that 'http://example.org' is used as a sample namespace for the predicates.)
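The predicate names in the Turtle output can be read as the table URL, a '#' separator and the percent-encoded column name. The following sketch reproduces that pattern with the standard library; the helper name predicate_uri is hypothetical and this is an illustration of the output shown above, not the package's code.

```python
from urllib.parse import quote


def predicate_uri(table_url, column_title):
    """Join a table URL and a percent-encoded column title with '#'."""
    return f"{table_url}#{quote(column_title, safe='')}"


uri = predicate_uri("http://example.org/countries.csv", "name (de)")
print(uri)
```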

API

get_embedded_metadata

Description: This function reads a CSV file and returns any embedded metadata extracted from the CSV file. This is a useful thing to do if you only have a CSV file and want to create an initial version of its CSVW metadata JSON object. The standard approach to extracting metadata is described in Section 8. Parsing Tabular Data of the Model for Tabular Data and Metadata on the Web standard.

Call signature:

csvw_functions.get_embedded_metadata(
        input_file_path_or_url,
        relative_path=False,
        nrows=None,
        parse_tabular_data_function=parse_tabular_data_from_text_non_normative_definition,
        )

Arguments:

  • input_file_path_or_url (str): This argument is passed to the create_annotated_table_group function.
  • relative_path (bool): OPTIONAL. If True, then any absolute file paths in the returned dictionary are replaced by relative file paths. Only applicable if the CSV file is given as a file path (not a URL). Default is False.
  • nrows (int or None): OPTIONAL. This argument is passed to the create_annotated_table_group function.
  • parse_tabular_data_function (Python function): OPTIONAL. This argument is passed to the create_annotated_table_group function.

Returns: The embedded metadata of a CSV file in the form of a CSVW metadata JSON object.

Return type: dict
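Since the return value is a plain dictionary, an initial metadata file can be written with the standard json library. This is a minimal sketch: the helper name save_metadata is hypothetical, and the small dictionary below is a hand-written stand-in for the value returned by get_embedded_metadata.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(metadata, path):
    """Write a metadata dictionary to path as indented UTF-8 JSON."""
    text = json.dumps(metadata, indent=4, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


# Stand-in for: metadata = csvw_functions.get_embedded_metadata('countries.csv')
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tableSchema": {"columns": [{"titles": {"und": ["country"]}, "name": "country"}]},
    "url": "countries.csv",
}

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "countries-metadata.json"
    save_metadata(metadata, target)
    # Read the file back to confirm that it round-trips.
    reloaded = json.loads(target.read_text(encoding="utf-8"))

print(reloaded == metadata)
```

The saved file can then be edited by hand, for example to add datatype or lang properties to the columns.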

create_annotated_table_group

Description: This function reads either a CSVW metadata file or a CSV file with no metadata and converts it to an Annotated Tabular Data Model as defined in Section 4. Tabular Data Models of the Model for Tabular Data and Metadata on the Web standard. In essence this function combines the data from the CSVW metadata file and the CSV file into a single object (here a Python dictionary) and checks for errors as this is done.

Call signature:

csvw_functions.create_annotated_table_group(
        input_file_path_or_url,
        overriding_metadata_file_path_or_url=None,
        validate=False,
        parse_tabular_data_function=parse_tabular_data_from_text_non_normative_definition,
        _link_header=None,
        _well_known_text=None,
        _save_intermediate_and_final_outputs_to_file=False,
        _print_intermediate_outputs=False
        )

Arguments:

  • input_file_path_or_url (str): The relative file path, absolute file path or url to either a CSVW metadata document or a CSV file.
  • overriding_metadata_file_path_or_url (str): OPTIONAL. The relative file path, absolute file path or url to a metadata.json file to be used as Overriding Metadata as described in Section 5.1: Overriding Metadata of the Model for Tabular Data and Metadata on the Web standard.
  • validate (bool): OPTIONAL. If True then the process is run as a validator and any validation errors will be raised.
  • parse_tabular_data_function (Python function): OPTIONAL. This is the Python function which is used to parse the CSV file. In the csvw_functions package, the method described in Section 8. Parsing Tabular Data is implemented as a Python function named parse_tabular_data_from_text_non_normative_definition. This is used as the default method. However users could create their own parsing functions, say for an unusually formed CSV file format, and pass this function in this keyword argument instead.
  • _link_header (str): USED FOR TESTING. Provides link header text which would normally be provided through an HTTP request.
  • _well_known_text (str): USED FOR TESTING. Provides well known text which would normally be provided through an HTTP request.
  • _save_intermediate_and_final_outputs_to_file (bool): USED FOR TESTING. Writes a number of files which are generated during the process, such as the embedded metadata file, the normalised metadata file etc.
  • _print_intermediate_outputs (bool): USED FOR TESTING. Prints intermediate outputs which occur during the process.

Returns: A Python dictionary containing the annotated table group, with a structure following the definition in Section 4. Tabular Data Models of the Model for Tabular Data and Metadata on the Web standard. Note that this dictionary can be difficult to view using standard methods, so please use the display_annotated_table_group_dict function. The reason is that the annotated table group dictionary is self-referring and potentially recursive when viewed: for example, the 'table' item in a 'column' points back to the entire table which the column belongs to (which in turn contains the original column...). The self-reference in the output dictionary is useful when navigating 'up or down' the various items but makes it difficult to print out.

Return type: dict
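The self-referring structure can be illustrated with a toy dictionary. In the sketch below the 'table' and 'columns' keys follow the description above, but the data is hand-written and the helper strip_back_references is hypothetical; it is not how display_annotated_table_group_dict is implemented.

```python
import json


def strip_back_references(obj, _ancestors=()):
    """Copy obj, replacing any reference to an enclosing container with a placeholder."""
    if isinstance(obj, (dict, list)):
        if any(obj is ancestor for ancestor in _ancestors):
            return "<back-reference>"
        chain = _ancestors + (obj,)
        if isinstance(obj, dict):
            return {key: strip_back_references(value, chain) for key, value in obj.items()}
        return [strip_back_references(value, chain) for value in obj]
    return obj


# A toy table whose column points back to the table that contains it.
table = {"url": "countries.csv", "columns": []}
column = {"name": "country", "table": table}
table["columns"].append(column)

try:
    json.dumps(table)
    circular = False
except ValueError:
    # The json library refuses to serialise circular structures.
    circular = True

printable = strip_back_references(table)
print(circular)
print(json.dumps(printable, indent=4))
```

The original dictionary is left untouched, so the back-references remain available for navigation.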

display_annotated_table_group_dict

csvw_functions.display_annotated_table_group_dict(
        annotated_table_group_dict
        )

Description: This function returns a version of an annotated_table_group_dict dictionary with the self-references removed, so that it can be easily viewed and/or printed.

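Arguments:

  • annotated_table_group_dict (dict): This is the output of the create_annotated_table_group function.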

Returns: See description.

Return type: dict

get_errors

csvw_functions.get_errors(
        annotated_table_group_dict
        )

Description: This function returns a list of the cell errors present in an annotated_table_group_dict dictionary.

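Arguments:

  • annotated_table_group_dict (dict): This is the output of the create_annotated_table_group function.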

Returns: See description.

Return type: list

create_json_ld

Description: This function converts an annotated table group object to JSON-LD format. This follows the approach given in the Generating JSON from Tabular Data on the Web standard.

Call signature:

csvw_functions.create_json_ld(
        annotated_table_group_dict,
        mode='standard',
        local_path_replacement_url=None,
        _replace_strings=None  
        )

Arguments:

  • annotated_table_group_dict (dict): This is the output of the create_annotated_table_group function.
  • mode (str): If 'standard' then the conversion is run in standard mode. If 'minimal' then the conversion is run in minimal mode. See the Generating JSON from Tabular Data on the Web standard for details of standard vs. minimal mode. If neither 'standard' nor 'minimal' then an error is raised.
  • local_path_replacement_url (str or None): If not None then any local file paths are converted to the string provided. This is useful for testing purposes.
  • _replace_strings (list): USED FOR TESTING. A list of 2-item tuples of string replacements for the output json object.

Returns: The result of the conversion to the JSON-LD format. This is a dictionary and can be saved to a file using the json library.

Return type: dict

create_rdf

Description: This function converts an annotated table group object to RDF format. This follows the approach given in the Generating RDF from Tabular Data on the Web standard.

Call signature:

csvw_functions.create_rdf(
        annotated_table_group_dict,
        mode='standard',
        local_path_replacement_url=None
        )

Arguments:

  • annotated_table_group_dict (dict): This is the output of the create_annotated_table_group function.
  • mode (str): If 'standard' then the conversion is run in standard mode. If 'minimal' then the conversion is run in minimal mode. See the Generating RDF from Tabular Data on the Web standard for details of standard vs. minimal mode. If neither 'standard' nor 'minimal' then an error is raised.
  • local_path_replacement_url (str or None): If not None then any local file paths are converted to the string provided. This is useful for testing purposes.

Returns: The result of the conversion to the RDF format. This is a string of RDF N-Triples, which can be saved to a file as needed. The N-Triples can be converted to another format (such as Turtle) using a dedicated RDF package such as RDFLib.

Return type: str

CSVWError

An exception, likely raised for a major error, or for a validation error if running in validation mode.

CSVWWarning

A warning, likely raised for a validation error if not running in validation mode.
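If the warnings are issued through Python's standard warnings machinery (an assumption, not confirmed above), they can be collected and inspected rather than only printed. The sketch below uses a hypothetical stand-in class, DemoCSVWWarning, and a stand-in function in place of a real csvw_functions call so that it runs without the package.

```python
import warnings


class DemoCSVWWarning(UserWarning):
    """Hypothetical stand-in for the package's CSVWWarning."""


def convert_with_possible_warning():
    """Stand-in for a call such as create_annotated_table_group(...)."""
    warnings.warn("cell value is not a valid number", DemoCSVWWarning)
    return {"tables": []}


with warnings.catch_warnings(record=True) as caught:
    # Make sure no warning is suppressed by an existing filter.
    warnings.simplefilter("always")
    result = convert_with_possible_warning()

# Keep only the messages from the warning class of interest.
messages = [str(w.message) for w in caught if issubclass(w.category, DemoCSVWWarning)]
print(messages)
```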

Developer Notes

  • The package is written as a series of functions rather than classes to promote reuse and because the CSVW standards are largely about transferring files from one format to another.
  • The code is all contained in a single file 'csvw_functions.py'. This is a large file of 14,000+ lines so needs a suitable IDE to navigate it. I use Spyder (part of the Anaconda distribution) which provides an automated outline view (like a table of contents) to enable navigating between different sections of the code.
  • The tests are also in a single file 'test_csvw_functions.py'. To run the tests, the CSVW Test Suite will need to be downloaded separately. This isn't included on GitHub due to its size.
