A simple package that supports data cataloging in a spreadsheet and publishes the data catalog as Markdown on git(hub)

SimpleMDDataCatalog

A simple data catalog based on DCAT that generates Markdown to be published as a static website using frameworks like GitHub Pages, Hugo or Jekyll.

Motivation

Data catalogs are powerful tools for managing data, whether for a small project or a giant organization. There are many good data cataloging applications (both open and closed source), and this project does not aim to replace them. However, most of them require the owners/publishers to have access to cloud computing environments or to run their own server. The barrier to entry is quite high for reasons that have everything to do with server management and nothing to do with data management. This project aims to create a data catalog with a low barrier to entry by making use of:

  • Excel/spreadsheets for data entry
    • while spreadsheets are generally a poor choice for (meta)data management, they open up participation in the data cataloging process to a wider range of individuals
    • as users become more familiar with (and power users of) data cataloging, it is recommended to migrate to more robust data cataloging applications. With this future migration in mind, this tool takes care of transforming the catalog into a standardized format (using DCAT, DQV and SKOS). This ensures that users who start here have room to grow without incurring technical debt for starting simple.
  • Markdown for static site generation
    • using GitHub's Pages functionality for the simplest version
    • using more advanced static site frameworks like Hugo and Jekyll for those who feel comfortable with them and want to customize the layout with themes or use the other functionality that those frameworks offer

A very simple example of the data catalog that is generated can be found here.

Features

The data catalog aims to give the following overview:

  • datasets and the formats in which they are published
    • organized by (user defined) keywords
  • (user defined) data quality metrics and measurements
  • data lineage
    • and data lineage/supply chain data quality metrics

An auxiliary motivation is to introduce users to subjects like data cataloging, data quality management and data lineage, by providing a tool that addresses these concepts in a basic way.

On privacy and security

This project allows users to generate a data catalog website relying on static site generation.

When using this functionality in its most basic form (making use of GitHub Pages), write access is managed through the GitHub repository where the data is stored. Read access is wide open for public repositories; for private repositories, it depends on how the organization/user has managed access.

Given this rather crude approach to read/write access (it's a feature, not a bug), users are advised to think carefully about what they publish and who has access to it. Especially where data privacy laws (like the GDPR) are concerned, it is advised not to publish any personally identifiable information (for instance in the dcat:contactPoint field), as doing so typically comes with the legal requirement to introduce (potentially) complex data management processes (that can no longer be classified as 'low barrier to entry').

Datamodel

The data catalog understands the following information.

[Figure: data model]

The datamodel is based on DCAT, SKOS, DQV and a little bit of PROV. For the definitions of each of the classes and attributes, the reader is referred to the respective standards. While all of these standards support a wide variety of concepts and attributes, this project takes a rather opinionated approach to applying these definitions. While this constrains the expressivity that these standards offer, it allows the data catalog to remain 'Simple'.

Using a spreadsheet as input

While directly editing the RDF/ttl file gives much more flexibility and control, the idea is that a simple spreadsheet is sufficient for creating a simple data catalog. In this section you will find instructions on how to fill in the Excel spreadsheet. An example of the spreadsheet can be found here; it is recommended to make a copy of this template and use it.

The spreadsheet has 6 tabs:

  • DataCatalog: this tab can only have 1 entry
  • DataSets: data for the dataset records
  • Distributions: data for the different distributions of the dataset
  • Concepts: the definitions of the keywords/themes with which the datasets are annotated
  • Metrics: the definition of data quality metrics with which the quality of data can be measured
  • QualityMeasurements: quality measurements of specific datasets

DataCatalog

This first tab of the spreadsheet contains information about the data catalog itself.

The definition of data catalog, according to DCAT is:

A curated collection of metadata about resources.

This information will become part of the landing page of the data catalog. NB: Please make sure this tab only contains a single record!

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| dcterms:title | The title of the data catalog as text | no |
| dcterms:description | A more elaborate description of the data catalog | yes |
| dcterms:licence | Either a URL to a license document or the name of a common license (like cc-by-4.0) | yes |
| dcterms:publisher | Either a URL to the website of the publishing organization or the name of the publisher | yes |
| dcat:theme | A comma separated list of keywords. These keywords also need to be defined in the 'Concepts' tab (make sure they are spelled the same; matching is case sensitive), see below | yes |
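As an alternative to an online generator, identifiers can be produced with Python's standard `uuid` module. The sketch below is not part of SimpleMDDataCatalog; `new_identifier` and `is_valid_identifier` are hypothetical helper names, and the validity check only covers the one rule stated above (no spaces).

```python
import uuid


def new_identifier() -> str:
    """Return a random (version 4) UUID as text, usable as a dcterms:identifier."""
    return str(uuid.uuid4())


def is_valid_identifier(value: str) -> bool:
    """Check the stated rule: the identifier is non-empty and has no whitespace."""
    return bool(value) and not any(ch.isspace() for ch in value)


if __name__ == "__main__":
    identifier = new_identifier()
    print(identifier, is_valid_identifier(identifier))
```

The generated value can be pasted into the dcterms:identifier column of any tab.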

Dataset

This tab contains the datasets. Each row is a different dataset. The definition of Dataset according to DCAT is:

A collection of data, published or 
curated by a single agent, and available 
for access or download in one or more 
representations.

For datasets, the following information can be entered:

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| dcterms:title | The title of the dataset as text | no |
| dcterms:description | A more elaborate description of the dataset | yes |
| dcterms:publisher | Either a URL to the website of the publishing organization or the name of the publisher | yes |
| dcat:contactPoint | Either a URL to a website with contact information or an email address | yes |
| dcterms:licence | Either a URL to a license document or the name of a common license (like cc-by-4.0) | yes |
| dcat:version | Version information of the dataset. Semantic versioning example: 1.0.4 | yes |
| dcat:theme | A comma separated list of keywords. These keywords also need to be defined in the 'Concepts' tab (make sure they are spelled the same; matching is case sensitive), see below | yes |
| dcterms:spatial | A description of the region the dataset covers. For example: Ireland | yes |
| dcterms:temporal/time:hasBeginning | The start of the time that is covered by the dataset. For example: 2024 | yes |
| dcterms:temporal/time:hasEnd | The end of the time that is covered by the dataset. For example: 2024 | yes |
| adms:status | Status information of the dataset. For example: "test" or "deprecated" | yes |
| prov:wasDerivedFrom | Data lineage information. A comma separated list of URLs and/or dcterms:identifiers of other datasets that were used to produce this one. For example: 12345, 56789 | yes |
| dcat:distribution | The distributions that are available of this dataset. A comma separated list of dcterms:identifiers of entries in the Distributions tab (see below) | yes |
| dcterms:modified | The date at which the dataset was last modified | yes |
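To illustrate how the comma separated prov:wasDerivedFrom cells describe data lineage, the following sketch walks the references to collect everything a dataset was derived from, directly or indirectly. It does not show how SimpleMDDataCatalog itself processes lineage; `split_cell`, `upstream` and the sample rows are hypothetical.

```python
def split_cell(cell: str) -> list[str]:
    """Split a comma separated spreadsheet cell into trimmed, non-empty values."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def upstream(dataset_id: str, derived_from: dict[str, str]) -> list[str]:
    """Return every direct and indirect source of dataset_id, nearest first."""
    seen: list[str] = []
    queue = split_cell(derived_from.get(dataset_id, ""))
    while queue:
        current = queue.pop(0)
        if current in seen or current == dataset_id:
            continue  # guards against cycles and repeated sources
        seen.append(current)
        # URLs and unknown identifiers have no row, so they end the walk here.
        queue.extend(split_cell(derived_from.get(current, "")))
    return seen


# Hypothetical rows: dcterms:identifier -> raw prov:wasDerivedFrom cell
rows = {
    "12345": "",
    "56789": "12345",
    "99999": "12345, 56789, https://example.org/external-source",
}

if __name__ == "__main__":
    print(upstream("99999", rows))
```

External URLs are kept in the result as-is, since the column accepts both URLs and identifiers.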

Distributions

This tab contains information on the distributions of the datasets. The definition of distribution according to DCAT is:

A specific representation of a dataset. 
A dataset might be available in multiple 
serializations that may differ in various 
ways, including natural language, 
media-type or format, schematic 
organization, temporal and spatial 
resolution, level of detail or profiles 
(which might specify any or all of the above).

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| dcterms:description | A more elaborate description of the distribution | yes |
| dcat:acccessURL | The URL where the distribution can be obtained. | yes |
| dcterms:format | The file format/serialization of the distribution. For example: 'csv' or 'excel' | yes |
| dcat:version | Version information of the distribution. Semantic versioning example: 1.0.4 | yes |
| dcterms:modified | The date at which the distribution was last modified | yes |
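Because datasets refer to distributions by identifier, a typo in either tab leaves a dangling reference. The sketch below shows one way such references could be checked before generating the catalog. It is an illustration only, not the package's own validation; `unresolved_distributions` and the sample data are hypothetical.

```python
def unresolved_distributions(
    datasets: dict[str, str], distribution_ids: set[str]
) -> dict[str, list[str]]:
    """Map each dataset identifier to the distribution ids it lists that do not exist."""
    problems: dict[str, list[str]] = {}
    for dataset_id, cell in datasets.items():
        listed = [part.strip() for part in cell.split(",") if part.strip()]
        missing = [d for d in listed if d not in distribution_ids]
        if missing:
            problems[dataset_id] = missing
    return problems


# Hypothetical data: dataset identifier -> raw dcat:distribution cell
datasets = {"12345": "d-1, d-2", "56789": "d-3", "99999": ""}
# Identifiers present in the Distributions tab
distribution_ids = {"d-1", "d-2"}

if __name__ == "__main__":
    print(unresolved_distributions(datasets, distribution_ids))
```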

Concepts

This tab contains the definitions of the keywords that are used to annotate the datasets and the data catalog. The data in this tab conforms to SKOS (Simple Knowledge Organization System). SKOS defines Concept as:

A SKOS concept can be viewed as an idea 
or notion; a unit of thought. However, 
what constitutes a unit of thought is 
subjective, and this definition is meant 
to be suggestive, rather than restrictive.

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| skos:prefLabel | The preferred label (word) for the concept | no |
| skos:definition | The definition of the term. | yes |
| skos:example | Any examples of the term in use. | yes |
| skos:altLabel | Any alternative labels (words) for the same term, comma separated if there is more than one | yes |
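Since keywords in dcat:theme must match the Concepts tab exactly, including case, a quick check can catch mismatches early. The sketch below assumes that keywords are matched against skos:prefLabel, which the instructions above do not state explicitly; `undefined_themes` is a hypothetical helper, not part of the package.

```python
def undefined_themes(theme_cell: str, pref_labels: set[str]) -> list[str]:
    """Return the keywords in a dcat:theme cell that have no exact, case sensitive match."""
    themes = [t.strip() for t in theme_cell.split(",") if t.strip()]
    return [t for t in themes if t not in pref_labels]


# Hypothetical skos:prefLabel values from the Concepts tab
pref_labels = {"Climate", "Rainfall"}

if __name__ == "__main__":
    # "rainfall" differs from "Rainfall" only by case, so it is reported.
    print(undefined_themes("Climate, rainfall", pref_labels))
```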

Metrics

This tab contains information about the metrics with which data quality is evaluated. The data in this section is modelled to comply with the DQV (Data Quality Vocabulary).

DQV defines Metric as:

Represents a standard to measure a quality 
dimension. An observation (instance of dqv:QualityMeasurement) assigns a value 
in a given unit to a Metric.

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| skos:prefLabel | The preferred label (word) for the metric. | no |
| skos:definition | The definition of the metric. It helps to describe in detail what the metric aims to measure and how it measures it. | no |
| dqv:expectedDataType | The datatype that a measurement of this metric would have. It is advised (but not required) to stick to XSD datatypes | no |
| dqv:inDImension | The quality dimension that the metric aims to capture. It is preferred to use ISO Quality Dimensions. | yes |

QualityMeasurements

This tab contains information on any quality measurements that have been performed on the datasets. DQV defines QualityMeasurement as:

Represents the evaluation of a given 
dataset (or dataset distribution) against 
a specific quality metric.

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| dqv:computedOn | The dcterms:identifier of the dataset on which this measurement was taken | no |
| dqv:isMeasurementOf | The dcterms:identifier of the Metric that this measurement was measured against | no |
| dqv:value | The value of the quality measurement. | no |
| prov:generatedAtTime | The date/datetime at which the measurement was done. | yes |
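A measurement row only makes sense in combination with the dataset and metric it points to. The sketch below shows how the three tabs relate by resolving both references for one row. It is an illustration with hypothetical sample data and a hypothetical `describe` helper, it assumes the optional timestamp is written in ISO 8601 form, and it says nothing about how the package renders measurements.

```python
from datetime import datetime

# Hypothetical lookups built from the DataSets and Metrics tabs
datasets = {"12345": "Rainfall Ireland"}  # dcterms:identifier -> dcterms:title
metrics = {"m-1": "completeness"}         # dcterms:identifier -> skos:prefLabel

# Hypothetical row from the QualityMeasurements tab
measurement = {
    "dcterms:identifier": "q-1",
    "dqv:computedOn": "12345",
    "dqv:isMeasurementOf": "m-1",
    "dqv:value": "0.97",
    "prov:generatedAtTime": "2024-03-01",
}


def describe(row: dict[str, str]) -> str:
    """Resolve the dataset and metric of a measurement row into one readable line.

    Raises KeyError when the row points at an identifier that does not exist.
    """
    dataset = datasets[row["dqv:computedOn"]]
    metric = metrics[row["dqv:isMeasurementOf"]]
    when = row.get("prov:generatedAtTime")  # optional column
    stamp = f" on {datetime.fromisoformat(when).date()}" if when else ""
    return f"{dataset}: {metric} = {row['dqv:value']}{stamp}"


if __name__ == "__main__":
    print(describe(measurement))
```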

DatasetSeries

This tab contains the dataset series. Each row is a different series. The definition of DatasetSeries according to DCAT is:

A collection of datasets that are published separately, but share some characteristics that group them. 

For DatasetSeries, the following information can be entered:

| attribute | instruction | optional? |
| --- | --- | --- |
| dcterms:identifier | A unique identifier without any spaces. It is recommended to use a UUID, which can be generated using a tool like this | no |
| dcterms:title | The title of the dataset series as text | no |
| dcterms:description | A more elaborate description of the dataset series | yes |
| dcterms:publisher | Either a URL to the website of the publishing organization or the name of the publisher | yes |
| dcat:contactPoint | Either a URL to a website with contact information or an email address | yes |
| dcterms:licence | Either a URL to a license document or the name of a common license (like cc-by-4.0) | yes |
| dcat:version | Version information of the dataset. Semantic versioning example: 1.0.4 | yes |
| dcat:theme | A comma separated list of keywords. These keywords also need to be defined in the 'Concepts' tab (make sure they are spelled the same; matching is case sensitive), see below | yes |
| dcterms:spatial | A description of the region the dataset covers. For example: Ireland | yes |
| dcterms:temporal/time:hasBeginning | The start of the time that is covered by the dataset. For example: 2024 | yes |
| dcterms:temporal/time:hasEnd | The end of the time that is covered by the dataset. For example: 2024 | yes |
| adms:status | Status information of the dataset. For example: "test" or "deprecated" | yes |
| prov:wasDerivedFrom | Data lineage information. A comma separated list of URLs and/or dcterms:identifiers of other datasets that were used to produce this one. For example: 12345, 56789 | yes |
| dcat:distribution | The distributions that are available of this dataset. A comma separated list of dcterms:identifiers of entries in the Distributions tab (see below) | yes |
| dcterms:modified | The date at which the dataset was last modified | yes |
