Package to author schemas using spreadsheets

These details have been verified by PyPI

Maintainers

cmungall mamillerpa smoxon sujaypatil96

These details have not been verified by PyPI

Project links

Project description

schemasheets - structuring your data using spreadsheets

Create a data dictionary / schema for your data using simple spreadsheets - no coding required.

Author your schema as a google sheet or excel spreadsheet
Generate schemas:
- LinkML
- SHACL and ShEx
- JSON-Schema
- SQL DDL
- OWL
Validate data automatically

See the test google sheets for examples

See also the examples folder which has an end-to-end example

How it works

The following example shows a schema sheet for a schema that is focused around the concept of a Person. The rows in the sheet describe either classes or slots (fields)

record	field	key	multiplicity	range	desc	schema.org
`>` class	slot	identifier	cardinality	range	description	`exact_mappings: {curie_prefix: sdo}`
-	id	yes	1	string	any identifier	identifier
-	description	no	0..1	string	a textual description	description
Person		n/a	n/a	n/a	a person,living or dead	Person
Person	id	yes	1	string	identifier for a person	identifier
Person, Organization	name	no	1	string	full name	name
Person	age	no	0..1	decimal	age in years	-
Person	gender	no	0..1	decimal	age in years	-
Person	has medical history	no	0..*	MedicalEvent	medical history	-
MedicalEvent		n/a	n/a	n/a	-	-

personinfo google sheet

The sheet is structured as follows:

The first line is a header line. You get to decide the column headers
Subsequent lines starting with > are column descriptors
- these provide a way to interpret the columns
- descriptors can be drawn from the linkml vocabulary
Remaining rows are elements of your schema
- Each element gets its own row
- A row can represent a class (record, table), field (column), enumeration, or other element types

The most basic schema concepts are classes and slots

classes represent record types, similar to tables in a database or sheets in a spreadsheet
slots represent fields, similar to columns in a database or spreadsheet

These can be used in combination:

If a class is provided, but a slot is not, then the row represents a class.
If a slot is provided, but a class is not, then the row represents a slot (field)
If both class and slot are provided, then the row represents the usage of a slot in the context of a class

Generating schemas

Assuming your schema is arranged as a set of sheets (TSV files) in the src folder:

sheets2project -d . src/*.tsv

This will generate individual folders for jsonschema, shacl, ... as well as a website that can be easily hosted on github.

To create only LinkML yaml:

schemasheets -o my.yaml  src/*.tsv

Simple data dictionaries

This framework allows you to represent complex relation-style schemas using spreadsheets/TSVs. But it also allows for representation of simple "data dictionaries" or "minimal information lists". These can be thought of as "wide tables", e.g. representing individual observations or observable units such as persons or samples.

TODO

Prefixes

If you specify a column descriptor of prefix, then rows with that column populated will represent prefixes. The prefix expansion is specified using prefix_reference

Example:

prefix	URI
`>` prefix	prefix_reference
sdo	http://schema.org/
personinfo	https://w3id.org/linkml/examples/personinfo/
famrel	https://example.org/FamilialRelations#
GSSO	http://purl.obolibrary.org/obo/GSSO_

We recommend you specify prefixes in their own sheet.

If prefixes are not provided, and you do not specify --no-repair then prefixes will be inferred using bioregistry

Schema-level metadata

If you specify a column descriptor of schema, then rows with that column populated will represent schemas.

Example:

Schema	uri	Desc	Schema Prefix
`>` schema	id	description	default_prefix
PersonInfo	https://w3id.org/linkml/examples/personinfo	Information about people, based on schema.org	personinfo

The list of potential descriptors for a schema can be found by consulting SchemaDefinition in the LinkML metamodel.

Both id and name are required, these will be auto-filled if you don't fill this in.

Populating the fields description and license is strongly encouraged.

Currently multiple schemas are not supported, we recommend providing a single-row sheet for schema metadata

Enums

Two descriptors are provided for enumerations:

enum
permissible_value

These can be used in combination:

If enum is provided, and permissible_value is not, then the row represents an enumeration
If both enum and permissible_value are provided, the row represents a particular enum value

The following example includes two enums:

ValueSet	Value	Mapping	Desc
`>` enum	permissible_value	meaning	description
FamilialRelationshipType	-	-	familial relationships
FamilialRelationshipType	SIBLING_OF	famrel:01	share the same parent
FamilialRelationshipType	PARENT_OF	famrel:02	biological parent
FamilialRelationshipType	CHILD_OF	famrel:03	inverse of parent
GenderType	-	-	gender
GenderType	nonbinary man	GSSO:009254	-
GenderType	nonbinary woman	GSSO:009253	-
...	...	...	-

enums google sheet

All other descriptors are optional, but we recommend you provide descriptions of both the enumeration and the meaning descriptor which maps the value to a vocabulary or ontology term.

For more on enumerations, see the linkml tutorial

Specifying cardinality

See configschema.yaml for all possible vocabularies, these include:

UML strings, e.g. '0..1'
text strings matching the cardinality vocabulary, e.g. 'zero to one'
codes used in cardinality vocabulary, e.g. O, M, ...

The vocabulary maps to underlying LinkML primitives:

Slot-class grids

If you have a large number of fields/columns, with varying applicability/cardinality across different classes, it can be convenient to specify this as a grid.

An example is a minimal information standard that includes different packages or checklists, e.g. MIxS.

For example:

term	title	desc	mi_patient	mi_mod	mi_terrestrial	mi_marine	mi_extraterrestrial
`>` slot	title	description	cardinality	cardinality	cardinality	cardinality	cardinality
`>`			`applies_to_class: MI patient`	`applies_to_class: MI model organism`	`applies_to_class: MI terrestrial sample`	`applies_to_class: MI marine sample`	`applies_to_class: MI extraterrestrial sample`
id	unique identifier	a unique id	M	M	M	M	M
alt_ids	other identifiers	any other identifiers	O	O	O	O	O
body_site	body site	location where sample is taken from	M	R	-	-	-
disease	disease status	disease the patient had	M	O	-	-	-
age	age	age	M	R	-	-	-
depth	depth	depth in ground or water	-	-	R	R	R
alt	altitude	height above sea level			R	R	R
salinity	salinity	salinity			R	R	R
porosity	porosity	porosity
location	location	location on earth
astronomical_body	astronomical body	planet or other astronomical object where sample was collected					M

data dictionary google sheet

Here the applies_to_class descriptor indicates that the column value for the slot indicated in the row is interpreted as slot usage for that class.

Metatype fields

In all of the examples above, distinct descriptors are used for class names, slot names, type names, enum names, etc

An alternative pattern is to mix element types in a single sheet, indicate the name of the element using name and the type using metatype.

For example:

type	item	applies to	key	multiplicity	range	parents	desc	schema.org	wikidata	belongs	status	notes
`>` metatype	name	class	identifier	cardinality	range	is_a	description	`exact_mappings: {curie_prefix: sdo}`	`exact_mappings: {curie_prefix: wikidata}`	in_subset	status	ignore
`> vmap: {C: class, F: slot}`
`>`									curie_prefix: wikidata		`vmap: {T: testing, R: release}`
F	id		yes	1	string		any identifier	identifier
F	name	Person, Organization	no	1	string		full name	name
F	description		no	0..1	string		a textual description	description
F	age	Person	no	0..1	decimal		age in years
F	gender	Person	no	0..1	decimal		age in years
F	has medical history	Person	no	0..*	MedicalEvent		medical history				T
C	Person						a person,living or dead	Person	Q215627		R
C	Event						grouping class for events		Q1656682	a	R
C	MedicalEvent					Event	a medical encounter			b	T
C	ForProfit					Organization
C	NonProfit					Organization			Q163740			foo

personinfo with tyoes

Formal specification

In progress. The following is a sketch. Please refer to the above examples for elucidation.

The first line is a HEADER line.
- Each column must be non-null and unique
- In future grouping columns may be possible
Subsequent lines starting with > are column configurations
- A column configuration can be split over multiple lines
- Each line must be a valid yaml string (note that a single token is valid yaml for that token)
- The first config line must include a descriptor
- Subsequent lines are settings for that descriptor
- A descriptor can be one of:
  - Any LinkML metamodel slot (e.g. description, comments, required, recommended, multivalued)
  - The keyword cardinality
  - An element metatype (schema, prefix, class, enum, slot, type, subset, permissible_value)
- Setting can be taken from configschema.yaml
  - vmap provides a mapping used to translate column values. E.g. a custom "yes" or "no" to "true" or "false"
  - various keys provide ways to auto-prefix or manipulate strings
Remaining rows are elements of your schema
- Each element gets its own row
- A row can represent a class (record, table), field (column), enumeration, or other element types
- The type of the row is indicated by whether columns with metatype descriptors are filled
  - E.g. if a column header "field" has a descriptor "slot" then any row with a non-null value is interpreted as a slot
- If a metatype descriptor is present then this is used
- A row must represent exactly one element type
- If both class and slot descriptors are present then the row is interpreted as a slot in the context of that class (see slot_usage)
All sheets/TSVs are combined together into a single LinkML schema as YAML
This LinkML schema can be translated to other formats as per the LinkML generators

Working with files / google sheets

This tool takes as input a collection of sheets, which are stored as TSV files.

You can make use of various ways of managing/organizing these:

TSVs files maintained in GitHub
Google sheets
Excel spreadsheets
SQLite databases

Tips for each of these and for organizing your information are provided below

Multiple sheets vs single sheets

It is up to you whether you represent your schema as a single sheet or as multiple sheets

However, if your schema includes a mixture of different element types, you may end up with a lot of null values if you have a single sheet. It can be more intuitive to "normalize" your schema description into different sheets:

sheets for classes/slots
sheets for enums
sheets for types

Currently schemasheets has no built in facilities for interacting directly with google sheets - it is up to you to both download and upload these

TODO: scripts for merging/splitting sheets

Manual upload/download

Note that you can create a URL from a google sheet to the TSV download - TODO

COGS

We recommend the COGS framework for working with google sheets

cogs

A common pattern is a single sheet document for a schema, with different sheets/tabs for different parts of the schema

TODO: example

Working with Excel spreadsheets

Currently no direct support, it is up to you to load/save as individual TSVs

Working with SQLite

Project details

These details have been verified by PyPI

Maintainers

cmungall mamillerpa smoxon sujaypatil96

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.1

Aug 3, 2024

0.3.0

Jul 30, 2024

0.2.1

Mar 4, 2024

0.2.0

Feb 27, 2024

0.1.24

Sep 29, 2023

0.1.23

Jul 17, 2023

0.1.22

Jul 14, 2023

0.1.21

Mar 30, 2023

0.1.20

Mar 29, 2023

0.1.19

Feb 14, 2023

0.1.17

Dec 6, 2022

0.1.16

Oct 3, 2022

0.1.15

Sep 16, 2022

0.1.14

Aug 26, 2022

0.1.13

Jul 20, 2022

0.1.12

Jul 13, 2022

0.1.11

Jul 12, 2022

0.1.10

Apr 28, 2022

0.1.9

Mar 16, 2022

0.1.8

Mar 8, 2022

0.1.7

Feb 8, 2022

0.1.6

Feb 3, 2022

0.1.5

Feb 2, 2022

This version

0.1.4

Jan 13, 2022

0.1.1

Jan 11, 2022

0.1.0

Jan 11, 2022

0.0.1

Mar 8, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schemasheets-0.1.4.tar.gz (23.8 kB view hashes)

Uploaded Jan 13, 2022 Source

Built Distribution

schemasheets-0.1.4-py3-none-any.whl (21.2 kB view hashes)

Uploaded Jan 13, 2022 Python 3

Hashes for schemasheets-0.1.4.tar.gz

Hashes for schemasheets-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`eaaeeea810752c8cab8f93048894c5c76cd42529da972031d3fed7af16dc2c3c`
MD5	`300985c23e68ea855b4e67d22f650468`
BLAKE2b-256	`e85b76ea7d798662292051632d423ec5fc957dee84be3ced5f627352447fa81b`

Hashes for schemasheets-0.1.4-py3-none-any.whl

Hashes for schemasheets-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`12bdcabfc8bc37cecf2c2140b9afdbab3af6f8f79c8c20fff33cdf8dce11f7a8`
MD5	`bf1a5f082193b3f54d755ae70e6d57d2`
BLAKE2b-256	`1d16d3a1911530a090bf8e904ebb7a6628316b6da74ad9221853dd1d172f70bb`