Metasheet parser, serializers, and repository manager
Project description
metasheet
A Python library for generating metadata around statsitcial and scientific datasets in various formats from an Excel or Google spreadsheet. Primarily intended for use with various MTNA packages or DDI aware tools.
This package requires Python 3.
How it works
Each sheet in the workbook is used to describe a particular type of resource, currently classification, code, variable, record, layout. Rule will be added shortly.
Each column in the sheet maps to a particular resource property. The column header row is used to detemerine such correspondence. The name in the sheet may be different from the actual property name, and these mappings are defined in a JSON configuration file using regular expressions. A default configuration is provided as a recommended version, but this can be customized. Colums for which no mapping is found are ignored.
Running the tools
From the metasheet directory, run
python metasheet -h
usage: metasheet.py [-h] [-c CONFIG] [-a] [-bq] [-ddi 2.5|3.1] [-pandas] [-rds]
[-rml] [-sts]
excel_file
Currently supported formats include:
- bq: To generate a table for hosting data in Google Big Query
- ddi: Various versions/flavors of DDI-XML (can be used with SledgeHammer)
- rds: JSON format used by MTNA Rich Data Services
- rml: a low level syntax used by MTNA Atua framework
- sts: Stat/Transfer syntax that can be used with MTNA SledgeHammer for reading data
- vertica: SQL syntax to create tables in HP Vertica
Use -a
to generate all possible outputs.
Note that this package requires Python version 3 (may work with 2).
Configuration
The package is driven by a configuration file that maps the sheet names and columns into resources and properties. This usually does not need to be changed unless you use custome properties. See default config.json
metasheet/config.json] in package source.
The configuration file provides metasheet with the information it needs to process the worksheet. This includes:
sheetRegex
: this regular expression, specified for each resources type, is used to match a sheet name with the particular type. For example,"sheetRegex":"^variable.*"
is the defualt used to match variabel sheets.propertyMaps
: holds information about mappin columns header names to resource properties. Property maps can be specified at the resource type or global level (in which case they apply to all resources. See below for
Property Maps
A property maps matches a sheet column with a specific resource property. This is use to allow different names to be used in the sheets (rather than restricting to a list of internal names). Property maps can exist at the global or resource type level, the later taking precedence.
A recource map must contain the following elements:
- regex: the regular expression that the column header name is matching.
- property: the internal property name
Optional elements can be included to suport specific serializers (e.g. rml, rds, etc.)
For example:
{"regex":"abbr|abbreviation","property":"name[abbreviation]","rml":"*"},
maps the abbr
or abbreviation
column into the property name[abbreviation]
Note that for faceted properties, a special "named" search group is introduced to determine the facet components. In Python, this takes the form ?P<facets>
as a group prefix. The following for example maps a name
column into the into name
property but would also map name[label] into name[label]
{"regex":"^name(\\[(?P<facets>.*)\\])?$","property":"name"}
Note that the Layout resource type also inherits all the properties of the variable type.◊
Resources
All resources have a uid
property generated as a UUID
The following properties can be associated with any resource (not all being relevant to all recources):
name[abbreviation]
: the container this resource will be stored in*bank
: the container that "defines / holds this resource. Every resource belongs to a bank (implicitely or explicitely).basis
: a reference to a resource of the same type from which properties can be inherited.clbank
: a classfication bank identiferid
: the resource identifier (must be unique in the workbook or bank)name
: the resource name
In most cases, the id
and name
are the same so specifying only one of them is typically sufficient.
Additional serialzier specific map properties can also be used. For example:
rml
: If present, indicates that this property should be set on RML resources. A "*" indicates that the RML property has the same name.rds
: If present, indicates that this property should be set on RDS resources. A "*" indicates that the RML property has the same name.
References
A reference to another resource must be unique within the worksheet to ensure proper resolution (). References can be specified in various ways, usually using two properties representing a bank
and an id
, or using the dot notation like bank.id
.
Inheritance
The basis
property on a resourceis a reference to another resource of the same type whose properties are inherited if not locally specified.
Facets
Certain properties can be faceted, this includes name
, description
, dataType
, and more.
Classifications
A classification is composed of the classification definition and its code list. These are captured in two seperate sheets.
Classification
The following properties are specific to the Classification:
- n/a
Code
The following properties are specific to the Code:
classification
: the classification the code belongs tovalue
: The code value
Variables
Variable must have at least a name
, a bank
, and typically a datatype.
The following properties are specific to the Variable:
classification
: a reference to the classification used by this variabledatatype
: a generic or faceted data typedecimals
: for numeric variables, the number of digits after the decimal point (0 implies integer)end
: the end column position when used in a record layout serialized in fixed ASCIIprofile
: ?start
: the start column position when used in a record layout serialized in fixed ASCIIunits
: a list generic or faceted units relevant for this variable (observation, analysis)width
: the total width of the variable content (e.g. for storing data in a fixed width ASCII file)
Record Layout
A record layout has two components/sheets: Record and Layout
Record
Capture properties for entire record layout
Required elements are: bank
, id
, name
The following properties are specific to the Record:
varbank
: a default variable bank id that applies to all variables in the layoutunit
: one or more record unit types (semi-colon separated)pk|primary
: one or more variables defining the record primary key. Compound keys variables are separate with a "+".fk|foreign
: one or more sets of variable(s) defining the record foreigns keys. Compound keys variables are separate with a "+", and multiple keys are separated by a ";".
Layout
A Layout create or add variable to a record
Variables can be either locally created or included by reference (if no additional properties are required for the record layout). This is determined by the presence of certain properties. If no other properties than the record and variable are specified, the variable in included by references. Otherwise it is locally defined. A layout technically mix both.
The following properties are specific to the Layout:
variable
: the base variable identifiervarbank
: the variable bank holding the referenced variable. This can override the record level varbank.rlbank
: the bank holding the recordrecord
: the record identifier
In addition, all the properties supported by the Variable resource can also be used (e.g. to locally define a variable)
For a Layout, the basis property (if not set) is automatically inferred from the variable and varbank properties.
Rules
Rules are used by the MTNA Atua framework quality assurance and ETL components, and mainly used at this time by our Resource Modeling Language (RML) serializer. Consult the RML documentation to better understand how the rules works and their syntax.
The following properties are specific to the Layout
resource
: the resource the rule applies to (may be supplemented by a bank properity)assert
: the expression statement to testcondition
: the condition under which the rule appliesonFail
: the action to take when the rule assertion failsonSuccess
: the action to take when the rule assertion succeedscontextVariables
: a comma separate list of variables whose values will be reported with the onPass/onFail messages
Note that rules do not require an unique identifier at this time.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file mtna-metasheet-0.1.1.tar.gz
.
File metadata
- Download URL: mtna-metasheet-0.1.1.tar.gz
- Upload date:
- Size: 25.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8fcd3078e76f7e2db148f85fe4b0f9a885e9ae1f594217cd3968beb7001f6681 |
|
MD5 | e360e5cdb9c03db464631764ad53cf4f |
|
BLAKE2b-256 | b6ddc9a95dd65986117995173e6ebfbb4cc81d965b441febcebd321d08196427 |
File details
Details for the file mtna_metasheet-0.1.1-py3-none-any.whl
.
File metadata
- Download URL: mtna_metasheet-0.1.1-py3-none-any.whl
- Upload date:
- Size: 32.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c1c037d4ea0a9502deb980af1dbc351c3c7af0e17ec3199bbddb86f481dd13c1 |
|
MD5 | 2a889d05fa9292f8463ca20fc0c0114b |
|
BLAKE2b-256 | b71cb64c3d55356127c41f79bc57c1243957731bb379ccd18ef2f0af9013615d |