metaform

A utility for defining metadata for data types and formats.

Project description

One star knowing it all…

# HAVING DATA:
DATA = {
  'a': 1.5,
  'b': 1458266965.250572,
  'c': [{'x': {'y': 'LT121000011101001000'}}, {'z': 'Omega'}]}

# GETTING KNOWLEDGE:
SCHEMA = {
  'a': 'price#EUR|to.decimal',
  'b': 'timestamp#date|to.unixtime',
  'c': [{'*': 'contributions',
    'x': {'*': 'origins', 'y': 'account#IBAN|to.string'},
    'z': 'company#name|to.string'}],
}

# NORMALIZATION
metaform.load(DATA).format(SCHEMA)

# OR USE '*' TO REFERENCE SCHEMA TO PACK SCHEMA WITH DATA PACKETS:
metaform.load(dict(DATA, **{'*': 'https://github.com/wefindx/schema/wiki/Sale#test'})).format()

metaform

Metaform is a package for hierarchical and nested data normalization.

https://wiki.mindey.com/shared/shots/53dcf81b7efd0573f07c5f562.png

Basic Usage

pip install metaform

import metaform

If your data had an asterisk~!

# INPUT
metaform.load({
  '*': 'https://github.com/mindey/terms/wiki/person#foaf',
  'url': 'http://dbpedia.org/resource/John_Lennon',
  'fullname': 'John Lennon',
  'birthdate': '1940-10-09',
  'spouse': 'http://dbpedia.org/resource/Cynthia_Lennon'
}).format(refresh=True)
# (schemas are cached locally, pass refresh=True to redownload)

# OUTPUT
{
  '*': 'GH:mindey/terms/person#foaf',
  'jsonld:id': 'http://dbpedia.org/resource/John_Lennon',
  'foaf:name': 'John Lennon',
  'schema:birthDate': datetime.datetime(1940, 10, 9, 0, 0),
  'schema:spouse': 'http://dbpedia.org/resource/Cynthia_Lennon'
}

For example, if you read an API data source:

data = metaform.load('https://www.metaculus.com/api2/questions/')
data['*'] ='https://github.com/wefindx/schema/wiki/Topic#metaculuscom-question'
data.format()
# Try it!

Or, if your filenames had references to schema~

df = metaform.read_csv(
  'https://gist.githubusercontent.com/mindey/3f2596e108a5c151f32e1967275a7689/raw/7c4c963219255008fdb438e8b9777cd658eea02e/hello-world.csv',
  schema={
    0: 'Timestamp|to.unixtime',
    1: 'KeyUpOrDown|lambda x: x=="k↓" and "KeyDown" or (x=="k↑" and "KeyUp")',
    2: 'KeyName'},
  header=None
)

Alternatively, save schema to wiki like here, and include the schema token inside filename by encoding it as sub-extension, that is, rename hello-world.csv to hello-world.GH~mindey+schema+KeyEvent@mykeylogger-01.csv:

# To get schema token for filename (GH~mindey+schema+KeyEvent@mykeylogger-01) do:
metaform.metawiki.url2ext('https://github.com/mindey/schema/wiki/KeyEvent#mykeylogger-01')

# Then rename filename in the source, and just read file remotely or locally from disk:
df = metaform.read_csv('https://gist.githubusercontent.com/mindey/f33978b31468097b5003f032d5d85eb8/raw/9541191e4d99c052a7668223697ef0ef9ce37977/hello-world.GH~mindey+schema+KeyEvent@mykeylogger-01.csv', header=None)

So, what’s happening here?

metaform.load( DATA ).format( SCHEMA )

Let’s say we have some data:

data = {
    'hello': 1.0,
    'world': 2,
    'how': ['is', {'are': {'you': 'doing'}}]
}

We can get the template for defining schema, by metaform.template:

metaform.template(data)

{'*': '',
 'hello': {'*': ''},
 'how': [{'*': '', 'are': {'you': {'*': ''}}}],
 'world': {'*': ''}}

This provides an opportunity to specify metadata for each key and the object itself. For example:

schema = {                        # A #    schema = {
    '*': 'greeting',              # L #        '*': 'greeting',
    'hello': {'*': 'length'},     # T #        'hello': 'length',
    'world': {'*': 'atoms'},      # E #        'world': 'atoms',
    'how': [                      # R #        'how': [
         {'*': 'method',          # N #             {'*': 'method',
          'are': {                # A #              'are': {
              '*': 'yup',         # T #                  '*': 'yup',
              'you': {'*': 'me'}} # I #                  'you': {'*': 'me'}}
         }                        # V #             }
    ]}                            # E #        ]}

metaform.normalize(data, schema)

{'atoms': 2, 'length': 1.0, 'method': ['is', {'yup': {'me': 'doing'}}]}

We recommend saving schemas you create for normalizations for data analytics and driver projects in dot-folders .schema, in a JSON or YAML files in that folder.

So, we have access to all keys, and can specify, what to do with them:

schema = {
    '*': 'greeting',
    'hello': 'length|lambda x: x+5.',
    'world': 'atoms|lambda x: str(x)+"ABC"',
    'how': [
         {'*': 'method',
          'are': {
              '*': 'yup',
              'you': {'*': 'me|lambda x: "-".join(list(x))'}}
         }
    ]}

metaform.normalize(data, schema)

{'atoms': '2ABC',
 'length': 6.0,
 'method': ['is', {'yup': {'me': 'd-o-i-n-g'}}]}

And suppose, we want to define a more complex function, inconvenient via lambdas:

from metaform import converters

def some_func(x):
    a = 123
    b = 345
    return (b-a)*x

converters.func = some_func

schema = {
    '*': 'greeting',
    'hello': 'length|to.func',
    'world': 'atoms|lambda x: str(x)+"ABC"',
    'how': [
         {'*': 'method',
          'are': {
              '*': 'yup',
              'you': {'*': 'me|lambda x: "-".join(list(x))'}}
         }
    ]}

metaform.normalize(data, schema)

{'atoms': '2ABC',
 'length': 222.0,
 'method': ['is', {'yup': {'me': 'd-o-i-n-g'}}]}

We just renamed the keys, and normalized values! What else could we want?

Normalizing Data

Suppose we have similar data from different sources. For example, topics and comments are not so different after all, because if a comment becomes large enough, it can stand as a topic of its own.

topics = requests.get('https://api.infty.xyz/topics/?format=json').json()['results']
comments = requests.get('https://api.infty.xyz/comments/?format=json').json()['results']

Let’s define templates for them, with the key names and types to match:

topics_schema = [{
  'id': 'topic-id',
  'type': '|lambda x: {0: "NEED", 1: "GOAL", 2: "IDEA", 3: "PLAN", 4: "STEP", 5: "TASK"}.get(x)',
  'owner': {'id': 'user-id'},
  'blockchain': '|lambda x: x and True or False',
}]

normal_topics = metaform.normalize(topics, topics_schema)

topics_df = pandas.io.json.json_normalize(normal_topics)
topics_df.dtypes

blockchain             bool
body                 object
categories           object
categories_names     object
children             object
comment_count         int64
created_date         object
data                 object
declared            float64
editors              object
funds               float64
is_draft               bool
languages            object
matched             float64
owner.user-id         int64
owner.username       object
parents              object
title                object
topic-id              int64
type                 object
updated_date         object
url                  object
dtype: object

comments_schema = [{
  'id': 'comment-id',
  'topic': 'topic-url',
  'text': 'body',
  'owner': {'id': 'user-id'},
  'blockchain': '|lambda x: x and True or False',
}]

normal_comments = metaform.normalize(comments, comments_schema)

comments_df = pandas.io.json.json_normalize(normal_comments)
comments_df.dtypes

assumed_hours      object
blockchain           bool
body               object
claimed_hours      object
comment-id          int64
created_date       object
donated           float64
languages          object
matched           float64
owner.user-id       int64
owner.username     object
parent             object
remains           float64
topic-url          object
updated_date       object
url                object
dtype: object

df = pandas.concat([topics_df, comments_df], sort=False)

Project details

Release history Release notifications | RSS feed

This version

1.0.2.4

Jun 9, 2020

1.0.2.3

Jun 9, 2020

1.0.2.2

Nov 11, 2019

1.0.2.1

Nov 11, 2019

1.0.2.0

Nov 11, 2019

1.0.1.9

Nov 11, 2019

1.0.1.8

Nov 9, 2019

1.0.1.7

Nov 9, 2019

1.0.1.6

Nov 9, 2019

1.0.1.5

Nov 9, 2019

1.0.1.4

Nov 9, 2019

1.0.1.3

Nov 7, 2019

1.0.1.2

Oct 24, 2019

1.0.1.1

Oct 21, 2019

1.0.1

Oct 21, 2019

1.0.0

May 5, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metaform-1.0.2.4.tar.gz (18.1 kB view details)

Uploaded Jun 9, 2020 Source

File details

Details for the file metaform-1.0.2.4.tar.gz.

File metadata

Download URL: metaform-1.0.2.4.tar.gz
Upload date: Jun 9, 2020
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.7

File hashes

Hashes for metaform-1.0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`82ad18862db4c0b5423128c8a8150d23ffdecf98ed630cd179a6e4309b24b55f`
MD5	`304cf17bf7a261af65304b3be5b45f85`
BLAKE2b-256	`796bf365b0fcf0c99133085b9c3b93928f78f913ebe10516b33e4ed4a1473ffd`

See more details on using hashes here.

metaform 1.0.2.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

metaform

Basic Usage

If your data had an asterisk~!

For example, if you read an API data source:

Or, if your filenames had references to schema~

So, what’s happening here?

Normalizing Data

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes