A shape language for arbitrary data

experimental

And-Or Shape (aos) Language

Writing data pipelines involves complex transformations over nested data, e.g., lists of dictionaries or dictionaries of tensors.

  • The shape of nested data is not explicit in code and hence not readily accessible to the developer.
  • This leads to cognitive burden (guessing shapes), technical debt, and inadvertent programming errors.
  • Data pipelines become opaque and hard to examine and comprehend.

aos is a compact, regex-like language for describing the shapes (schemas) of both homogeneous (tensors) and heterogeneous (dictionaries, tables) data, and their combinations, independent of the specific data library.

  • Based on an intuitive regex-like algebra of data shapes.
  • Infer aos shape from a data instance: aos.infer.infer_aos.
  • Validate data against aos shapes anywhere: aos.checker.instanceof.
  • Transform data using aos shapes, declaratively: aos.tfm.do_tfm.
  • Allows writing explicit data shapes, inline in code. In Python, use type annotations.
  • Write shapes for a variety of data conveniently: Python native objects (dict, list, scalars), tensors (numpy, pytorch, tf), pandas, hdf5, tiledb, xarray, struct-tensor, etc.

Installation

pip install aos

Shape of Data?

Consider a few quick examples.

  • The shape of scalar data is simply its type, e.g., int, float, str, ...
  • For nested data, e.g., a list of ints: (int)*
  • For a dictionary of the form {'a': 3, 'b': 'hi'}, the shape is (a & int) | (b & str).

Now we can describe the shape of arbitrary nested data with these & (and) and | (or) expressions. Intuitively, a list is an or-structure, a dictionary is an or of ands, a tensor is an and-structure, and so on.

  • Why is a list an or-structure? Ask: how do we access any value v in the list? Choose some index of the list, corresponding to the value v.
  • Similarly, a dictionary is an or-and structure: we pick one of the keys, together (and) with its value.
  • In contrast, an n-dimensional tensor has an and-shape: we must choose indices from all the dimensions of the tensor to access a scalar value.
  • In general, for a data structure, we ask: what choices must we make to access a scalar value?
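
The question "what choices must we make to access a scalar value?" can be made concrete with a small sketch in plain Python. It does not use aos; the helper name scalar_paths is hypothetical, and nested lists stand in for a 2 x 3 tensor. Each path lists the choices needed to reach one scalar: a list or dictionary needs one choice among alternatives (or), while the tensor needs one index per dimension, all together (and).

```python
def scalar_paths(obj, prefix=()):
    """Yield (path, value) for every scalar reachable inside obj.

    Hypothetical helper for illustration only; not part of aos.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from scalar_paths(value, prefix + (key,))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from scalar_paths(value, prefix + (index,))
    else:
        yield prefix, obj


int_list = [10, 20, 30]
record = {'a': 3, 'b': 'hi'}
tensor = [[1, 2, 3], [4, 5, 6]]  # nested lists standing in for a 2 x 3 tensor

# or-structure: pick ONE index (or one key) to reach a scalar
list_paths = [path for path, _ in scalar_paths(int_list)]
dict_paths = [path for path, _ in scalar_paths(record)]

# and-structure: every path needs an index for BOTH dimensions
tensor_paths = [path for path, _ in scalar_paths(tensor)]

print(list_paths)    # [(0,), (1,), (2,)]
print(dict_paths)    # [('a',), ('b',)]
print(tensor_paths)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```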

Thinking in terms of and-or shapes takes a bit of practice initially. Read more about the and-or expressions here.

More complex aos examples

  • Lists over shape s are denoted as (s)*. Shorthand for (s|..|s).
  • Dictionary: (k1 & v1) | (k2 & v2) | ... | (kn & vn), where ki and vi are the ith key and value.
  • Pandas tables: n & ((c1 & int) | (c2 & str) | ... | (cn & str)), where n is the row dimension (the number of rows) and c1, ..., cn are column names.

The aos expressions are very compact. For example, consider a highly nested Python object X of type

Sequence[Tuple[Tuple[str, int], Dict[str, str]]]

This is both verbose and hard to interpret. Instead, X's aos is written compactly as ((str|int) | (str : str))* .

The full data shape is often irrelevant. To keep shapes brief, the language supports two wildcards for writing partial shapes: _ (matches a scalar) and ... (matches an arbitrary sub-tree).

So, we could write a dictionary's shape as (k1 & ...)| ... | (kn & ...).
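
A minimal sketch of how partial shapes with wildcards could be checked, in plain Python. This is not the aos implementation or its syntax. The function name matches, the pattern encoding (a dict for key & value pairs, a one-element list for (s)*, the string '_' and Python's Ellipsis for the two wildcards), and the choice to ignore keys that the pattern does not mention are all assumptions made for illustration.

```python
SCALARS = (bool, int, float, str)


def matches(value, pattern):
    """Toy partial-shape check (hypothetical; not the aos checker)."""
    if pattern is Ellipsis:            # '...': any object or sub-tree
        return True
    if pattern == '_':                 # '_': any scalar
        return isinstance(value, SCALARS)
    if isinstance(pattern, type):      # a type name such as int or str
        return type(value) is pattern
    if isinstance(pattern, dict):      # only the listed keys are checked
        return isinstance(value, dict) and all(
            key in value and matches(value[key], sub)
            for key, sub in pattern.items()
        )
    if isinstance(pattern, list):      # [p] plays the role of (p)*
        return isinstance(value, list) and all(
            matches(item, pattern[0]) for item in value
        )
    return False


door = {
    'id': 1,
    'name': 'A green door',
    'dimensions': {'width': 5, 'height': 10},
    'tags': ['home', 'green'],
}

ok_partial = matches(door, {'id': int, 'dimensions': ..., 'tags': [str]})
ok_scalars = matches(door, {'id': '_', 'name': '_'})
bad_scalar = matches(door, {'dimensions': '_'})  # a dict is not a scalar
bad_type = matches(door, {'id': str})            # id is an int

print(ok_partial, ok_scalars, bad_scalar, bad_type)  # True True False False
```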

Shape Inference

Unearthing the shape of opaque data instances, e.g., returned from a web request, or passed into a function call, is a major pain.

  • Use aos.infer.infer_aos to obtain compact shapes of arbitrary data instances.
  • From the command line, run: aos-infer <filename.json>

from aos.infer import infer_aos

def test_infer():

  d = {
      "checked": False,
      "dimensions": { "width": 5, "height": 10},
      "id": 1,
      "name": "A green door",
      "price": 12.5,
      "tags": ["home","green"]
  }

  infer_aos(d) 

  # ((checked & bool) 
  # | (dimensions & ((width & int) | (height & int)))
  # | (id & int) | (name & str) | (price & float) | (tags & (str *)))

  dlist = []
  for i in range(100):
      d['id'] = i
      dlist.append(d.copy())

  infer_aos(dlist) 

  # ((checked & bool) 
  # | (dimensions & ((width & int) | (height & int)))
  # | (id & int) | (name & str) | (price & float) | (tags & (str *)))*
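
To build intuition for what shape inference does, here is a toy version in plain Python. It is a sketch under stated assumptions, not how aos.infer.infer_aos is implemented: the names toy_infer and _wrap are hypothetical, only dicts, lists, and scalars are handled, and the output formatting only approximates the shapes shown above.

```python
def _wrap(shape):
    """Parenthesize a shape unless it is a bare name such as int."""
    return shape if shape.isidentifier() else f"({shape})"


def toy_infer(obj):
    """Return an and-or shape string for obj (toy sketch, not aos)."""
    if isinstance(obj, dict):
        # or over keys, each key taken together (and) with its value's shape
        return " | ".join(
            f"({key} & {_wrap(toy_infer(value))})" for key, value in obj.items()
        )
    if isinstance(obj, list):
        # collect the distinct element shapes, preserving first-seen order
        distinct = list(dict.fromkeys(toy_infer(item) for item in obj))
        if len(distinct) > 1:
            distinct = [_wrap(shape) for shape in distinct]
        return f"({' | '.join(distinct)})*"
    # scalar: the shape is simply the type name
    return type(obj).__name__


print(toy_infer([1, 2, 3]))             # (int)*
print(toy_infer({'a': 3, 'b': 'hi'}))   # (a & int) | (b & str)
print(toy_infer([{'a': 3, 'b': 'hi'}] * 100))  # ((a & int) | (b & str))*
```

Because identical element shapes are collapsed, a list of 100 similar records yields the same compact shape as a list of one.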

Shape/Schema Validation

Using aos.checker.instanceof, we can

  • write aos assertions to validate data shapes (schemas).
  • validate data structure partially using placeholders: _ matches a scalar, ... matches an arbitrary object (sub-tree).
  • work with Python objects, pandas, numpy, ..., and extend to other data types (libraries).

import numpy as np
import pandas as pd
import torch

from aos.checker import instanceof

def test_pyobj():
    d = {'city': 'New York', 'country': 'USA'}
    t1 = ('Google', 2001)
    t2 = (t1, d)

    instanceof(t2, '(str | int) | (str & str)') #valid
    instanceof(t2, '... | (str & _)') #valid
    instanceof(t2, '(_ | _) | (str & int)') #error

    tlist = [('a', 1), ('b', 2)]
    instanceof(tlist, '(str | int)*') #valid

def test_pandas():
    d = {'id': 'CS2_056', 'cost': 2, 'name': 'Tap'}
    # build a one-row table whose columns are the dictionary keys
    df = pd.DataFrame([d])

    instanceof(df, '1 & (id | cost | name)')

def test_numpy():
    arr = np.array([[1, 2, 3], [4, 5, 6]])
    instanceof(arr, '2 & 3')

def test_pytorch():
    arr = torch.tensor([[1, 2, 3], [4, 5, 6]])
    instanceof(arr, '2 & 3')

Transformations with AOS

Because aos expressions can both match and specify heterogeneous data shapes, we can write aos rules to transform data.

The rules are written as lhs -> rhs, where both lhs and rhs are aos expressions:

  • lhs matches a part (sub-tree) of the input data instance I.
  • query variables in the lhs capture (bind with) parts of I.
  • rhs specifies the expected shape (aos) of the output data instance O.

To write rules, ask: which parts of I do we need to construct O?

from aos.tfm import do_tfm

def tfm_example():
    # input data
    I = {'items': [{'k': 1}, {'k': 2}, {'k': 3}],
        'names': ['A', 'B', 'C']}

    # specify transformation (left aos -> right aos)
    # using `query` variables `k` and `v`

    # here `k` binds with each of the keys in the list and 
    # `v` binds with the corresponding value
    # the `lhs` automatically ignores parts of I, which are irrelevant to O

    tfm = 'items & (k & v)* -> values & (v)*'

    O = do_tfm(I, tfm)
    print(O) # {'values': [1, 2, 3]}

The above example illustrates a simple JSON transformation using aos rules. Rules can be more complex, e.g., they can include conditions and function applications on query variables. They work not only with JSON data but also with heterogeneous nested objects.
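
For comparison, the rule items & (k & v)* -> values & (v)* from the example can be spelled out by hand in plain Python. This sketch does not use aos, and the function name values_by_hand is hypothetical; it only shows what the declarative rule saves us from writing.

```python
# same input as in the example above
I = {'items': [{'k': 1}, {'k': 2}, {'k': 3}],
     'names': ['A', 'B', 'C']}


def values_by_hand(data):
    """Hand-written equivalent of the rule (illustration only)."""
    # lhs: walk the list under 'items'; each entry contributes (key, value) pairs
    # rhs: keep only the values, collected in a list under 'values'
    values = [v for item in data['items'] for _k, v in item.items()]
    return {'values': values}


O = values_by_hand(I)
print(O)  # {'values': [1, 2, 3]}
```

Note that 'names' is never touched: like the lhs of the rule, the code ignores the parts of I that are irrelevant to O.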

See more examples here and here.

And-Or Shape Dimensions

The above examples use strings, type names (str), or integer values (2, 3) in shape expressions. A more principled approach is to first declare dimension names and then define shapes over these names.

Data is defined over two kinds of dimensions:

  • Continuous. A range of values, e.g., a numpy array of shape (5, 200) is defined over two continuous dimensions, say n and d, where n ranges over values 0-4 and d ranges over 0-199.
  • Categorical. A set of names, e.g., a dictionary {'a': 4, 'b': 5} is defined over keys (dim names) ['a', 'b']. One can also view each key, e.g., a or b, as a singleton dimension.

Programmatic API. The library provides an API to declare both types of dimensions and aos expressions over these dimensions, e.g., declare n and d as two continuous dimensions and then define the shape n & d.
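
The idea of named dimensions can be sketched with a few plain Python classes. Everything below is hypothetical and is not the aos API: the class names ContinuousDim and CategoricalDim and the helpers and_shape and extent are assumptions chosen only to illustrate declaring dimensions first and composing a shape over them afterwards.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ContinuousDim:
    """A named range of index values 0 .. size-1 (hypothetical, not aos)."""
    name: str
    size: int

    def indices(self):
        return range(self.size)


@dataclass(frozen=True)
class CategoricalDim:
    """A named set of keys (hypothetical, not aos)."""
    name: str
    keys: Tuple[str, ...]


def and_shape(*dims):
    """Compose an and-shape expression over the declared dimension names."""
    return " & ".join(dim.name for dim in dims)


def extent(*dims):
    """Concrete sizes of the continuous dimensions, e.g. an array shape."""
    return tuple(dim.size for dim in dims)


# declare the dimensions once ...
n = ContinuousDim('n', 5)      # n ranges over 0-4
d = ContinuousDim('d', 200)    # d ranges over 0-199
keys = CategoricalDim('keys', ('a', 'b'))

# ... then define shapes over their names
shape_expr = and_shape(n, d)
array_extent = extent(n, d)

print(shape_expr)    # n & d
print(array_extent)  # (5, 200)
```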

Status

The library is under active development. More documentation is coming soon.
