TAB-dataset : A tool for structuring tabular data

TAB-dataset analyzes, measures and transforms the relationships between Fields in any tabular Dataset.

The TAB-dataset tool is part of the Environmental Sensing Project.

For more information, see the user guide or the GitHub repository.

What is TAB-dataset ?

Principles

In tabular data, columns and rows are not equivalent: the columns (or fields) represent the 'semantics' of the data, and the rows represent the objects arranged according to the structure defined by the columns.

The TAB-dataset tool measures and analyzes relationships between fields via the TAB-analysis tool.

TAB-dataset uses the relationships between fields to produce an optimized JSON format (JSON-TAB format).

It also identifies data that does not respect given relationships.

Finally, it proposes transformations of the data set to respect a set of relationships.

TAB-dataset is used by ntv_pandas to identify consistency errors in a DataFrame.

Examples

Here is a price list of different foods based on packaging.

plants	quantity	product	price
fruit	1 kg	apple	1
fruit	10 kg	apple	10
fruit	1 kg	orange	2
fruit	10 kg	orange	20
vegetable	1 kg	peppers	1.5
vegetable	10 kg	peppers	15
fruit	1 kg	banana	0.5
fruit	10 kg	banana	5

In this example, we observe two kinds of relationships:

classification ("derived" relationship): between 'plants' and 'product' (each product belongs to a single plant category),
crossing ("crossed" relationship): between 'product' and 'quantity' (all the combinations of the two fields are present).
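
The two kinds of relationship can be illustrated with a few lines of plain Python. This is a minimal sketch under simplified criteria, not the TAB-analysis implementation: the function name `relation_kind` and the counting rules it uses are assumptions made for illustration.

```python
def relation_kind(field_a, field_b):
    """Classify the relationship between two fields of equal length.

    Simplified, assumed criteria (not the TAB-analysis algorithm):
    - 'derived': each value of one field maps to a single value of the other
    - 'crossed': every combination of the two fields' values is present
    - 'linked' : anything else
    """
    distinct_a = len(set(field_a))
    distinct_b = len(set(field_b))
    distinct_pairs = len(set(zip(field_a, field_b)))
    if distinct_pairs == max(distinct_a, distinct_b):
        return 'derived'
    if distinct_pairs == distinct_a * distinct_b:
        return 'crossed'
    return 'linked'


# the price list above, column by column
plants = ['fruit', 'fruit', 'fruit', 'fruit',
          'vegetable', 'vegetable', 'fruit', 'fruit']
quantity = ['1 kg', '10 kg', '1 kg', '10 kg',
            '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'banana', 'banana']

print(relation_kind(plants, product))    # derived
print(relation_kind(product, quantity))  # crossed
```

Counting distinct pairs is enough here because a pair count equal to the larger field's distinct count means that field determines the other, while a pair count equal to the product of both counts means no combination is missing.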

Each record also has a unique combination of 'product' and 'quantity', so the dataset can be converted into a matrix:

price	1 kg	10 kg
apple	1	10
orange	2	20
peppers	1.5	15
banana	0.5	5
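
The conversion can be sketched in plain Python, without TAB-dataset. The variable names below are illustrative only; the sketch relies on each ('product', 'quantity') pair occurring exactly once, otherwise later prices would overwrite earlier ones.

```python
quantity = ['1 kg', '10 kg', '1 kg', '10 kg',
            '1 kg', '10 kg', '1 kg', '10 kg']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'banana', 'banana']
price = [1, 10, 2, 20, 1.5, 15, 0.5, 5]

# one row per product, one column per quantity;
# dicts keep the order in which products are first seen
matrix = {}
for prod, qty, value in zip(product, quantity, price):
    matrix.setdefault(prod, {})[qty] = value

for prod, row in matrix.items():
    print(prod, row['1 kg'], row['10 kg'])
```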

In [1]: # creation of the `prices` object 
        from tab_dataset import Sdataset
        tabular = {'plants':   ['fruit', 'fruit','fruit',   'fruit','vegetable','vegetable','fruit',  'fruit' ],
                   'quantity': ['1 kg' , '10 kg', '1 kg',   '10 kg',  '1 kg',    '10 kg',   '1 kg',   '10 kg' ], 
                   'product':  ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'banana', 'banana'], 
                   'price':    [1,       10,      2,        20,       1.5,       15,        0.5,      5       ]}
        prices = Sdataset.ntv(tabular)

In [2]: # the `field_partition` method returns the main structure of the dataset (see TAB-analysis)
        prices.field_partition(mode='id')
Out[2]: {'primary': ['quantity', 'product'],
         'secondary': ['plants'],
         'unique': [],
         'variable': ['price']}

In [4]: # we can send the data to tools supporting the identified data structure
        prices.to_xarray()
Out[4]: <xarray.DataArray 'price' (quantity: 2, product: 4)>
        array([[1, 2, 1.5, 0.5],
               [10, 20, 15, 5]], dtype=object)
        Coordinates:
        * quantity  (quantity) object '1 kg' '10 kg'
        * product   (product)  object 'apple' 'orange' 'peppers' 'banana'
          plants    (product)  object 'fruit' 'fruit' 'vegetable' 'fruit'

In [5]: # what if an error occurs ?
        tabul_2 = {'plants':   ['fruit', 'fruit','fruit', 'fruit','vegetable','vegetable','vegetable','fruit' ],
                   'quantity': ['1 kg' , '10 kg', '1 kg',   '10 kg',  '1 kg',    '10 kg',   '1 kg',   '10 kg' ], 
                   'product':  ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'banana', 'banana'], 
                   'price':    [1,       10,      2,        20,       1.5,       15,        0.5,      5       ]}
        prices = Sdataset.ntv(tabul_2)

In [6]: # the relationship is no longer 'derived'
        prices.relation('plants', 'product').typecoupl
Out[6]: 'linked'

In [7]: # how much data prevents the relationship from being 'derived' ?
        prices.relation('plants', 'product').distomin
Out[7]: 1

In [8]: # What data needs to be corrected ?
        prices.check_relation('product', 'plants', 'derived', value=True)
Out[8]: {'row': [6, 7],
         'plants': ['vegetable', 'fruit'],
         'product': ['banana', 'banana']}
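
The idea behind this check can be reproduced in plain Python. The sketch below is an illustration, not the code behind `check_relation`: the helper `derived_violations` is a hypothetical name, and it simply reports every row whose 'product' value is paired with more than one 'plants' value.

```python
def derived_violations(child, parent):
    """Return the row indices where a `child` value has several `parent` values."""
    parents_seen = {}
    for child_value, parent_value in zip(child, parent):
        parents_seen.setdefault(child_value, set()).add(parent_value)
    return [row for row, child_value in enumerate(child)
            if len(parents_seen[child_value]) > 1]


# the erroneous data set: 'banana' is both a vegetable and a fruit
plants = ['fruit', 'fruit', 'fruit', 'fruit',
          'vegetable', 'vegetable', 'vegetable', 'fruit']
product = ['apple', 'apple', 'orange', 'orange',
           'peppers', 'peppers', 'banana', 'banana']

rows = derived_violations(product, plants)
report = {'row': rows,
          'plants': [plants[r] for r in rows],
          'product': [product[r] for r in rows]}
print(report)
```

On this data the report lists rows 6 and 7, the same rows as the output above.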

Dataset structure

To analyze the relationships between fields, a specific model is used:

each field is transformed into a list of distinct values and a list of pointers to these values
the analysis is then carried out on these lists of pointers

Example :

The field: ['john', 'anna', 'paul', 'anna', 'john', 'lisa'] is transformed into:

a first list of values: ['john', 'anna', 'paul', 'lisa']

a second list of pointers: [0, 1, 2, 1, 0, 3].

This format is found, for example, in the 'categorical' data of a pandas DataFrame.
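
The transformation described above can be written in a few lines. This is a minimal sketch with hypothetical helper names (`encode`, `decode`), not the internal code of TAB-dataset; values are numbered in order of first appearance, as in the example.

```python
def encode(field):
    """Split a field into its distinct values and a list of pointers to them."""
    values = []    # distinct values, in order of first appearance
    index = {}     # value -> position in `values`
    pointers = []  # one pointer per record
    for item in field:
        if item not in index:
            index[item] = len(values)
            values.append(item)
        pointers.append(index[item])
    return values, pointers


def decode(values, pointers):
    """Rebuild the original field from its values and pointers."""
    return [values[p] for p in pointers]


field = ['john', 'anna', 'paul', 'anna', 'john', 'lisa']
values, pointers = encode(field)
print(values)    # ['john', 'anna', 'paul', 'lisa']
print(pointers)  # [0, 1, 2, 1, 0, 3]
```

Relationships between two fields can then be studied on the pointer lists alone, whatever the type of the underlying values.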

JSON interface

TAB-dataset uses the relationships between fields to produce an optimized JSON format (JSON-TAB format).

In [9]: # the JSON length (equivalent to CSV length) is not optimized
        import json
        len(json.dumps(tabular))
Out[9]: 309

In [10]: # the JSON-TAB format is optimized
        len(json.dumps(prices.to_ntv().to_obj()))
Out[10]: 193

In [10]: prices.to_ntv().to_obj()
Out[10]: {'plants': [['fruit', 'vegetable'], 2, [0, 0, 1, 0]],
          'quantity': [['1 kg', '10 kg'], [1]],
          'product': [['apple', 'orange', 'peppers', 'banana'], [2]],
          'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5]}

In [11]: # the JSON-TAB format is reversible
         Sdataset.from_ntv(prices.to_ntv().to_obj()) == prices
Out[11]: True
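
To see why the compact form is reversible, the sketch below expands the JSON-TAB output shown above back into full columns. The decoding rules are an interpretation inferred from this example alone, not taken from the JSON-TAB specification: `[values, [coef]]` is read as "repeat each value `coef` times, then cycle", and `[values, parent, keys]` as "one key per distinct value of the field at position `parent`". The helper name `expand_crossed` is hypothetical.

```python
json_tab = {'plants': [['fruit', 'vegetable'], 2, [0, 0, 1, 0]],
            'quantity': [['1 kg', '10 kg'], [1]],
            'product': [['apple', 'orange', 'peppers', 'banana'], [2]],
            'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5]}


def expand_crossed(values, coef, length):
    """Assumed rule: each value is repeated `coef` times and the pattern cycles."""
    return [values[(row // coef) % len(values)] for row in range(length)]


length = len(json_tab['price'])  # 'price' is stored in full

q_values, (q_coef,) = json_tab['quantity']
p_values, (p_coef,) = json_tab['product']
quantity = expand_crossed(q_values, q_coef, length)
product = expand_crossed(p_values, p_coef, length)

# assumed rule: 'plants' stores one key per distinct value of its parent field
pl_values, parent, pl_keys = json_tab['plants']
parent_name = list(json_tab)[parent]  # position 2 -> 'product'
parent_values = json_tab[parent_name][0]
plants = [pl_values[pl_keys[parent_values.index(v)]] for v in product]

print(plants)
print(quantity)
print(product)
```

Under these assumptions the three expanded lists match the columns of the original `tabular` dictionary, which is consistent with the reversibility shown above.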

Uses

TAB-dataset accepts pandas DataFrames, JSON data (NTV format) and simple structures such as a list of lists or a dict of lists.

Possible uses are as follows:

control of a dataset in relation to a data model,
quality indicators of a dataset,
analysis of datasets,
error detection and correction,
generation of optimized data formats (alternative to CSV format),
interface to specific applications.

Release history

0.1.1 (this version): Jan 5, 2024
0.1.0: Nov 23, 2023
