
A Python package for validating expectations of text data and safely reporting exceptions


Are You My Data?


This Python package takes on the challenges of transmitting data in text format.

  1. Provides a simple and expressive framework to define a data set
  2. Generates explicit documentation on the dataset for communication with others
  3. Validates files against the definition, providing detailed messages about violations without exposing any information about the actual data in the file

This package provides all of this with the don't repeat yourself (DRY) principle at its core.

  • the code that defines the data, documents the data
  • the code that defines the data, validates the data

Example

For this example, we'll pretend that Alice needs Bob to send her data.

Alice will start by defining a layout. In the dataset, she wants a Text column, a Choice column, and an Integer column. She defines the layout, then generates a digest of it to share with Bob, using the code below.

from rumydata import Layout
from rumydata.cell import Text, Choice, Integer
layout = Layout(definition={
    'col1': Text(8),
    'col2': Choice(['x', 'y', 'z'], nullable=True),
    'col3': Integer(1)
})
print(layout.markdown_digest())

As you can see in the digest output below, there is a great deal of explicit detail. This is to the benefit of Bob, who needs to extract data from his source systems and conform it to Alice's expectations.

This demonstrates a key concept of this package: the code that defines the data, documents the data. This makes Alice's job easier, and it also helps to prevent the miscommunication and misunderstanding that occur when Alice documents the expectation separately from the actual code.

- **col1**
   - Type: String
   - Max Length: 8 characters
   - cannot be empty/blank
   - must be no more than 8 characters
 - **col2**
   - Type: Choice
   - Choices: x,y,z
   - must be one of ['x', 'y', 'z']
   - Nullable
 - **col3**
   - Type: Numeric
   - Format: 9
   - Max Length: 1 digits
   - cannot be empty/blank
   - can be coerced into an integer value
   - cannot have a leading zero digit
   - must have no more than 1 digits after removing other characters
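
The idea that a single definition can drive both the documentation and the validation can be sketched without rumydata at all. The snippet below is a minimal illustration, not the package's implementation: `Rule`, `max_length`, `not_empty`, `digest`, and `violations` are hypothetical names invented for this example. Each rule pairs a human-readable description with a check, so the digest and the error messages come from the same source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One expectation: a description for humans plus the check itself."""
    description: str
    check: Callable[[str], bool]


def max_length(n: int) -> Rule:
    # Hypothetical helper: the value may have at most n characters
    return Rule(f"must be no more than {n} characters", lambda v: len(v) <= n)


def not_empty() -> Rule:
    # Hypothetical helper: the value may not be blank
    return Rule("cannot be empty/blank", lambda v: v != "")


def digest(name: str, rules: List[Rule]) -> str:
    """Build documentation from the same rules used for validation."""
    lines = [f"- **{name}**"] + [f"  - {r.description}" for r in rules]
    return "\n".join(lines)


def violations(value: str, rules: List[Rule]) -> List[str]:
    """Return the descriptions of failed rules, never the value itself."""
    return [r.description for r in rules if not r.check(value)]


col1_rules = [not_empty(), max_length(8)]
print(digest("col1", col1_rules))
print(violations("far too long a value", col1_rules))
```

Because `violations` returns only rule descriptions, its output can be shared without revealing the offending value.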

In our example, Alice sends the documentation to Bob, who then performs an extract of the data from his system. Bob thinks he's followed the documentation exactly as described, but he's actually made a mistake.

col1,col2,col3
abc,x,-1
def,,0
ghi,a,1

Bob sends the data to Alice, who then validates it using her layout. Another key concept of this package is demonstrated in this step: the code that defines the data, validates the data.

from rumydata import Layout
from rumydata.cell import Text, Choice, Integer
layout = Layout(definition={
    'col1': Text(8),
    'col2': Choice(['x', 'y', 'z'], nullable=True),
    'col3': Integer(1)
})
layout.check_file('bobs_data.csv')

When Alice checks the file for validity, she receives the following message:

AssertionError: 
 - File: None
   - Row: 4
     - Cell: 4,2 (col2)
       - InvalidChoice: must be one of ['x', 'y', 'z']

The layout has detected that the second value of the fourth row (counting the header as row 1) does not meet the defined expectations, and it has provided a detailed message explaining what was expected. It is important to note that this error message does not describe the value that was provided; it only describes what was expected, and where in the data that expectation was violated. This is an intentional design choice, as it lets Alice freely communicate with Bob about the issues in the data, with little risk of exposing the data itself.
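
Since the message is safe to share, Alice may want to capture it as text rather than let the exception stop her script. The sketch below assumes only what the example above shows, namely that the check raises an `AssertionError` whose message is the report. `report_for_sender` and `fake_check` are hypothetical names, and `fake_check` is a stand-in so the snippet runs without rumydata installed.

```python
from typing import Callable, Optional


def report_for_sender(check: Callable[[str], None], path: str) -> Optional[str]:
    """Run a file check; return the violation report, or None if the file passes."""
    try:
        check(path)
    except AssertionError as err:
        # The message describes expectations and locations, not the data
        return str(err)
    return None


def fake_check(path: str) -> None:
    # Hypothetical stand-in for layout.check_file, for demonstration only
    raise AssertionError(
        "\n - File: None\n   - Row: 4\n     - Cell: 4,2 (col2)"
        "\n       - InvalidChoice: must be one of ['x', 'y', 'z']"
    )


print(report_for_sender(fake_check, "bobs_data.csv"))
```

With rumydata, Alice would presumably pass `layout.check_file` in place of `fake_check`.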

Alice sends the message to Bob, and with it he's able to easily see that the value he provided was not one of the valid choices. He can also refer back to the definition digest, see that col2 is nullable, and send a blank value instead of the invalid value that he sent.
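
One way Bob might apply that fix is sketched below using only the standard library. It assumes comma-delimited data and that blanking an invalid choice is acceptable for his use case. `blank_invalid_choices` and `VALID` are hypothetical names for this example.

```python
import csv
import io

# Allowed choices for col2, taken from Alice's digest
VALID = {"x", "y", "z"}


def blank_invalid_choices(text: str, column: str = "col2") -> str:
    """Return the CSV text with any invalid choice in `column` replaced by a blank."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] not in VALID:
            row[column] = ""  # nullable column, so a blank is allowed
        writer.writerow(row)
    return out.getvalue()


original = "col1,col2,col3\nabc,x,-1\ndef,,0\nghi,a,1\n"
print(blank_invalid_choices(original))
```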

Errors

Add error raising content.

