
atollas 🪼

A column-level typed runner for pandas workflows.

⚠️ Warning

This is still very pre-alpha and experimental. Have fun experimenting with it, but it isn't ready for production workflows just yet.

Motivation

Pandas is an amazing tool for interactive data analysis - but that's a little at odds with reliability in a production setting. Atollas is designed to be a type-checking container for pandas workflows, allowing greater reliability in production.

Let's work through this series of operations and consider things that could possibly throw run-time errors:

import pandas as pd

spot_the_fail = (
  pd.read_csv("some_file.csv")  # <- file doesn't exist (obviously)
  .groupby(["col_1", "col_2"]) # <- columns aren't in dataframe
  .agg({"col_3": "sum"}) # <- column not in dataframe, or not of summable type
  .merge(
    pd.read_parquet("another_file.parquet"),  # <- file doesn't exist
    on="col_1",  # <- col_1 not of matching type
  )
  [["col_1", "col_2", "col_4"]]  # <- col_4 not present in joined dataframe
                                 # <- col_2 also present in joined dataframe
                                 #    so now col_2_x and col_2_y exist but not
                                 #    col_2
)
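The last failure mode is easy to miss, so here is a minimal standalone sketch of it (with made-up two-row frames): when both sides of a merge share a non-key column, pandas silently renames it with `_x`/`_y` suffixes, and a later selection of the original name raises `KeyError`.

```python
import pandas as pd

left = pd.DataFrame({"col_1": [1, 2], "col_2": ["a", "b"]})
right = pd.DataFrame({"col_1": [1, 2], "col_2": ["x", "y"], "col_4": [10, 20]})

merged = left.merge(right, on="col_1")

# col_2 appeared on both sides, so it no longer exists under that name:
print(sorted(merged.columns))  # -> ['col_1', 'col_2_x', 'col_2_y', 'col_4']
```

Selecting `merged[["col_1", "col_2", "col_4"]]` at this point raises `KeyError` - exactly the run-time surprise the annotations above are pointing at.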

That's also not factoring in things that might confuse business logic or otherwise cause problems: for instance, if col_1 is meant to be unique across both dataframes but has duplicates in one or both, records will be duplicated.
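To make that concrete, a tiny sketch (with illustrative data) of how a duplicate join key silently fans out rows - no error is raised, the output just has more records than the input:

```python
import pandas as pd

left = pd.DataFrame({"col_1": [1, 2], "col_2": ["a", "b"]})
# col_1 == 1 appears twice on the right, violating our implicit
# assumption that the join key is unique:
right = pd.DataFrame({"col_1": [1, 1, 2], "col_4": [10, 11, 20]})

merged = left.merge(right, on="col_1")
print(len(left), len(merged))  # 2 rows in, 3 rows out - silently
```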

Essentially, all these errors are problems with the data not matching our implicit assumptions. This is a natural consequence of the fact that pandas is designed for exploration and experimentation, and not column level type safety.

Atollas is a column-level type-safe runner for pandas. The above code using Atollas would look like this:

import atollas as atl
from atollas import unique, integer, string, double, datetime_tz
from atollas.aggregations import Sum

spot_the_fail = (
  atl.read_csv(
    "some_file.csv",
    schema={"col_1": unique(integer), "col_2": string, "col_3": double},
  )  # <- specified schema is incorrect
  .aggregate(
    by=["col_1", "col_2"],
    col_3=Sum("col_3"),
  )
  .merge(
    atl.read_parquet(
       "another_file.parquet",
       schema={"col_1": unique(integer), "col_4": datetime_tz("UTC")},
    ),  # <- specified schema is incorrect
    cardinality="one-to-one",
    on="col_1",
  )
  .filter_columns(["col_1", "col_2", "col_3"])
)

We have to bear some additional boilerplate, but our code is now a lot more predictable, huzzah! Note that IO operations are now the only ones that can fail indeterminately. This is the core idea behind atollas: if something runs once, it should run every time, only throwing errors at IO boundaries.

This is a massive benefit for production workflows, because all errors are reduced to something like:

atollas.read_csv raised TypeError:
  file "some_file.csv" did not match specified type,
  "col_1" is meant to be unique, but has duplicate records

Which means we can track down issues a lot faster.

Atollas monitors types at read time, and additionally raises errors when operations are incorrect for the specified schema. For instance, when merging with one-to-one cardinality, both sides of the merge must be unique (i.e. a unique(integer) type rather than simply integer). This helps you catch errors you didn't even think of at development time, neato-burrito!
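A hypothetical sketch of what such a cardinality guard might look like (this is not the real atollas implementation): refuse the merge up front unless the join key is unique on both sides.

```python
import pandas as pd

def merge_one_to_one(left, right, on):
    """Hypothetical sketch: enforce one-to-one cardinality before merging."""
    for side, df in (("left", left), ("right", right)):
        if df[on].duplicated().any():
            raise TypeError(
                f"one-to-one merge requires {on!r} to be unique on the {side} side"
            )
    return left.merge(right, on=on)
```

Plain pandas offers a similar runtime check via `merge(..., validate="one_to_one")`, which raises MergeError on violation - atollas's pitch is to make that kind of check a required part of the schema rather than an opt-in flag.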

Comparison to Pandera

The most similar project (at least that I know about) is pandera. Pandera is a really awesome, and much more developed, project than atollas. It's probably the most established tool for column-level schema enforcement of dataframes, but its aims are a little different from Atollas's.

Pandera gives you decorators and functions to check schemas at various points, but your code is still essentially pandas code, whilst atollas tries to provide a wrapper for additional column-level type safety.
