
QALITA Platform Core: core library of common functions used in packs


QALITA Platform Core

QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize it as Parquet in deterministic chunks, and share common utilities such as sanitization and aggregation helpers.

Key features

  • Unified data access via a simple DataSource abstraction and factory
  • File, database, and object storage loaders with streaming to Parquet
  • Deterministic, size-bounded Parquet chunking with stable filenames
  • Safe Parquet writing for pandas DataFrames (automatic sanitization)
  • Shared aggregators for completeness, outliers, duplicates, and timeliness
  • Minimal pack runtime with JSON config loading and simple asset persistence

Supported sources

  • Files: CSV (.csv), Excel (.xlsx), JSON, Parquet (pass-through)
  • Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via abfs), HDFS

Notes:

  • Folder and MongoDB source classes exist as placeholders; MongoDB is not yet implemented.
  • SQLite is supported through the generic DatabaseSource when selected via type: "sqlite" (see the sketch below).
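For example, a minimal source_conf.json for SQLite might look like the following. Only the type key is documented above; the connection fields shown here are illustrative assumptions, not the library's actual schema:

{
  "type": "sqlite",
  "config": {
    "database": "./data/app.db"
  }
}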

Installation

Prerequisites: Python 3.10–3.12 and Poetry.

Install the dependencies and set up your environment:

poetry env use python3.12 && \
poetry install && \
pip install --user poetry-plugin-export && \
poetry export -f requirements.txt --output requirements.txt --without-hashes && \
pip install -r requirements.txt

Open a Poetry shell when developing:

poetry shell

Quickstart

Use within a Pack

By default, Pack loads four JSON configuration files (paths are overridable) and provides load_data() for either the source or the target trigger.

from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.agent",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")
# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({
    "key": "score",
    "value": "0.95",
    "scope": {"perimeter": "dataset", "value": "my_dataset"},
})
pack.metrics.save()          # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
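Given the append above, metrics.json ends up as a JSON array of metric entries. A sketch of the expected output (the exact serialization may differ):

[
  {
    "key": "score",
    "value": "0.95",
    "scope": {"perimeter": "dataset", "value": "my_dataset"}
  }
]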

Parquet chunking and filenames

  • CSV/JSON/Excel inputs are streamed with a chunksize into multiple Parquet files.
  • Databases are read with chunked SQL via SQLAlchemy/pandas.read_sql.
  • Filenames use a stable pattern: <source>_<object>_part_<k>.parquet where:
    • <source> is a slug of the source type (e.g. file, sqlite, postgresql).
    • <object> is a slug of the table name, query label, or file stem.
    • Example: file_testdata_part_1.parquet, sqlite_items_part_3.parquet, sqlite_query_part_2.parquet.

Configure output and size via pack_config:

  • parquet_output_dir (default: ./parquet)
  • chunk_rows (default: 100000)
  • Optional job.source.skiprows applied to CSV/Excel
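Because the filenames follow the stable pattern above, a pack can collect and recombine the chunks for one object with plain pandas. A minimal sketch, assuming the default ./parquet output directory and the file_testdata example from earlier:

import glob
import pandas as pd

# Sort numerically on the part index: plain lexicographic order would put
# "part_10" before "part_2".
paths = sorted(
    glob.glob("./parquet/file_testdata_part_*.parquet"),
    key=lambda p: int(p.rsplit("_part_", 1)[1].removesuffix(".parquet")),
)
df = pd.concat(map(pd.read_parquet, paths), ignore_index=True)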

Safe Parquet writing for pandas

On import, QALITA Core installs a small monkeypatch so DataFrame.to_parquet:

  • Ensures column names are strings
  • Decodes bytes to UTF-8 strings when present
  • Normalizes mixed-type object columns and categoricals
  • Defaults to engine="pyarrow"

You can also call the sanitizer explicitly:

from qalita_core import sanitize_dataframe_for_parquet

clean_df = sanitize_dataframe_for_parquet(df)  # df: any pandas DataFrame
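As a concrete illustration, a frame with a bytes column name, bytes values, and a mixed-type object column would normally trip pyarrow; with the patch installed at import time, a plain to_parquet call should go through (a sketch; the output path is arbitrary):

import pandas as pd
import qalita_core  # importing installs the DataFrame.to_parquet patch

df = pd.DataFrame({
    b"raw_name": [b"caf\xc3\xa9", "plain"],  # bytes column name, bytes/str values
    "mixed": [1, "two"],                     # mixed-type object column
})
df.to_parquet("sanitized.parquet")  # sanitized automatically; engine defaults to pyarrow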

Aggregation helpers (for packs)

Helpers centralize common result/metric aggregation logic:

from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)

  • CompletenessAggregator: column/dataset completeness and schema extraction
  • OutlierAggregator: per-column and dataset outlier/normality metrics
  • DuplicateAggregator: duplicate counts and dataset-level score using key columns
  • TimelinessAggregator: dates/years coverage and recency scoring
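The aggregator call signatures are not shown in this README, so the method names below (add_chunk, finalize) are purely hypothetical; the sketch only illustrates the intended pattern of feeding per-chunk results into a shared aggregator:

import pandas as pd
from qalita_core import CompletenessAggregator

agg = CompletenessAggregator()  # hypothetical usage; check the source for the real API
for path in source_paths:       # Parquet chunk paths from pack.load_data("source")
    chunk = pd.read_parquet(path)
    agg.add_chunk(chunk)        # hypothetical method name
metrics = agg.finalize()        # hypothetical method name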

Development

  • Tests: poetry run pytest
  • Formatting: poetry run black .
  • Linting: poetry run flake8 and poetry run pylint <module>
  • Editable install while debugging:
poetry shell
pip install --editable .

Documentation

Additional material can be found in the online documentation: https://doc.qalita.io/.
