QALITA Platform core library of common functions used in packs

QALITA Core

QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize them to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).

Key features

  • Unified data access via a simple DataSource abstraction and factory
  • File, database, and object storage loaders with streaming to Parquet
  • Deterministic, size-bounded Parquet chunking with stable filenames
  • Safe Parquet writing for pandas DataFrames (automatic sanitization)
  • Shared aggregators for completeness, outliers, duplicates, and timeliness
  • Minimal pack runtime with JSON config loading and simple asset persistence

Supported sources

  • Files: CSV (.csv), Excel (.xlsx), JSON, Parquet (pass-through)
  • Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via abfs), HDFS

Notes:

  • Folder and MongoDB classes exist as placeholders; MongoDB is not yet implemented.
  • SQLite is supported through the generic DatabaseSource when selected via type: "sqlite".

Installation

Prerequisites: Python 3.10–3.12 and uv.

Install uv, then sync the project dependencies into a virtual environment:

pip install uv
uv sync

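Activate the project virtual environment when developing (uv creates it in .venv by default), or prefix commands with uv run:

source .venv/bin/activate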

Quickstart

Use within a Pack

Pack loads four JSON files by default (each path can be overridden) and provides load_data(), which loads either the source or the target depending on the trigger you pass.

from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.worker",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")
# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({"key": "score", "value": "0.95", "scope": {"perimeter": "dataset", "value": "my_dataset"}})
pack.metrics.save()       # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
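
The metric appended above is a plain dictionary, so a small builder can keep records consistent. This is a minimal sketch, assuming only the record shape shown in the quickstart; make_metric, the "column" perimeter, and the "completeness"/"price" values are hypothetical names used for illustration, not part of qalita_core.

```python
import json

def make_metric(key, value, perimeter, scope_value):
    """Build one metric record in the shape used in the quickstart (hypothetical helper)."""
    return {
        "key": key,
        "value": str(value),  # the quickstart stores the value as a string
        "scope": {"perimeter": perimeter, "value": scope_value},
    }

records = [
    make_metric("score", 0.95, "dataset", "my_dataset"),
    # Hypothetical column-level metric, for illustration only
    make_metric("completeness", 0.8, "column", "price"),
]

# Inspect the records before appending them to pack.metrics.data
print(json.dumps(records, indent=2))
```

Each record could then be appended to pack.metrics.data before calling pack.metrics.save().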

Parquet chunking and filenames

  • CSV, JSON, and Excel files are streamed in chunks (chunksize) into multiple Parquet files.
  • Databases are read with chunked SQL queries via SQLAlchemy and pandas.read_sql.
  • Filenames use a stable pattern: <source>_<object>_part_<k>.parquet where:
    • <source> is a slug of the source type (e.g. file, sqlite, postgresql).
    • <object> is a slug of the table name, query label, or file stem.
    • Example: file_testdata_part_1.parquet, sqlite_items_part_3.parquet, sqlite_query_part_2.parquet.
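
The naming pattern above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: slugify, chunk_filename, and chunk_filenames are hypothetical helpers, the slug rule (lowercase, non-alphanumerics collapsed to underscores) is a guess, and part numbers are assumed to start at 1 as in the examples.

```python
import math
import re

def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs to underscores (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def chunk_filename(source_type: str, object_name: str, part: int) -> str:
    """Build <source>_<object>_part_<k>.parquet for one chunk."""
    return f"{slugify(source_type)}_{slugify(object_name)}_part_{part}.parquet"

def chunk_filenames(source_type, object_name, total_rows, chunk_rows=100_000):
    """List the filenames produced when total_rows are split into chunks of chunk_rows."""
    parts = math.ceil(total_rows / chunk_rows)
    return [chunk_filename(source_type, object_name, k) for k in range(1, parts + 1)]

for name in chunk_filenames("sqlite", "items", total_rows=250_000):
    print(name)
```

Because the name depends only on the source type, the object name, and the chunk index, re-running the same load with the same chunk_rows yields the same filenames.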

Configure output and size via pack_config:

  • parquet_output_dir (default: ./parquet)
  • chunk_rows (default: 100000)
  • Optional job.source.skiprows applied to CSV/Excel
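
As a sketch of how these settings and their defaults fit together, the following resolves them from a pack_config dictionary. It assumes that parquet_output_dir and chunk_rows live under the "job" key (as in the quickstart) and that skiprows lives under job.source; resolve_job_settings is a hypothetical helper, not a qalita_core function.

```python
DEFAULT_OUTPUT_DIR = "./parquet"
DEFAULT_CHUNK_ROWS = 100_000

def resolve_job_settings(pack_config: dict) -> dict:
    """Return the effective chunking settings, falling back to the documented defaults."""
    job = pack_config.get("job") or {}
    source = job.get("source") or {}
    return {
        "parquet_output_dir": job.get("parquet_output_dir", DEFAULT_OUTPUT_DIR),
        "chunk_rows": int(job.get("chunk_rows", DEFAULT_CHUNK_ROWS)),
        "skiprows": source.get("skiprows"),  # None means no rows are skipped
    }

# Hypothetical configuration: smaller chunks, skip two leading rows of a CSV/Excel file
settings = resolve_job_settings({"job": {"chunk_rows": 50_000, "source": {"skiprows": 2}}})
print(settings)
```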

Safe Parquet writing for pandas

On import, QALITA Core monkeypatches DataFrame.to_parquet so that it:

  • Ensures column names are strings
  • Decodes bytes to UTF‑8 strings when present
  • Normalizes mixed-type object columns and categoricals
  • Defaults to engine="pyarrow"

You can also call the sanitizer explicitly:

from qalita_core import sanitize_dataframe_for_parquet
clean_df = sanitize_dataframe_for_parquet(df)
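
To make the sanitization rules concrete, here is a rough approximation written with plain pandas. It is not the library's code: sanitize_sketch is a hypothetical function, and converting mixed-type object columns to strings is an assumption about what "normalizes" means. Categorical handling is omitted.

```python
import pandas as pd

def sanitize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate the documented rules: string column names, decoded bytes, uniform object columns."""
    out = df.copy()
    out.columns = [str(c) for c in out.columns]  # column names must be strings
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        # Decode bytes values to UTF-8 text, leaving other values untouched
        col = col.map(
            lambda v: v.decode("utf-8", errors="replace")
            if isinstance(v, (bytes, bytearray)) else v
        )
        # Assumption: a column holding several Python types is stringified, nulls are kept
        if len({type(v) for v in col.dropna()}) > 1:
            col = col.where(col.isna(), col.astype(str))
        out[name] = col
    return out

raw = pd.DataFrame({0: [b"abc", "def"], "mixed": [1, "x"]})
clean = sanitize_sketch(raw)
print(clean.columns.tolist(), clean.to_dict("list"))
```

The sketch assumes column names stay unique after conversion to strings; a frame with both 1 and "1" as column names would need extra handling.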

Aggregation helpers (for packs)

Helpers centralize common result/metric aggregation logic:

from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)

  • CompletenessAggregator: column/dataset completeness and schema extraction
  • OutlierAggregator: per-column and dataset outlier/normality metrics
  • DuplicateAggregator: duplicate counts and dataset-level score using key columns
  • TimelinessAggregator: dates/years coverage and recency scoring
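
The aggregators' own interfaces are not described here, so the following only illustrates the underlying idea for completeness: per-chunk counts are summed before ratios are computed, so that chunks of different sizes are weighted correctly. aggregate_completeness is a hypothetical function, and taking the dataset score as the mean of the column ratios is an assumption.

```python
from collections import Counter

def aggregate_completeness(chunk_stats):
    """Combine per-chunk stats into column ratios and a dataset score.

    chunk_stats: iterable of (row_count, {column: non_null_count}) pairs, one per chunk.
    """
    total_rows = 0
    non_null = Counter()
    for rows, counts in chunk_stats:
        total_rows += rows
        non_null.update(counts)  # sums the counts column by column
    if total_rows == 0:
        return {}, None
    per_column = {col: n / total_rows for col, n in non_null.items()}
    # Assumption: dataset completeness is the unweighted mean of column completeness
    dataset = sum(per_column.values()) / len(per_column) if per_column else None
    return per_column, dataset

# Two chunks of different sizes with hypothetical columns "id" and "price"
chunks = [(100, {"id": 100, "price": 90}), (50, {"id": 50, "price": 30})]
per_column, dataset = aggregate_completeness(chunks)
print(per_column, dataset)
```

Averaging the two per-chunk ratios for "price" (0.9 and 0.6) would give 0.75, whereas summing the counts first gives 120 non-null values out of 150 rows.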

Development

  • Tests: uv run pytest
  • Formatting: uv run black .
  • Linting: uv run flake8 and uv run pylint <module>
  • Editable install while debugging:
uv sync
uv pip install -e .

Documentation

Additional material can be found in the online documentation: https://doc.qalita.io/.
