QALITA Platform Core library for common functions used in packs

QALITA Core

QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize them to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).

Key features

  • Unified data access via a simple DataSource abstraction and factory
  • File, database, and object storage loaders with streaming to Parquet
  • Deterministic, size-bounded Parquet chunking with stable filenames
  • Safe Parquet writing for pandas DataFrames (automatic sanitization)
  • Shared aggregators for completeness, outliers, duplicates, and timeliness
  • Minimal pack runtime with JSON config loading and simple asset persistence

Supported sources

  • Files: CSV (.csv), Excel (.xlsx), JSON, Parquet (pass-through)
  • Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via abfs), HDFS

Notes:

  • Folder and MongoDB classes exist as placeholders; MongoDB support is not yet implemented.
  • SQLite is supported through the generic DatabaseSource when selected via type: "sqlite"; see the configuration sketch below.
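
Connection details live in the JSON configuration files loaded by a pack (see the quickstart below). As a purely illustrative sketch, a SQLite source configuration might look like the following; the type value comes from the note above, while the other keys are hypothetical placeholders:

{
  "type": "sqlite",
  "name": "my_database",
  "config": { "path": "./data/app.db" }
}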

Installation

Prerequisites: Python 3.10–3.12 and uv.

Install dependencies and set your environment:

pip install uv
uv sync

Activate the project virtual environment when developing (uv sync creates .venv):

source .venv/bin/activate

Quickstart

Use within a Pack

By default, Pack loads four JSON configuration files (paths are overridable) and provides load_data() for either the source or the target trigger.

from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.worker",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")
# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({"key": "score", "value": "0.95", "scope": {"perimeter": "dataset", "value": "my_dataset"}})
pack.metrics.save()       # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
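
load_data() returns the paths of the materialized Parquet chunks, so a pack can process them one at a time and keep memory bounded. A minimal sketch, assuming the return value is an iterable of file paths:

import pandas as pd

# Each chunk file is size-bounded, so per-chunk processing stays cheap.
for path in source_paths:
    df = pd.read_parquet(path)
    # ... compute per-chunk metrics here, then aggregate across chunks ...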

Parquet chunking and filenames

  • CSV/JSON/Excel files are streamed with a chunksize into multiple Parquet files.
  • Databases are read with chunked SQL via SQLAlchemy/pandas.read_sql.
  • Filenames use a stable pattern: <source>_<object>_part_<k>.parquet where:
    • <source> is a slug of the source type (e.g. file, sqlite, postgresql).
    • <object> is a slug of the table name, query label, or file stem.
    • Example: file_testdata_part_1.parquet, sqlite_items_part_3.parquet, sqlite_query_part_2.parquet.
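
Because the pattern is stable, chunk files can be grouped and ordered from their names alone. The helper below is not part of the library; it is a sketch of how a pack might reassemble per-object chunk sequences:

import re
from collections import defaultdict
from pathlib import Path

# Matches <source>_<object>_part_<k>.parquet; captures the prefix and part number.
PART_RE = re.compile(r"^(?P<prefix>.+)_part_(?P<k>\d+)\.parquet$")

def group_chunks(parquet_dir):
    groups = defaultdict(list)
    for p in Path(parquet_dir).glob("*.parquet"):
        m = PART_RE.match(p.name)
        if m:
            groups[m.group("prefix")].append((int(m.group("k")), p))
    # Sort each group by part number so chunks are read in order.
    return {prefix: [p for _, p in sorted(parts)] for prefix, parts in groups.items()}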

Configure the output directory and chunk size via pack_config; a JSON example follows the list:

  • parquet_output_dir (default: ./parquet)
  • chunk_rows (default: 100000)
  • Optional job.source.skiprows applied to CSV/Excel
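
For reference, the same options expressed directly in pack_conf.json (the skiprows value is just an example):

{
  "job": {
    "parquet_output_dir": "./parquet",
    "chunk_rows": 100000,
    "source": { "skiprows": 1 }
  }
}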

Safe Parquet writing for pandas

On import, QALITA Core installs a small monkeypatch so that DataFrame.to_parquet:

  • Ensures column names are strings
  • Decodes bytes to UTF‑8 strings when present
  • Normalizes mixed-type object columns and categoricals
  • Defaults to engine="pyarrow"
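
As an illustration of the behaviors above, a frame that would normally trip up a Parquet writer can be saved directly once qalita_core has been imported (a sketch; the data is arbitrary):

import pandas as pd
import qalita_core  # importing installs the to_parquet patch

df = pd.DataFrame({
    0: [b"caf\xc3\xa9", b"data"],  # non-string column name, bytes values
    "mixed": [1, "two"],           # mixed-type object column
})
df.to_parquet("clean.parquet")  # names stringified, bytes decoded, pyarrow engine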

You can also call the sanitizer explicitly:

from qalita_core import sanitize_dataframe_for_parquet
clean_df = sanitize_dataframe_for_parquet(df)
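
Calling the sanitizer explicitly is useful when you want to inspect or reuse the normalized frame before writing, rather than relying on the patch at write time.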

Aggregation helpers (for packs)

Helpers centralize common result/metric aggregation logic:

from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)

  • CompletenessAggregator: column/dataset completeness and schema extraction
  • OutlierAggregator: per-column and dataset outlier/normality metrics
  • DuplicateAggregator: duplicate counts and dataset-level score using key columns
  • TimelinessAggregator: dates/years coverage and recency scoring
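
To make the aggregation concrete, the snippet below hand-rolls the kind of per-column completeness these helpers centralize; it is illustrative only and does not use CompletenessAggregator's actual API:

import pandas as pd

def column_completeness(df):
    # Fraction of non-null values per column; 1.0 means fully populated.
    return df.notna().mean().round(4).to_dict()

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})
print(column_completeness(df))  # {'a': 0.6667, 'b': 0.6667}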

Development

  • Tests: uv run pytest
  • Formatting: uv run black .
  • Linting: uv run flake8 and uv run pylint <module>
  • Editable install while debugging:
    uv sync
    uv pip install -e .

Documentation

Additional material can be found in the online documentation: https://doc.qalita.io/.
