QALITA Platform core library of common functions used in packs

QALITA Core

QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize them to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).

Key features

  • Unified data access via a simple DataSource abstraction and factory
  • File, database, and object storage loaders with streaming to Parquet
  • Deterministic, size-bounded Parquet chunking with stable filenames
  • Safe Parquet writing for pandas DataFrames (automatic sanitization)
  • Shared aggregators for completeness, outliers, duplicates, and timeliness
  • Minimal pack runtime with JSON config loading and simple asset persistence

Supported sources

  • Files: CSV (.csv), Excel (.xlsx), JSON, Parquet (pass-through)
  • Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via abfs), HDFS

Notes:

  • Folder and MongoDB classes exist as placeholders; MongoDB is not yet implemented.
  • SQLite is supported through the generic DatabaseSource when selected via type: "sqlite".
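
The "abstraction and factory" idea can be pictured with a small, self-contained sketch. It is not the qalita_core implementation: SketchSource, make_source, and the registry below are hypothetical names, and only the "type" field is taken from the note above. It shows how several SQL dialects, SQLite included, could be routed to one generic database loader.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SketchSource:
    """Hypothetical stand-in for a data source; not a qalita_core class."""
    kind: str
    config: dict


def _file_source(config: dict) -> SketchSource:
    return SketchSource("file", config)


def _database_source(config: dict) -> SketchSource:
    # One generic loader serves every SQL dialect, SQLite included.
    return SketchSource("database", config)


# Hypothetical registry: maps the "type" value to a builder function.
_REGISTRY: Dict[str, Callable[[dict], SketchSource]] = {
    "file": _file_source,
    "postgresql": _database_source,
    "mysql": _database_source,
    "sqlite": _database_source,
}


def make_source(source_conf: dict) -> SketchSource:
    """Pick a builder from the "type" field of a source configuration."""
    source_type = str(source_conf.get("type", "")).lower()
    try:
        builder = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"unsupported source type: {source_type!r}") from None
    return builder(source_conf)
```

With this sketch, make_source({"type": "sqlite"}) and make_source({"type": "postgresql"}) both return a source of kind "database", while an unknown type raises ValueError.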

Installation

Prerequisites: Python 3.10–3.12 and uv.

Install uv, then sync the project dependencies into a virtual environment:

pip install uv
uv sync

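When developing, activate the virtual environment that uv sync creates (.venv by default, shown here for Linux/macOS), or prefix commands with uv run:

source .venv/bin/activate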

Quickstart

Use within a Pack

Pack loads four JSON files by default (the paths can be overridden through configs) and provides load_data(), which is called with either "source" or "target" as the trigger.

from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.worker",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")
# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({
    "key": "score",
    "value": "0.95",
    "scope": {"perimeter": "dataset", "value": "my_dataset"},
})
pack.metrics.save()       # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
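
To make the "simple asset persistence" pattern concrete, here is a minimal sketch of an object that holds a list of records in data and writes them to a JSON file on save(). JsonAsset is a hypothetical stand-in written for illustration; the real pack.metrics, pack.recommendations, and pack.schemas objects may behave differently.

```python
import json
from pathlib import Path


class JsonAsset:
    """Hypothetical stand-in: a list of records saved to one JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = []

    def save(self):
        # Create the parent folder if needed, then overwrite the file.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("w", encoding="utf-8") as handle:
            json.dump(self.data, handle, indent=2)
        return self.path
```

Usage mirrors the quickstart: create JsonAsset("metrics.json"), append dictionaries to its data list, then call save().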

Parquet chunking and filenames

  • CSV, JSON, and Excel files are read in chunks (chunksize) and written as multiple Parquet files.
  • Databases are read with chunked SQL via SQLAlchemy/pandas.read_sql.
  • Filenames use a stable pattern: <source>_<object>_part_<k>.parquet where:
    • <source> is a slug of the source type (e.g. file, sqlite, postgresql).
    • <object> is a slug of the table name, query label, or file stem.
    • Example: file_testdata_part_1.parquet, sqlite_items_part_3.parquet, sqlite_query_part_2.parquet.
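
The naming pattern above can be reproduced with a short sketch. The slug rule (lowercase, non-alphanumeric runs replaced by "_") and the assumption that part numbers start at 1 are guesses based on the examples; slugify and chunk_filename are hypothetical helpers, not functions exported by qalita_core.

```python
import re


def slugify(text: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with "_", trim the ends."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return slug or "unnamed"


def chunk_filename(source_type: str, object_name: str, part: int) -> str:
    """Build <source>_<object>_part_<k>.parquet for a 1-based part number."""
    if part < 1:
        raise ValueError("part numbers start at 1")
    return f"{slugify(source_type)}_{slugify(object_name)}_part_{part}.parquet"
```

Because the name depends only on the source type, the object name, and the part number, re-running the same load produces the same filenames.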

Configure the output directory and chunk size under pack_config["job"]:

  • parquet_output_dir (default: ./parquet)
  • chunk_rows (default: 100000)
  • job.source.skiprows (optional): applied when reading CSV/Excel
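
As a rough illustration of how these settings bound chunk sizes, the sketch below merges the documented defaults with a job section and splits any iterable of rows into lists of at most chunk_rows items. It only shows the chunk boundaries and does not write Parquet; job_settings and iter_chunks are hypothetical helpers, not part of qalita_core.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Defaults as documented above.
DEFAULT_JOB = {"parquet_output_dir": "./parquet", "chunk_rows": 100_000}


def job_settings(pack_config: dict) -> dict:
    """Merge the documented defaults with whatever pack_config["job"] sets."""
    return {**DEFAULT_JOB, **(pack_config.get("job") or {})}


def iter_chunks(rows: Iterable, chunk_rows: int) -> Iterator[List]:
    """Yield consecutive lists of at most chunk_rows items, in input order."""
    if chunk_rows < 1:
        raise ValueError("chunk_rows must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_rows))
        if not chunk:
            return
        yield chunk
```

With chunk_rows set to 2, five input rows produce three chunks of 2, 2, and 1 rows, always in the same order, which is what makes the part numbers stable across runs.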

Safe Parquet writing for pandas

On import, QALITA Core installs a small monkeypatch so DataFrame.to_parquet:

  • Ensures column names are strings
  • Decodes bytes to UTF‑8 strings when present
  • Normalizes mixed-type object columns and categoricals
  • Defaults to engine="pyarrow"

You can also call the sanitizer explicitly:

from qalita_core import sanitize_dataframe_for_parquet
clean_df = sanitize_dataframe_for_parquet(df)
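
For readers who want to see what such sanitization involves, here is a minimal sketch of three of the rules listed above (string column names, bytes decoded as UTF-8, mixed-type object columns stored as text). It is an assumption-laden illustration built on pandas only: sanitize_sketch is a hypothetical function, it skips categoricals, and the actual sanitize_dataframe_for_parquet may apply different rules.

```python
import pandas as pd


def _to_text(value):
    """Decode bytes as UTF-8; leave every other value untouched."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def sanitize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: not the qalita_core implementation."""
    out = df.copy()
    # Parquet expects string column names.
    out.columns = [str(c) for c in out.columns]
    for col in out.columns:
        if out[col].dtype == object:
            decoded = out[col].map(_to_text)
            kinds = {type(v) for v in decoded.dropna()}
            if len(kinds) > 1:
                # Mixed types: store non-missing values as text.
                decoded = decoded.where(decoded.isna(), decoded.astype(str))
            out[col] = decoded
    return out
```

The input frame is copied, so the caller's DataFrame is left unchanged; numeric columns pass through as they are.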

Aggregation helpers (for packs)

Helpers centralize common result/metric aggregation logic:

from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)

  • CompletenessAggregator: column/dataset completeness and schema extraction
  • OutlierAggregator: per-column and dataset outlier/normality metrics
  • DuplicateAggregator: duplicate counts and dataset-level score using key columns
  • TimelinessAggregator: dates/years coverage and recency scoring
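
The kind of logic these helpers centralize can be sketched without the library. The two functions below accumulate results over several chunks of rows: a per-column completeness ratio and a duplicate score based on key columns. The names and the formulas (non-missing share, and 1 minus the share of repeated keys) are assumptions made for illustration; they are not the API or the scoring used by CompletenessAggregator or DuplicateAggregator.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def completeness_by_column(chunks: Iterable[List[dict]]) -> Dict[str, float]:
    """Share of non-missing values per column, accumulated over all chunks."""
    filled: Counter = Counter()
    columns = set()
    total = 0
    for chunk in chunks:
        for row in chunk:
            total += 1
            for column, value in row.items():
                columns.add(column)
                if value is not None:
                    filled[column] += 1
    if total == 0:
        return {}
    return {column: filled[column] / total for column in sorted(columns)}


def duplicate_score(chunks: Iterable[List[dict]], key_columns: Sequence[str]) -> float:
    """1.0 means no repeated key; lower values mean more duplicate rows."""
    seen: Counter = Counter()
    for chunk in chunks:
        for row in chunk:
            seen[tuple(row.get(c) for c in key_columns)] += 1
    total = sum(seen.values())
    if total == 0:
        return 1.0
    duplicates = total - len(seen)  # rows beyond the first of each key
    return 1 - duplicates / total
```

Counting across chunks instead of per chunk matters here: a key that appears once in two different Parquet parts is still a duplicate at the dataset level.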

Development

  • Tests: uv run pytest
  • Formatting: uv run black .
  • Linting: uv run flake8 and uv run pylint <module>
  • Editable install while debugging:
uv sync
uv pip install -e .

Documentation

Additional material can be found in the online documentation: https://doc.qalita.io/.
