QALITA Platform core library of common functions used in packs

QALITA Core

QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize them to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).

Key features

  • Unified data access via a simple DataSource abstraction and factory
  • File, database, and object storage loaders with streaming to Parquet
  • Deterministic, size-bounded Parquet chunking with stable filenames
  • Safe Parquet writing for pandas DataFrames (automatic sanitization)
  • Shared aggregators for completeness, outliers, duplicates, and timeliness
  • Minimal pack runtime with JSON config loading and simple asset persistence

Supported sources

  • Files: CSV (.csv), Excel (.xlsx), JSON, Parquet (pass-through)
  • Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via abfs), HDFS

Notes:

  • Folder and MongoDB classes exist as placeholders; MongoDB is not yet implemented.
  • SQLite is supported through the generic DatabaseSource when selected via type: "sqlite".
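
To illustrate how a type field can select a loader, here is a minimal sketch of a factory. It is not the library's code: the class names, the REGISTRY mapping, and the "config" key are hypothetical, and only the type strings (file, postgresql, sqlite) and the MongoDB placeholder status come from this README.

```python
from dataclasses import dataclass


@dataclass
class FileSketch:
    """Hypothetical stand-in for a file-backed source."""
    config: dict


@dataclass
class DatabaseSketch:
    """Hypothetical stand-in for a database-backed source."""
    config: dict


# Type strings appear in this README; the registry itself is an assumption.
REGISTRY = {
    "file": FileSketch,
    "postgresql": DatabaseSketch,
    "sqlite": DatabaseSketch,  # routed to the generic database loader
}


def make_source(source_conf: dict):
    """Return a loader object for source_conf["type"], or fail loudly."""
    kind = str(source_conf.get("type", "")).lower()
    if kind == "mongodb":
        raise NotImplementedError("MongoDB is a placeholder and is not implemented")
    if kind not in REGISTRY:
        raise ValueError(f"unsupported source type: {kind!r}")
    # "config" is an assumed key holding connection or path details.
    return REGISTRY[kind](config=source_conf.get("config", {}))
```

Failing early on an unknown or unimplemented type keeps configuration mistakes from surfacing later as confusing read errors.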

Installation

Prerequisites: Python 3.10–3.12 and uv.

Install dependencies and set up your environment:

pip install uv
uv sync

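Activate the project's virtual environment when developing (uv sync creates it in .venv by default), or prefix commands with uv run:

source .venv/bin/activate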

Quickstart

Use within a Pack

Pack loads four JSON files by default (each path can be overridden) and provides load_data(), which is called with either "source" or "target".

from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.worker",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")
# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({"key": "score", "value": "0.95", "scope": {"perimeter": "dataset", "value": "my_dataset"}})
pack.metrics.save()       # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
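
The save() calls above follow a simple pattern: an in-memory list is appended to, then written to a JSON file in one step. The sketch below illustrates that pattern only. JsonAsset is a hypothetical name, not a class from qalita_core, and the real implementation may differ in file location and formatting.

```python
import json
from pathlib import Path


class JsonAsset:
    """Hypothetical in-memory list that can be written to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = []

    def save(self):
        # Serialize the whole list at once and overwrite any previous file.
        self.path.write_text(json.dumps(self.data, indent=2), encoding="utf-8")
        return self.path


metrics = JsonAsset("metrics.json")
metrics.data.append(
    {"key": "score", "value": "0.95", "scope": {"perimeter": "dataset", "value": "my_dataset"}}
)
```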

Parquet chunking and filenames

  • CSV/JSON/Excel are streamed with chunksize into multiple Parquet files.
  • Databases are read with chunked SQL via SQLAlchemy/pandas.read_sql.
  • Filenames use a stable pattern: <source>_<object>_part_<k>.parquet where:
    • <source> is a slug of the source type (e.g. file, sqlite, postgresql).
    • <object> is a slug of the table name, query label, or file stem.
    • Example: file_testdata_part_1.parquet, sqlite_items_part_3.parquet, sqlite_query_part_2.parquet.
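
The naming pattern and row-bounded chunking described above can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: the slug rule (lowercase, non-alphanumerics collapsed to underscores) and the 1-based part index are guesses consistent with the example filenames, and slug, chunk_filename, and iter_chunks are hypothetical helper names.

```python
import re
from itertools import islice


def slug(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics to '_' (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def chunk_filename(source_type: str, object_name: str, part: int) -> str:
    """Build <source>_<object>_part_<k>.parquet for a given part index."""
    return f"{slug(source_type)}_{slug(object_name)}_part_{part}.parquet"


def iter_chunks(rows, chunk_rows: int):
    """Yield (part, rows_in_part) with at most chunk_rows rows per part."""
    iterator = iter(rows)
    part = 1  # assumed 1-based, matching the example filenames
    while True:
        block = list(islice(iterator, chunk_rows))
        if not block:
            return
        yield part, block
        part += 1


# 250 rows with chunk_rows=100 give three parts of 100, 100, and 50 rows.
names = [chunk_filename("file", "testdata", part) for part, _ in iter_chunks(range(250), 100)]
```

Because the part index depends only on row order and chunk_rows, rerunning the same load with the same settings produces the same filenames.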

Configure output location and chunk size under the job key of pack_config:

  • parquet_output_dir (default: ./parquet)
  • chunk_rows (default: 100000)
  • source.skiprows (optional), applied to CSV/Excel
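
As a sketch of how these settings and their defaults might be resolved from a loaded pack configuration: resolve_job_settings is a hypothetical helper, not part of qalita_core, and the validation of chunk_rows is an added assumption. The key names and default values are the ones listed above.

```python
DEFAULTS = {"parquet_output_dir": "./parquet", "chunk_rows": 100_000}


def resolve_job_settings(pack_config: dict) -> dict:
    """Merge pack_config["job"] with the documented defaults."""
    job = pack_config.get("job") or {}
    settings = {key: job.get(key, default) for key, default in DEFAULTS.items()}
    settings["chunk_rows"] = int(settings["chunk_rows"])
    if settings["chunk_rows"] <= 0:
        raise ValueError("chunk_rows must be a positive integer")
    # skiprows is optional and only meaningful for CSV/Excel sources.
    settings["skiprows"] = (job.get("source") or {}).get("skiprows")
    return settings


example = resolve_job_settings({"job": {"chunk_rows": 50_000, "source": {"skiprows": 2}}})
```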

Safe Parquet writing for pandas

On import, QALITA Core installs a small monkeypatch so DataFrame.to_parquet:

  • Ensures column names are strings
  • Decodes bytes to UTF‑8 strings when present
  • Normalizes mixed-type object columns and categoricals
  • Defaults to engine="pyarrow"

You can also call the sanitizer explicitly:

from qalita_core import sanitize_dataframe_for_parquet
clean_df = sanitize_dataframe_for_parquet(df)
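
To make the normalization rules concrete without depending on pandas, the sketch below applies them to a plain mapping of column name to values. It is illustrative only: sanitize_columns is a hypothetical function, and the choice to stringify mixed-type columns is one plausible reading of "normalizes", not a statement of what sanitize_dataframe_for_parquet does internally.

```python
def sanitize_columns(columns: dict) -> dict:
    """Apply the listed rules to a mapping of column name -> list of values."""
    clean = {}
    for name, values in columns.items():
        # Bytes become UTF-8 strings; undecodable bytes are replaced.
        decoded = [
            v.decode("utf-8", errors="replace") if isinstance(v, bytes) else v
            for v in values
        ]
        # A column mixing several types is stringified; None stays missing.
        kinds = {type(v) for v in decoded if v is not None}
        if len(kinds) > 1:
            decoded = [None if v is None else str(v) for v in decoded]
        # Column names are always strings.
        clean[str(name)] = decoded
    return clean


cleaned = sanitize_columns({0: [b"a", "b"], "n": [1, "x", None], "ok": [1, 2]})
```

Columnar formats such as Parquet expect one type per column, which is why mixed-type columns need a single representation before writing.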

Aggregation helpers (for packs)

Helpers centralize common result/metric aggregation logic:

from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)

  • CompletenessAggregator: column/dataset completeness and schema extraction
  • OutlierAggregator: per-column and dataset outlier/normality metrics
  • DuplicateAggregator: duplicate counts and dataset-level score using key columns
  • TimelinessAggregator: dates/years coverage and recency scoring
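
Because data is processed in chunks, an aggregator has to accumulate counts across chunks before computing a score. The sketch below shows that idea for completeness. ChunkedCompleteness is a hypothetical class written for illustration; it does not reproduce the interface or scoring formula of CompletenessAggregator.

```python
from collections import Counter


class ChunkedCompleteness:
    """Illustrative accumulator; not the library's CompletenessAggregator."""

    def __init__(self):
        self.filled = Counter()  # non-missing cells per column
        self.total = Counter()   # all cells per column

    def add_chunk(self, rows):
        """rows: iterable of dicts mapping column name -> value (None = missing)."""
        for row in rows:
            for column, value in row.items():
                self.total[column] += 1
                if value is not None:
                    self.filled[column] += 1

    def column_scores(self):
        """Share of non-missing cells per column, over all chunks seen so far."""
        return {c: self.filled[c] / self.total[c] for c in self.total}

    def dataset_score(self):
        """Share of non-missing cells over the whole dataset."""
        cells = sum(self.total.values())
        return sum(self.filled.values()) / cells if cells else 0.0


agg = ChunkedCompleteness()
agg.add_chunk([{"a": 1, "b": None}, {"a": 2, "b": 3}])
agg.add_chunk([{"a": None, "b": 4}, {"a": 5, "b": None}])
```

Summing counts rather than averaging per-chunk ratios keeps the result correct when chunks have different sizes.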

Development

  • Tests: uv run pytest
  • Formatting: uv run black .
  • Linting: uv run flake8 and uv run pylint <module>
  • Editable install while debugging:

uv sync
uv pip install -e .

Documentation

Additional material can be found in the online documentation: https://doc.qalita.io/.
