QALITA Platform core library for common functions used in packs
Project description
QALITA Core
QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize it to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).
Key features
- Unified data access via a simple `DataSource` abstraction and factory
- File, database, and object storage loaders with streaming to Parquet
- Deterministic, size-bounded Parquet chunking with stable filenames
- Safe Parquet writing for pandas DataFrames (automatic sanitization)
- Shared aggregators for completeness, outliers, duplicates, and timeliness
- Minimal pack runtime with JSON config loading and simple asset persistence
Supported sources
- Files: CSV (`.csv`), Excel (`.xlsx`), JSON, Parquet (pass-through)
- Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
- Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via `abfs`), HDFS
Notes:
- The Folder and MongoDB classes exist as placeholders; MongoDB support is not yet implemented.
- SQLite is supported through the generic `DatabaseSource` when selected via `type: "sqlite"`.
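As an illustration of the SQLite case, a minimal `source_conf.json` might look like the sketch below. The exact keys (`name`, `type`, `config`, `database`, `table`) are an assumption for illustration; consult your pack's reference configurations for the authoritative schema.

```python
import json

# Hypothetical source_conf.json for a SQLite source (keys are assumed,
# not the authoritative schema).
source_conf = {
    "name": "my_sqlite_source",
    "type": "sqlite",
    "config": {"database": "./data/app.db", "table": "items"},
}

# Persist it where the pack expects to find it.
with open("source_conf.json", "w", encoding="utf-8") as f:
    json.dump(source_conf, f, indent=2)
```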
Installation
Prerequisites: Python 3.10–3.12 and uv.
Install dependencies and set up your environment:

```shell
pip install uv
uv sync
```

Open a uv shell when developing:

```shell
uv shell
```
Quickstart
Use within a Pack
A `Pack` loads four JSON configuration files by default (paths are overridable) and provides `load_data()` for source or target triggers.
```python
from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.worker",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")

# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({
    "key": "score",
    "value": "0.95",
    "scope": {"perimeter": "dataset", "value": "my_dataset"},
})
pack.metrics.save()          # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
```
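Conceptually, `pack.metrics.save()` dumps the accumulated metric entries to a plain `metrics.json` file. A rough stdlib equivalent is sketched below; the exact serialization used by qalita_core (indentation, ordering) is an assumption here.

```python
import json

# Metric entries in the shape shown in the quickstart above.
metrics = [
    {
        "key": "score",
        "value": "0.95",
        "scope": {"perimeter": "dataset", "value": "my_dataset"},
    },
]

# Roughly what metrics.save() does: write the list as JSON.
with open("metrics.json", "w", encoding="utf-8") as f:
    json.dump(metrics, f, indent=2)
```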
Parquet chunking and filenames
- CSV/JSON/Excel files are streamed with `chunksize` into multiple Parquet files.
- Databases are read with chunked SQL via SQLAlchemy / `pandas.read_sql`.
- Filenames use a stable pattern: `<source>_<object>_part_<k>.parquet`, where:
  - `<source>` is a slug of the source type (e.g. `file`, `sqlite`, `postgresql`).
  - `<object>` is a slug of the table name, query label, or file stem.
- Examples: `file_testdata_part_1.parquet`, `sqlite_items_part_3.parquet`, `sqlite_query_part_2.parquet`.
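The naming scheme above can be sketched as a small helper. This is a re-implementation for illustration, not the library's actual code; in particular the exact slug rules are an assumption.

```python
import re

def slugify(text: str) -> str:
    """Lower-case and collapse non-alphanumeric runs to underscores
    (an assumption about the exact slug rules)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def chunk_filename(source_type: str, obj: str, part: int) -> str:
    """Build <source>_<object>_part_<k>.parquet as described above."""
    return f"{slugify(source_type)}_{slugify(obj)}_part_{part}.parquet"

print(chunk_filename("file", "TestData", 1))  # file_testdata_part_1.parquet
```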
Configure output and size via pack_config:
- `parquet_output_dir` (default: `./parquet`)
- `chunk_rows` (default: `100000`)
- Optional `job.source.skiprows`, applied to CSV/Excel
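As a worked example of what `chunk_rows` implies: with the default of 100000 rows per chunk, a 250000-row table yields three part files. A small arithmetic sketch (not library code):

```python
import math

def expected_parts(total_rows: int, chunk_rows: int = 100_000) -> int:
    """Number of Parquet part files for a given row count,
    assuming size-bounded chunking as described above."""
    return max(1, math.ceil(total_rows / chunk_rows))

print(expected_parts(250_000))  # 3 part files
print(expected_parts(50_000))   # 1 part file
```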
Safe Parquet writing for pandas
On import, QALITA Core installs a small monkeypatch so that `DataFrame.to_parquet`:
- Ensures column names are strings
- Decodes bytes to UTF-8 strings when present
- Normalizes mixed-type object columns and categoricals
- Defaults to `engine="pyarrow"`
You can also call the sanitizer explicitly:

```python
from qalita_core import sanitize_dataframe_for_parquet

clean_df = sanitize_dataframe_for_parquet(df)
```
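The first two sanitization rules can be illustrated on a plain dict of columns. This is a stdlib-only sketch of the behavior described above, not the real implementation, which operates on pandas DataFrames and also handles mixed-type and categorical columns.

```python
def sanitize_columns(columns: dict) -> dict:
    """Approximate two of the rules above on a dict mapping
    column name -> list of values (illustrative only)."""
    clean = {}
    for name, values in columns.items():
        key = str(name)  # column names become strings
        clean[key] = [
            v.decode("utf-8") if isinstance(v, bytes) else v  # bytes -> str
            for v in values
        ]
    return clean

raw = {0: [b"abc", "def"], "n": [1, 2]}
print(sanitize_columns(raw))  # {'0': ['abc', 'def'], 'n': [1, 2]}
```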
Aggregation helpers (for packs)
Helpers centralize common result/metric aggregation logic:
```python
from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)
```
- `CompletenessAggregator`: column/dataset completeness and schema extraction
- `OutlierAggregator`: per-column and dataset outlier/normality metrics
- `DuplicateAggregator`: duplicate counts and dataset-level score using key columns
- `TimelinessAggregator`: dates/years coverage and recency scoring
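As a rough illustration of the kind of metric these aggregators produce, column completeness is the share of non-null values. The function below is a hypothetical re-implementation for explanation; the aggregators' actual APIs and exact formulas may differ.

```python
def column_completeness(values: list) -> float:
    """Fraction of non-null entries in a column
    (illustrative; not the CompletenessAggregator API)."""
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values)

print(column_completeness(["a", None, "b", "c"]))  # 0.75
```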
Development
- Tests: `uv run pytest`
- Formatting: `uv run black .`
- Linting: `uv run flake8` and `uv run pylint <module>`
- Editable install while debugging: `uv sync` followed by `uv pip install -e .`
Documentation
Additional material can be found in the online documentation: https://doc.qalita.io/.
Project details
File details
Details for the file qalita_core-1.4.0.tar.gz.
File metadata
- Download URL: qalita_core-1.4.0.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7d2b629f259efb577b6430e8c0db378434f38efb544a54fed3ccbf2309f8ab34` |
| MD5 | `0a78f00a3e8dbd4d8e23eda819c6c8f3` |
| BLAKE2b-256 | `3aef39fcd8d1ebf2c36a2c5154ab1900fc8ff6fca97927adb06d5436397f7a88` |
Provenance
The following attestation bundles were made for qalita_core-1.4.0.tar.gz:
Publisher: `ci.yml` on `qalita-io/qalita-core`
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: `qalita_core-1.4.0.tar.gz`
  - Subject digest: `7d2b629f259efb577b6430e8c0db378434f38efb544a54fed3ccbf2309f8ab34`
- Sigstore transparency entry: 821976553
- Sigstore integration time:
- Permalink: `qalita-io/qalita-core@08ab167ab3098c725995c7953cbd1eec020670cb`
- Branch / Tag: `refs/tags/1.4.0`
- Owner: https://github.com/qalita-io
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: `ci.yml@08ab167ab3098c725995c7953cbd1eec020670cb`
- Trigger Event: push
File details
Details for the file qalita_core-1.4.0-py3-none-any.whl.
File metadata
- Download URL: qalita_core-1.4.0-py3-none-any.whl
- Upload date:
- Size: 21.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `883117623cb439b1492b4750c424fa9838f430b99ac5b5d3ead34a8df5e9001f` |
| MD5 | `8d2f6c0161f9ce034a18cc854d2d13f5` |
| BLAKE2b-256 | `56326a845c49c0d63bbf52a036509ba379bd88a9e65e41862a2418a4c59056e1` |
Provenance
The following attestation bundles were made for qalita_core-1.4.0-py3-none-any.whl:
Publisher: `ci.yml` on `qalita-io/qalita-core`
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: `qalita_core-1.4.0-py3-none-any.whl`
  - Subject digest: `883117623cb439b1492b4750c424fa9838f430b99ac5b5d3ead34a8df5e9001f`
- Sigstore transparency entry: 821976609
- Sigstore integration time:
- Permalink: `qalita-io/qalita-core@08ab167ab3098c725995c7953cbd1eec020670cb`
- Branch / Tag: `refs/tags/1.4.0`
- Owner: https://github.com/qalita-io
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: `ci.yml@08ab167ab3098c725995c7953cbd1eec020670cb`
- Trigger Event: push