QALITA Platform Core: library of common functions used in packs
Project description
QALITA Platform Core
QALITA Core is a lightweight helper library used by QALITA packs to load data from multiple sources, materialize them to Parquet in deterministic chunks, and share common utilities (sanitization and aggregation helpers).
Key features
- Unified data access via a simple `DataSource` abstraction and factory
- File, database, and object storage loaders with streaming to Parquet
- Deterministic, size-bounded Parquet chunking with stable filenames
- Safe Parquet writing for pandas DataFrames (automatic sanitization)
- Shared aggregators for completeness, outliers, duplicates, and timeliness
- Minimal pack runtime with JSON config loading and simple asset persistence
Supported sources
- Files: CSV (`.csv`), Excel (`.xlsx`), JSON, Parquet (pass-through)
- Databases: PostgreSQL, MySQL, Oracle, MS SQL Server, SQLite
- Object storage: Amazon S3, Google Cloud Storage, Azure Blob (via `abfs`), HDFS
Notes:
- The Folder and MongoDB classes exist as placeholders; MongoDB is not yet implemented.
- SQLite is supported through the generic `DatabaseSource` when selected via `type: "sqlite"`.
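For illustration only, here is a minimal sketch of what a `source_conf.json` might contain. Apart from the `type: "sqlite"` selector mentioned above, the key names (`name`, `config`, `path`, `table`) are assumptions, not a documented schema; check your pack's existing configs for the real layout.

```python
import json

# Hypothetical source definitions: only "type" is confirmed by the notes above,
# the remaining keys are placeholders for illustration.
file_source = {
    "name": "my_dataset",
    "type": "file",
    "config": {"path": "./data/testdata.csv"},  # assumed key
}

sqlite_source = {
    "name": "items_db",
    "type": "sqlite",  # routed to the generic DatabaseSource
    "config": {"path": "./data/items.db", "table": "items"},  # assumed keys
}

with open("source_conf.json", "w") as f:
    json.dump(file_source, f, indent=2)
```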
Installation
Prerequisites: Python 3.10–3.12 and Poetry.
Install dependencies and set your environment:
```bash
poetry env use python3.12 && \
poetry install && \
pip install --user poetry-plugin-export && \
poetry export -f requirements.txt --output requirements.txt --without-hashes && \
pip install -r requirements.txt
```
Open a Poetry shell when developing:
poetry shell
Quickstart
Use within a Pack
A `Pack` loads four JSON configuration files by default (paths are overridable) and provides `load_data()` for either the source or the target trigger.
```python
from qalita_core.pack import Pack

pack = Pack(configs={
    "pack_conf": "./pack_conf.json",
    "source_conf": "./source_conf.json",
    "target_conf": "./target_conf.json",
    "agent_file": "~/.qalita/.agent",
})

# Ensure chunking/output are set (can be in pack_conf["job"] too)
pack.pack_config.setdefault("job", {})
pack.pack_config["job"]["parquet_output_dir"] = "./parquet"
pack.pack_config["job"]["chunk_rows"] = 100_000

# Load source
source_paths = pack.load_data("source")

# Load target (optional)
target_paths = pack.load_data("target")

# Persist custom metrics/recommendations/schemas to JSON files
pack.metrics.data.append({"key": "score", "value": "0.95", "scope": {"perimeter": "dataset", "value": "my_dataset"}})
pack.metrics.save()          # writes metrics.json
pack.recommendations.save()  # writes recommendations.json
pack.schemas.save()          # writes schemas.json
```
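Once the chunks are written, they can be read back with plain pandas. This is only a usage sketch and assumes, as the variable name above suggests, that `load_data()` returns a list of Parquet file paths.

```python
import pandas as pd

# source_paths comes from pack.load_data("source") above and is assumed to be
# a list of Parquet chunk paths such as ./parquet/file_testdata_part_1.parquet.
frames = [pd.read_parquet(path) for path in source_paths]
df = pd.concat(frames, ignore_index=True)

print(f"Loaded {len(df)} rows from {len(source_paths)} chunk(s)")
```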
Parquet chunking and filenames
- CSV/JSON/Excel are streamed with `chunksize` into multiple Parquet files.
- Databases are read with chunked SQL via SQLAlchemy / `pandas.read_sql`.
- Filenames use a stable pattern: `<source>_<object>_part_<k>.parquet`, where:
  - `<source>` is a slug of the source type (e.g. `file`, `sqlite`, `postgresql`)
  - `<object>` is a slug of the table name, query label, or file stem
- Example: `file_testdata_part_1.parquet`, `sqlite_items_part_3.parquet`, `sqlite_query_part_2.parquet`.
Configure output location and chunk size via `pack_config`:
- `parquet_output_dir` (default: `./parquet`)
- `chunk_rows` (default: `100000`)
- Optional `job.source.skiprows`, applied to CSV/Excel
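To make the naming concrete, below is a small sketch of how such chunk names can be derived. Only the `<source>_<object>_part_<k>.parquet` pattern comes from the description above; the exact slug rules used by QALITA Core are an assumption.

```python
import re
from pathlib import Path

def _slug(value: str) -> str:
    """Assumed slug rule: lowercase and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")

def chunk_path(output_dir: str, source_type: str, obj: str, part: int) -> Path:
    """Build <source>_<object>_part_<k>.parquet inside the output directory."""
    return Path(output_dir) / f"{_slug(source_type)}_{_slug(obj)}_part_{part}.parquet"

# e.g. parquet/file_testdata_part_1.parquet for the first chunk of testdata.csv
print(chunk_path("./parquet", "file", Path("testdata.csv").stem, 1))
```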
Safe Parquet writing for pandas
On import, QALITA Core installs a small monkeypatch so DataFrame.to_parquet:
- Ensures column names are strings
- Decodes bytes to UTF‑8 strings when present
- Normalizes mixed-type object columns and categoricals
- Defaults to `engine="pyarrow"`
You can also call the sanitizer explicitly:
```python
from qalita_core import sanitize_dataframe_for_parquet

clean_df = sanitize_dataframe_for_parquet(df)
```
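As a quick illustration (the column names and values below are made up), a DataFrame that would normally trip pyarrow can be cleaned before writing:

```python
import pandas as pd
from qalita_core import sanitize_dataframe_for_parquet

# Deliberately awkward frame: a non-string column name, raw bytes,
# and a mixed-type object column.
df = pd.DataFrame({
    0: [b"caf\xc3\xa9", b"tea"],  # bytes that should become UTF-8 strings
    "mixed": [1, "two"],          # mixed-type object column
})

clean_df = sanitize_dataframe_for_parquet(df)
clean_df.to_parquet("clean.parquet")  # patched to_parquet defaults to engine="pyarrow"
```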
Aggregation helpers (for packs)
Helpers centralize common result/metric aggregation logic:
```python
from qalita_core import (
    detect_chunked_from_items,
    normalize_and_dedupe_recommendations,
    CompletenessAggregator,
    OutlierAggregator,
    DuplicateAggregator,
    TimelinessAggregator,
)
```
- `CompletenessAggregator`: column/dataset completeness and schema extraction
- `OutlierAggregator`: per-column and dataset outlier/normality metrics
- `DuplicateAggregator`: duplicate counts and dataset-level score using key columns
- `TimelinessAggregator`: dates/years coverage and recency scoring
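The aggregator APIs themselves are not documented here; as a rough idea of the kind of per-column metrics they centralize, a plain-pandas completeness and duplicate computation (not the QALITA Core API) looks like this:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, None], "city": ["Paris", None, "Lyon", "Lyon"]})

# Column-level completeness: share of non-null values per column.
completeness = df.notna().mean().round(2).to_dict()

# Duplicate count over a set of key columns (here just "id").
duplicates = int(df.duplicated(subset=["id"], keep="first").sum())

print(completeness)  # {'id': 0.75, 'city': 0.75}
print(duplicates)    # 1
```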
Development
- Tests: `poetry run pytest`
- Formatting: `poetry run black .`
- Linting: `poetry run flake8` and `poetry run pylint <module>`
- Editable install while debugging: `poetry shell`, then `pip install --editable .`
Documentation
Additional material can be found in the online documentation: https://doc.qalita.io/.
Project details
Download files
- Source Distribution: qalita_core-1.2.4.tar.gz
- Built Distribution: qalita_core-1.2.4-py3-none-any.whl
File details
Details for the file qalita_core-1.2.4.tar.gz.
File metadata
- Download URL: qalita_core-1.2.4.tar.gz
- Upload date:
- Size: 20.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.13.7 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `30adeaedaec4386d17454a49be3c74caaf3a28d99f622e46fe7973c2618151bb` |
| MD5 | `1dedb0074aa0af5e4cd51a38bbc0a410` |
| BLAKE2b-256 | `7d34bf1e68cbc405e267eddfe5f88ca6a8c6c5d8fb94df7339caae60092440b2` |
File details
Details for the file qalita_core-1.2.4-py3-none-any.whl.
File metadata
- Download URL: qalita_core-1.2.4-py3-none-any.whl
- Upload date:
- Size: 21.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.13.7 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `22a53f3df8029f13bdd8c1b285cb294150c0de83e4a6d3011af9002d8759e60b` |
| MD5 | `49f0659ff42fed21711a16c9a274cdae` |
| BLAKE2b-256 | `406d2020598b2f543241e54452d2e39f17c11749d4a1dba1d715660924e5cd54` |