AI-Powered Canonical Mapping Engine — map messy source data into trusted canonical models.
Project description
CanonIQ
AI-Powered Canonical Mapping Engine — map messy source data into trusted canonical models.
CanonIQ profiles a source dataset, loads a versioned canonical schema, and proposes scored, explained source→canonical field mappings. It then generates validation rules, transforms data into the canonical shape, and detects schema drift when a new ingestion arrives.
CanonIQ is domain-agnostic and source-agnostic. Higher education is one bundled example — not the product. The same engine maps retail catalogs, healthcare patient records, financial transactions, and logistics shipments against their respective industry standards.
Local-first by default. CanonIQ does not send your source data, schemas, sample values, or mappings to any external service. There is no telemetry. Network access only ever happens if you explicitly configure an optional external AI adapter.
Why CanonIQ
Onboarding external data means reconciling someone else's column names, types, and conventions with your own canonical model. That work is repetitive, error-prone, and hard to audit. CanonIQ turns it into a deterministic, explainable pipeline:
- Profile any source (CSV / JSON / JSONL today) — inferred types, patterns, PII flags.
- Suggest mappings with a confidence score and human-readable reasons per field.
- Gate suggestions into
auto_approved/requires_review/low_confidence. - Validate with generated rules including format checksums (IBAN, GTIN, NPI, LEI…).
- Transform to the canonical output, coercing types and dropping unmapped fields.
- Detect drift when a later ingestion renames, retypes, adds, or removes fields.
Everything is typed (Pydantic v2), tested, and runs offline.
Install
# From PyPI
pip install canoniq
# From source (for development, or to run the bundled demos)
git clone https://github.com/Buchiexplores/canoniq.git
cd canoniq
pip install -e .
The bundled
demodatasets ship inside the package, socanoniq demo <domain>works straight from apip install— no clone required.
Optional extras
CanonIQ keeps the core dependency footprint small. Enterprise connectors are currently
placeholders (they raise a clear NotImplementedError naming the target version and required
extra) but the dependency groups are wired so future releases install cleanly.
| Extra | Install | Adds |
|---|---|---|
files |
pip install "canoniq[files]" |
Parquet, Excel readers (planned) |
databases |
pip install "canoniq[databases]" |
Postgres, MySQL, SQLite (planned) |
bigquery |
pip install "canoniq[bigquery]" |
BigQuery (planned) |
snowflake |
pip install "canoniq[snowflake]" |
Snowflake (planned) |
aws |
pip install "canoniq[aws]" |
S3, Redshift (planned) |
gcp |
pip install "canoniq[gcp]" |
GCS, BigQuery (planned) |
azure |
pip install "canoniq[azure]" |
Azure Blob (planned) |
ai |
pip install "canoniq[ai]" |
Local semantic matching adapter (sentence-transformers, off by default) |
all |
pip install "canoniq[all]" |
Everything above |
dev |
pip install "canoniq[dev]" |
pytest, mypy, ruff, coverage |
Quickstart (CLI)
# Run the full pipeline end-to-end against a bundled example
canoniq demo higher-ed
canoniq demo retail
canoniq demo healthcare
canoniq demo finance
canoniq demo logistics
# Profile a file directly...
canoniq profile --source examples/higher_ed/source_students.csv --out profile.json
# ...or via a source-config (secrets resolved from ${ENV})
canoniq profile --source-config examples/sources/local_csv_students.yml --out profile.json
# Suggest → rules → apply → drift
canoniq suggest --profile profile.json --canonical examples/higher_ed/canonical_student.yml --out suggestions.json
canoniq rules --suggestions suggestions.json --canonical examples/higher_ed/canonical_student.yml --out rules.yml
canoniq apply --source examples/higher_ed/source_students.csv --mapping suggestions.json \
--canonical examples/higher_ed/canonical_student.yml --out canonical.csv --include-review
canoniq drift-check --source examples/higher_ed/new_source_students.csv --mapping suggestions.json \
--canonical examples/higher_ed/canonical_student.yml --out drift.json
# Auto-onboard a provider (config-driven) and score deployment readiness
canoniq onboard --config examples/retail_vendor_onboarding/onboarding_configs/brightmart_distribution.yml
canoniq onboard-batch --config-dir examples/retail_vendor_onboarding/onboarding_configs \
--combined-out examples/retail_vendor_onboarding/output/combined_readiness.json
Auto-onboarding (config-driven)
Beyond the per-file pipeline, CanonIQ can auto-onboard whole providers from a YAML config: profile every source, map it onto your canonical models, validate, drift-check, and emit a single deployment-readiness score with a clear next action — no deployment happens, you get a verdict plus the canonical artifacts.
A provider is whatever supplies data in your domain (a school, a vendor, a hospital, a SaaS tenant). Two complete examples ship — same engine, different industry: higher education and retail vendors.
from canoniq.onboarding import onboard_provider
report = onboard_provider("path/to/provider.yml")
if report.auto_deploy_allowed:
deploy(report) # your deploy step
else:
notify_reviewer(report) # route to a human; see report.next_action
See the domain-neutral Auto-Onboarding Guide to build a pipeline for any field.
Quickstart (SDK)
from canoniq import CanonIQ
engine = CanonIQ() # local-first defaults
profile = engine.profile_source("examples/retail/source_products.csv")
mapping = engine.suggest_mappings(profile, "examples/retail/canonical_product.yml")
for m in mapping.mappings:
print(f"{m.source_field:>16} -> {m.canonical_field or '(none)':<16} "
f"{m.confidence:.2f} {m.status} {', '.join(m.reasons)}")
rules = engine.generate_validation_rules(mapping, "examples/retail/canonical_product.yml", profile)
result = engine.apply_mapping(
"examples/retail/source_products.csv", mapping,
"examples/retail/canonical_product.yml", include_review=True,
)
report = engine.detect_drift(
"examples/retail/new_source_products.csv", mapping,
"examples/retail/canonical_product.yml",
)
print(report.status) # "ok" or "drift_detected"
Configuration
Every tunable lives in one YAML file — thresholds, scoring weights, sampling, PII
masking, and the optional AI adapter. Pass it to any CLI command with --config, or
load it in the SDK. Nothing is required; unspecified keys fall back to local-first
defaults. A fully commented example lives at
examples/config/canoniq.yml.
canoniq demo retail --config examples/config/canoniq.yml
canoniq suggest --profile profile.json --canonical canonical.yml --config examples/config/canoniq.yml
from canoniq import CanonIQ
from canoniq.config import CanonIQConfig
engine = CanonIQ(CanonIQConfig.from_yaml("examples/config/canoniq.yml"))
What AI model powers the mapping?
By default, none — and that's intentional. Core matching is a deterministic ensemble of five signals (alias, name, type, pattern, range) with weighted, explained confidence scores. It runs fully offline with zero network calls — the local-first guarantee.
An optional sixth "semantic" signal is pluggable and off by default. Choose a provider declaratively — no code. Three embedding backends ship today:
| Provider | Runs | Egress | Default model | API key |
|---|---|---|---|---|
sentence-transformers (aliases sbert, local) |
locally, on-device | none | all-MiniLM-L6-v2 |
— |
openai |
OpenAI API | field names only | text-embedding-3-small |
OPENAI_API_KEY |
gemini (aliases google) |
Gemini API | field names only | text-embedding-004 |
GEMINI_API_KEY |
# canoniq.yml — local, zero egress
ai:
provider: sentence-transformers
model: all-MiniLM-L6-v2 # any model version; omit to use the provider default
weight: 0.15 # semantic contribution (auto-applied when enabled)
# canoniq.yml — hosted (opt-in; sends field names to the provider)
ai:
provider: openai # or: gemini
model: text-embedding-3-large
api_key_env: OPENAI_API_KEY # env var name only — keys are never stored in config
weight: 0.15
# local adapter needs the extra; hosted adapters need only an API key (stdlib HTTP)
pip install "canoniq[ai]" # only for sentence-transformers
export OPENAI_API_KEY=sk-...
canoniq suggest --profile profile.json --canonical canonical.yml --config canoniq.yml
Privacy guarantees for hosted providers: only source field names and canonical schema text are sent — never sample values (so masked PII/PHI never leaves). Keys come from an environment variable, never the config file. Offline or missing key → clear error; the deterministic pipeline always works without any adapter.
Anthropic / Claude has no first-party text-embeddings API, so it can't power the embedding signal — configuring it fails fast with guidance. Claude is intended for a future optional LLM reasoning stage (resolving the
requires_reviewband), not embeddings.
Plug in your own adapter (any provider, or a private model) by implementing
BaseAIMatcher and registering it:
from canoniq.ai import BaseAIMatcher, register_ai_provider
class MyMatcher(BaseAIMatcher):
def semantic_score(self, source_field, canonical_field) -> float:
... # return a similarity in [0, 1]
register_ai_provider("my-matcher", lambda cfg: MyMatcher())
# then set ai.provider: my-matcher in your config
Use cases
CanonIQ fits anywhere you repeatedly ingest external data into a trusted model:
- Higher education — map a new SIS/LMS export onto your OneRoster/Ed-Fi/CEDS student model.
- Retail — normalize supplier catalogs to a GS1 GTIN / schema.org product model.
- Healthcare — align partner feeds to an HL7 FHIR R4
Patientmodel with PHI masking. - Finance — reconcile bank/payment files to an ISO 20022 transaction model with IBAN checks.
- Logistics — unify carrier feeds to a GS1 SSCC / SCAC shipment model.
- SaaS onboarding — turn every customer's CSV upload into your internal schema automatically.
- AI agent platforms — give an agent a deterministic, explainable tool for schema mapping.
See the per-domain walkthroughs in docs/ and the runnable sample use-cases in
examples/ (python examples/<domain>/demo.py).
Documentation
New here? Start with the Education & Onboarding Guide — a plain-English walkthrough of what CanonIQ is, why it's built this way, how to demo it, and how to extend it.
| Doc | What it covers |
|---|---|
| docs/education/ | Plain-English guide: approach, architecture, onboarding, demos, use cases |
| docs/quickstart.md | Install, first pipeline, CLI + SDK |
| docs/demos.md | STAR walkthroughs of all 5 domain demos: the problem each solves + real output |
| docs/onboarding.md | Config-driven auto-onboarding: readiness scoring, build-your-own, enterprise adoption |
| docs/concepts.md | Profiles, schemas, scoring, gating, drift |
| docs/architecture.md | Module layout and data flow |
| docs/connectors.md | How to add a connector |
| docs/sources.md | Source-config format, secrets, sampling |
| docs/domain_packs.md | How to add a new domain |
| docs/standards_mapping.md | Canonical fields ↔ industry standards |
| docs/roadmap.md | Release plan |
Per-domain use cases: higher-ed · retail · healthcare · finance · logistics
Privacy & security
- Local-first: no source data, schemas, or mappings leave your machine in the core package.
- No telemetry in the MVP. No external API calls from the core.
- High-PII/PHI sample values are masked by default before they leave the profiler.
- All bundled example data is synthetic. No secrets live in the repo — source configs reference
${ENV}variables only.
Report vulnerabilities per SECURITY.md.
Development
pip install -e ".[dev]"
pytest --cov=canoniq --cov-report=term-missing # ≥80% coverage, zero network calls
ruff check canoniq
mypy canoniq
See CONTRIBUTING.md and CODE_OF_CONDUCT.md.
License
Apache-2.0 © The CanonIQ contributors.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file canoniq-0.3.1.tar.gz.
File metadata
- Download URL: canoniq-0.3.1.tar.gz
- Upload date:
- Size: 98.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0b8ce1c608143631e902c8fe83f47910535d379c8178174ea572846b3ff0fa11
|
|
| MD5 |
dcb81fa31f1114b415523d8358604c71
|
|
| BLAKE2b-256 |
b7a4a0440982c6e22584da1d2a9217d14ab86ee018c20c0aa4f1e4a9d6d6a1a0
|
Provenance
The following attestation bundles were made for canoniq-0.3.1.tar.gz:
Publisher:
release.yml on Buchiexplores/canoniq
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
canoniq-0.3.1.tar.gz -
Subject digest:
0b8ce1c608143631e902c8fe83f47910535d379c8178174ea572846b3ff0fa11 - Sigstore transparency entry: 1693721111
- Sigstore integration time:
-
Permalink:
Buchiexplores/canoniq@9760e417d2aa244bd564bf19d266f08f37afd279 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/Buchiexplores
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@9760e417d2aa244bd564bf19d266f08f37afd279 -
Trigger Event:
push
-
Statement type:
File details
Details for the file canoniq-0.3.1-py3-none-any.whl.
File metadata
- Download URL: canoniq-0.3.1-py3-none-any.whl
- Upload date:
- Size: 113.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8f6e1f6166595c508710cdf77aa597ab87c577be376a57f14cb677b6cfccb1d1
|
|
| MD5 |
41dacba45d94263676f3dc689c575876
|
|
| BLAKE2b-256 |
17928cf91cbf6f7d796348fb8c510835bed43477fb75b318082e8a5465c6f94d
|
Provenance
The following attestation bundles were made for canoniq-0.3.1-py3-none-any.whl:
Publisher:
release.yml on Buchiexplores/canoniq
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
canoniq-0.3.1-py3-none-any.whl -
Subject digest:
8f6e1f6166595c508710cdf77aa597ab87c577be376a57f14cb677b6cfccb1d1 - Sigstore transparency entry: 1693721210
- Sigstore integration time:
-
Permalink:
Buchiexplores/canoniq@9760e417d2aa244bd564bf19d266f08f37afd279 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/Buchiexplores
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@9760e417d2aa244bd564bf19d266f08f37afd279 -
Trigger Event:
push
-
Statement type: