AI-Powered Canonical Mapping Engine — map messy source data into trusted canonical models.

Maintainers

buchiexplores

CanonIQ

CanonIQ profiles a source dataset, loads a versioned canonical schema, and proposes scored, explained source→canonical field mappings. It then generates validation rules, transforms data into the canonical shape, and detects schema drift when a new ingestion arrives.

CanonIQ is domain-agnostic and source-agnostic. Higher education is one bundled example — not the product. The same engine maps retail catalogs, healthcare patient records, financial transactions, and logistics shipments against their respective industry standards.

Local-first by default. CanonIQ does not send your source data, schemas, sample values, or mappings to any external service. There is no telemetry. Network access only ever happens if you explicitly configure an optional external AI adapter.

Why CanonIQ

Onboarding external data means reconciling someone else's column names, types, and conventions with your own canonical model. That work is repetitive, error-prone, and hard to audit. CanonIQ turns it into a deterministic, explainable pipeline:

Profile any source (CSV / JSON / JSONL today) — inferred types, patterns, PII flags.
Suggest mappings with a confidence score and human-readable reasons per field.
Gate suggestions into auto_approved / requires_review / low_confidence.
Validate with generated rules including format checksums (IBAN, GTIN, NPI, LEI…).
Transform to the canonical output, coercing types and dropping unmapped fields.
Detect drift when a later ingestion renames, retypes, adds, or removes fields.
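
As a rough picture of the Profile step in the list above, type inference can be approximated by trying progressively looser parsers on the non-empty sample values. This is a minimal sketch, not CanonIQ's profiler: the function name `infer_type` and the type labels are assumptions made for illustration.

```python
from datetime import date


def infer_type(values: list[str]) -> str:
    """Guess a column type from string samples (hypothetical labels)."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "empty"

    def all_parse(parser) -> bool:
        # True only if every sample parses without a ValueError.
        for s in samples:
            try:
                parser(s)
            except ValueError:
                return False
        return True

    # Order matters: every integer is also a valid float, so try int first.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):
        return "date"
    return "string"


print(infer_type(["2024-01-31", "2023-12-01"]))
```

A real profiler would also need to handle locale-specific numbers, mixed columns, and pattern or PII detection, none of which this sketch attempts.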

Everything is typed (Pydantic v2), tested, and runs offline.
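
The Gate step can be pictured as two thresholds on the confidence score. The sketch below is illustrative only: the function name `gate` and the cutoffs 0.85 and 0.60 are assumptions, not CanonIQ's actual defaults.

```python
def gate(confidence: float, auto_approve_at: float = 0.85, review_at: float = 0.60) -> str:
    """Bucket a confidence score into one of the three statuses.

    The threshold values are hypothetical placeholders.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= auto_approve_at:
        return "auto_approved"
    if confidence >= review_at:
        return "requires_review"
    return "low_confidence"


for score in (0.93, 0.71, 0.20):
    print(f"{score:.2f} -> {gate(score)}")
```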

Install

# From source (recommended today — see note below)
git clone https://github.com/Buchiexplores/canoniq.git
cd canoniq
pip install -e .

# From PyPI
pip install canoniq

Note: The runnable demo datasets ship with the repository, not the wheel, so a source checkout is the easiest way to try the bundled examples.

Optional extras

CanonIQ keeps the core dependency footprint small. Enterprise connectors are placeholders in v0.1 (they raise a clear NotImplementedError naming the target version and required extra) but the dependency groups are wired so future releases install cleanly.

Extra	Install	Adds
`files`	`pip install "canoniq[files]"`	Parquet, Excel readers (planned)
`databases`	`pip install "canoniq[databases]"`	Postgres, MySQL, SQLite (planned)
`bigquery`	`pip install "canoniq[bigquery]"`	BigQuery (planned)
`snowflake`	`pip install "canoniq[snowflake]"`	Snowflake (planned)
`aws`	`pip install "canoniq[aws]"`	S3, Redshift (planned)
`gcp`	`pip install "canoniq[gcp]"`	GCS, BigQuery (planned)
`azure`	`pip install "canoniq[azure]"`	Azure Blob (planned)
`ai`	`pip install "canoniq[ai]"`	Local semantic matching adapter (sentence-transformers, off by default)
`all`	`pip install "canoniq[all]"`	Everything above
`dev`	`pip install "canoniq[dev]"`	pytest, mypy, ruff, coverage

Quickstart (CLI)

# Run the full pipeline end-to-end against a bundled example
canoniq demo higher-ed
canoniq demo retail
canoniq demo healthcare
canoniq demo finance
canoniq demo logistics

# Profile a file directly...
canoniq profile --source examples/higher_ed/source_students.csv --out profile.json

# ...or via a source-config (secrets resolved from ${ENV})
canoniq profile --source-config examples/sources/local_csv_students.yml --out profile.json

# Suggest → rules → apply → drift
canoniq suggest    --profile profile.json --canonical examples/higher_ed/canonical_student.yml --out suggestions.json
canoniq rules      --suggestions suggestions.json --canonical examples/higher_ed/canonical_student.yml --out rules.yml
canoniq apply      --source examples/higher_ed/source_students.csv --mapping suggestions.json \
                   --canonical examples/higher_ed/canonical_student.yml --out canonical.csv --include-review
canoniq drift-check --source examples/higher_ed/new_source_students.csv --mapping suggestions.json \
                   --canonical examples/higher_ed/canonical_student.yml --out drift.json

# Auto-onboard a provider (config-driven) and score deployment readiness
canoniq onboard       --config examples/retail_vendor_onboarding/onboarding_configs/brightmart_distribution.yml
canoniq onboard-batch --config-dir examples/retail_vendor_onboarding/onboarding_configs \
                   --combined-out examples/retail_vendor_onboarding/output/combined_readiness.json
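
The source-config comment above mentions secrets resolved from ${ENV}. One way such placeholder expansion could work is sketched below, assuming placeholders of the form `${NAME}`; the helper `resolve_env` is hypothetical and CanonIQ's own resolver may behave differently.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value: str, env=None) -> str:
    """Replace each ${NAME} in value with env[NAME]; fail loudly if unset."""
    env = os.environ if env is None else env

    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_sub, value)


# A fake mapping stands in for os.environ so the example is reproducible.
print(resolve_env("postgres://app:${DB_PASSWORD}@db/prod", {"DB_PASSWORD": "example"}))
```

Raising on a missing variable, rather than substituting an empty string, keeps a misconfigured secret from silently producing a broken connection string.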

Auto-onboarding (config-driven)

Beyond the per-file pipeline, CanonIQ can auto-onboard whole providers from a YAML config: it profiles every source, maps it onto your canonical models, validates, drift-checks, and emits a single deployment-readiness score with a clear next action. CanonIQ does not deploy anything itself; you get a verdict plus the canonical artifacts.

A provider is whatever supplies data in your domain (a school, a vendor, a hospital, a SaaS tenant). Two complete examples ship — same engine, different industry: higher education and retail vendors.

from canoniq.onboarding import onboard_provider

# deploy() and notify_reviewer() are placeholders for your own functions.
report = onboard_provider("path/to/provider.yml")
if report.auto_deploy_allowed:
    deploy(report)            # your deploy step
else:
    notify_reviewer(report)   # route to a human; see report.next_action

See the domain-neutral Auto-Onboarding Guide to build a pipeline for any field.

Quickstart (SDK)

from canoniq import CanonIQ

engine = CanonIQ()  # local-first defaults

profile = engine.profile_source("examples/retail/source_products.csv")
mapping = engine.suggest_mappings(profile, "examples/retail/canonical_product.yml")

for m in mapping.mappings:
    print(f"{m.source_field:>16} -> {m.canonical_field or '(none)':<16} "
          f"{m.confidence:.2f} {m.status}  {', '.join(m.reasons)}")

rules = engine.generate_validation_rules(mapping, "examples/retail/canonical_product.yml", profile)
result = engine.apply_mapping(
    "examples/retail/source_products.csv", mapping,
    "examples/retail/canonical_product.yml", include_review=True,
)
report = engine.detect_drift(
    "examples/retail/new_source_products.csv", mapping,
    "examples/retail/canonical_product.yml",
)
print(report.status)  # "ok" or "drift_detected"
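
As a mental model for the drift check, the comparison can be thought of as a diff between the field names and types seen at mapping time and those in a later ingestion. The sketch below works on plain dicts and is not CanonIQ's implementation: `diff_schemas` and the result keys are assumptions, and a rename simply shows up as one removal plus one addition.

```python
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> dict:
    """Compare two {field_name: type_name} snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Fields present in both snapshots whose type label changed.
    retyped = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    status = "drift_detected" if (added or removed or retyped) else "ok"
    return {"status": status, "added": added, "removed": removed, "retyped": retyped}


baseline = {"sku": "string", "price": "float"}
later = {"sku": "string", "price": "string", "brand": "string"}
print(diff_schemas(baseline, later))
```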

Configuration

Every tunable lives in one YAML file — thresholds, scoring weights, sampling, PII masking, and the optional AI adapter. Pass it to any CLI command with --config, or load it in the SDK. Nothing is required; unspecified keys fall back to local-first defaults. A fully commented example lives at examples/config/canoniq.yml.

canoniq demo retail --config examples/config/canoniq.yml
canoniq suggest --profile profile.json --canonical canonical.yml --config examples/config/canoniq.yml

from canoniq import CanonIQ
from canoniq.config import CanonIQConfig

engine = CanonIQ(CanonIQConfig.from_yaml("examples/config/canoniq.yml"))
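
The "unspecified keys fall back to defaults" behaviour can be sketched as a recursive merge of user settings over a defaults dictionary. The key names and values below are invented for illustration and are not CanonIQ's real config schema; `with_defaults` is a hypothetical helper.

```python
from copy import deepcopy

# Hypothetical defaults; the real key names may differ.
DEFAULTS = {
    "thresholds": {"auto_approve": 0.85, "review": 0.60},
    "ai": {"provider": None},
}


def with_defaults(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return defaults overlaid with user values, merging nested dicts."""
    merged = deepcopy(defaults)  # never mutate the shared defaults
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


cfg = with_defaults({"thresholds": {"review": 0.5}})
print(cfg)
```

Only `thresholds.review` is overridden here; every other key keeps its default value.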

What AI model powers the mapping?

By default, none — and that's intentional. Core matching is a deterministic ensemble of five signals (alias, name, type, pattern, range) with weighted, explained confidence scores. It runs fully offline with zero network calls — the local-first guarantee.
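
To make the idea of a weighted, explained ensemble concrete, here is a minimal sketch that combines five per-signal scores into one confidence value with a list of reasons. The weights are made up for the example, and `combine` is a hypothetical name; CanonIQ's real scoring may be structured differently.

```python
# Hypothetical weights for the five signals named above.
WEIGHTS = {"alias": 0.35, "name": 0.30, "type": 0.15, "pattern": 0.10, "range": 0.10}


def combine(signals: dict[str, float], weights: dict[str, float] = WEIGHTS) -> tuple[float, list[str]]:
    """Weighted average of signal scores in [0, 1], plus human-readable reasons."""
    total_weight = sum(weights.values())
    # Missing signals count as 0.0 so they lower the score rather than being ignored.
    score = sum(weights[name] * signals.get(name, 0.0) for name in weights) / total_weight
    reasons = [f"{name}={signals[name]:.2f}" for name in weights if signals.get(name, 0.0) > 0]
    return round(score, 4), reasons


score, reasons = combine({"name": 0.8, "type": 1.0})
print(score, reasons)
```

Because the inputs and weights are fixed numbers, the same pair of fields always produces the same score and the same reasons, which is what makes the result auditable.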

An optional sixth "semantic" signal is pluggable and off by default. Choose a provider declaratively — no code. Three embedding backends ship today:

Provider	Runs	Egress	Default model	API key
`sentence-transformers` (aliases `sbert`, `local`)	locally, on-device	none	`all-MiniLM-L6-v2`	—
`openai`	OpenAI API	field names only	`text-embedding-3-small`	`OPENAI_API_KEY`
`gemini` (alias `google`)	Gemini API	field names only	`text-embedding-004`	`GEMINI_API_KEY`

# canoniq.yml — local, zero egress
ai:
  provider: sentence-transformers
  model: all-MiniLM-L6-v2     # any model version; omit to use the provider default
  weight: 0.15                # semantic contribution (auto-applied when enabled)

# canoniq.yml — hosted (opt-in; sends field names to the provider)
ai:
  provider: openai            # or: gemini
  model: text-embedding-3-large
  api_key_env: OPENAI_API_KEY # env var name only — keys are never stored in config
  weight: 0.15

# local adapter needs the extra; hosted adapters need only an API key (stdlib HTTP)
pip install "canoniq[ai]"   # only for sentence-transformers
export OPENAI_API_KEY=sk-...
canoniq suggest --profile profile.json --canonical canonical.yml --config canoniq.yml

Privacy guarantees for hosted providers: only source field names and canonical schema text are sent, never sample values, so PII/PHI values, masked or not, never leave your machine. Keys come from an environment variable, never the config file. If the provider is unreachable or the key is missing, CanonIQ raises a clear error; the deterministic pipeline always works without any adapter.

Anthropic / Claude has no first-party text-embeddings API, so it can't power the embedding signal — configuring it fails fast with guidance. Claude is intended for a future optional LLM reasoning stage (resolving the requires_review band), not embeddings.

Plug in your own adapter (any provider, or a private model) by implementing BaseAIMatcher and registering it:

from canoniq.ai import BaseAIMatcher, register_ai_provider

class MyMatcher(BaseAIMatcher):
    def semantic_score(self, source_field, canonical_field) -> float:
        ...  # return a similarity in [0, 1]

register_ai_provider("my-matcher", lambda cfg: MyMatcher())
# then set ai.provider: my-matcher in your config
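
The body of `semantic_score` can be any function that returns a similarity in [0, 1]. As a dependency-free starting point, the sketch below computes the Jaccard overlap of name tokens. It is a standalone function rather than a `BaseAIMatcher` subclass because the exact shape of the field arguments is not documented here; it assumes plain field-name strings, and `token_jaccard` is a hypothetical name.

```python
import re


def _tokens(name: str) -> set[str]:
    # Insert a space at camelCase boundaries, then split on non-alphanumerics.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return {t for t in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if t}


def token_jaccard(source_name: str, canonical_name: str) -> float:
    """Share of name tokens the two fields have in common, in [0, 1]."""
    a, b = _tokens(source_name), _tokens(canonical_name)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


print(token_jaccard("studentFirstName", "first_name"))
```

Token overlap is lexical rather than semantic, so it will miss synonyms such as "surname" and "last_name"; an embedding-based matcher is the natural replacement once the plumbing works.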

Use cases

CanonIQ fits anywhere you repeatedly ingest external data into a trusted model:

Higher education — map a new SIS/LMS export onto your OneRoster/Ed-Fi/CEDS student model.
Retail — normalize supplier catalogs to a GS1 GTIN / schema.org product model.
Healthcare — align partner feeds to an HL7 FHIR R4 Patient model with PHI masking.
Finance — reconcile bank/payment files to an ISO 20022 transaction model with IBAN checks.
Logistics — unify carrier feeds to a GS1 SSCC / SCAC shipment model.
SaaS onboarding — turn every customer's CSV upload into your internal schema automatically.
AI agent platforms — give an agent a deterministic, explainable tool for schema mapping.

See the per-domain walkthroughs in docs/ and the runnable sample use-cases in examples/ (python examples/<domain>/demo.py).
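
Two of the identifiers named above carry a built-in checksum, which is what makes format validation possible without a lookup service. The sketch below shows the standard GTIN check-digit rule and the IBAN mod-97 rule. The function names are illustrative and not part of CanonIQ's API, and the IBAN check deliberately skips country-specific length rules.

```python
def gtin_is_valid(code: str) -> bool:
    """Validate a GTIN-8/12/13/14 check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def iban_is_valid(iban: str) -> bool:
    """Validate the IBAN mod-97 checksum (structure only, no per-country length check)."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # Base-36 digit values map A..Z to 10..35, as the rule requires.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(gtin_is_valid("4006381333931"), iban_is_valid("GB82 WEST 1234 5698 7654 32"))
```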

Documentation

New here? Start with the Education & Onboarding Guide — a plain-English walkthrough of what CanonIQ is, why it's built this way, how to demo it, and how to extend it.

Doc	What it covers
docs/education/	Plain-English guide: approach, architecture, onboarding, demos, use cases
docs/quickstart.md	Install, first pipeline, CLI + SDK
docs/onboarding.md	Config-driven auto-onboarding: readiness scoring, build-your-own, enterprise adoption
docs/concepts.md	Profiles, schemas, scoring, gating, drift
docs/architecture.md	Module layout and data flow
docs/connectors.md	How to add a connector
docs/sources.md	Source-config format, secrets, sampling
docs/domain_packs.md	How to add a new domain
docs/standards_mapping.md	Canonical fields ↔ industry standards
docs/roadmap.md	Release plan

Per-domain use cases: higher-ed · retail · healthcare · finance · logistics

Privacy & security

Local-first: no source data, schemas, or mappings leave your machine in the core package.
No telemetry in the MVP. No external API calls from the core.
High-PII/PHI sample values are masked by default before they leave the profiler.
All bundled example data is synthetic. No secrets live in the repo — source configs reference ${ENV} variables only.
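
To illustrate what masking sample values can look like, the sketch below replaces letters and digits with placeholders while keeping punctuation, so the shape of a value survives even though its content does not. This is an assumption about one reasonable approach, not a description of CanonIQ's masking; `mask_value` and its `keep_last` parameter are hypothetical.

```python
def mask_value(value: str, keep_last: int = 0) -> str:
    """Mask letters as 'x' and digits as '9', optionally keeping the last few characters."""
    cutoff = max(len(value) - keep_last, 0)
    out = []
    for i, ch in enumerate(value):
        if i >= cutoff:
            out.append(ch)   # tail left readable on request
        elif ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append(ch)   # separators and punctuation preserve the format
    return "".join(out)


print(mask_value("Jane Doe"))
print(mask_value("123-45-6789", keep_last=4))
```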

Report vulnerabilities per SECURITY.md.

Development

pip install -e ".[dev]"
pytest --cov=canoniq --cov-report=term-missing   # ≥80% coverage, zero network calls
ruff check canoniq
mypy canoniq

See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

License

Apache-2.0 © The CanonIQ contributors.

Release history

Version	Released
0.3.1	Jun 1, 2026
0.3.0	Jun 1, 2026
0.2.2	Jun 1, 2026
0.2.1	Jun 1, 2026
0.2.0 (this version)	Jun 1, 2026

Download files

Source Distribution

canoniq-0.2.0.tar.gz (79.1 kB), uploaded Jun 1, 2026

Built Distribution

canoniq-0.2.0-py3-none-any.whl (82.1 kB), uploaded Jun 1, 2026, Python 3

File details

Details for the file canoniq-0.2.0.tar.gz.

File metadata

Download URL: canoniq-0.2.0.tar.gz
Upload date: Jun 1, 2026
Size: 79.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for canoniq-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`1544fbe98cd215647118ae946d6608d6dbec54a415ba6385ece8b4eb1dc3283d`
MD5	`abd3f237f9a0c03c717dfc1758b1c3d5`
BLAKE2b-256	`b39bc757b30bbb11ffb44b95f99ab4360cb28a65708b1ebdca65f22eacbf2e30`

Provenance

The following attestation bundles were made for canoniq-0.2.0.tar.gz:

Publisher: release.yml on Buchiexplores/canoniq

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: canoniq-0.2.0.tar.gz
- Subject digest: 1544fbe98cd215647118ae946d6608d6dbec54a415ba6385ece8b4eb1dc3283d
- Sigstore transparency entry: 1690355434
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: Buchiexplores/canoniq@a2b77eb4e6717033e0d4e773e6073363e560471f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Buchiexplores
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a2b77eb4e6717033e0d4e773e6073363e560471f
- Trigger Event: push

File details

Details for the file canoniq-0.2.0-py3-none-any.whl.

File metadata

Download URL: canoniq-0.2.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 82.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for canoniq-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11aa349fcd23ef1beb9f22138aa0f64ff10d4f92d721a36111786890b34cf035`
MD5	`fd598dcbaababcd502f94b0bbac1093f`
BLAKE2b-256	`62857faa71f59fda84d2cec49ec0d83e7255931e3e1317019151ae212e1d7f61`

Provenance

The following attestation bundles were made for canoniq-0.2.0-py3-none-any.whl:

Publisher: release.yml on Buchiexplores/canoniq

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: canoniq-0.2.0-py3-none-any.whl
- Subject digest: 11aa349fcd23ef1beb9f22138aa0f64ff10d4f92d721a36111786890b34cf035
- Sigstore transparency entry: 1690355517
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: Buchiexplores/canoniq@a2b77eb4e6717033e0d4e773e6073363e560471f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Buchiexplores
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a2b77eb4e6717033e0d4e773e6073363e560471f
- Trigger Event: push
