CanonIQ

AI-Powered Canonical Mapping Engine — map messy source data into trusted canonical models.

CanonIQ profiles a source dataset, loads a versioned canonical schema, and proposes scored, explained source→canonical field mappings. It then generates validation rules, transforms data into the canonical shape, and detects schema drift when a new ingestion arrives.

CanonIQ is domain-agnostic and source-agnostic. Higher education is one bundled example — not the product. The same engine maps retail catalogs, healthcare patient records, financial transactions, and logistics shipments against their respective industry standards.

Local-first by default. CanonIQ does not send your source data, schemas, sample values, or mappings to any external service. There is no telemetry. Network access only ever happens if you explicitly configure an optional external AI adapter.


Why CanonIQ

Onboarding external data means reconciling someone else's column names, types, and conventions with your own canonical model. That work is repetitive, error-prone, and hard to audit. CanonIQ turns it into a deterministic, explainable pipeline:

  • Profile any source (CSV / JSON / JSONL today) — inferred types, patterns, PII flags.
  • Suggest mappings with a confidence score and human-readable reasons per field.
  • Gate suggestions into auto_approved / requires_review / low_confidence.
  • Validate with generated rules including format checksums (IBAN, GTIN, NPI, LEI…).
  • Transform to the canonical output, coercing types and dropping unmapped fields.
  • Detect drift when a later ingestion renames, retypes, adds, or removes fields.

Everything is typed (Pydantic v2), tested, and runs offline.
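
To make the profiling step more concrete, here is a minimal sketch of per-column type inference with a PII flag. It is not CanonIQ's profiler: the names `profile_column`, `_PATTERNS`, and `_PII_TYPES`, the regular expressions, and the majority-vote rule are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns, checked in order; the first full match wins.
_PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?\d+\.\d+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}
_PII_TYPES = {"email"}  # assumption: which inferred types count as PII


def _classify(value):
    """Return the name of the first pattern that fully matches, else 'string'."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "string"


def profile_column(values):
    """Summarize one column of raw string values by majority vote."""
    non_empty = [v.strip() for v in values if v.strip()]
    votes = Counter(_classify(v) for v in non_empty)
    inferred = votes.most_common(1)[0][0] if votes else "unknown"
    return {
        "inferred_type": inferred,
        "null_ratio": 1 - len(non_empty) / len(values) if values else 0.0,
        "pii": inferred in _PII_TYPES,
    }
```

Calling `profile_column(["a@x.io", "b@y.org", ""])` in this sketch would report an `email` column flagged as PII, with one third of the values empty.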


Install

# From PyPI
pip install canoniq

# From source (includes the runnable demo datasets)
git clone https://github.com/Buchiexplores/canoniq.git
cd canoniq
pip install -e .

Note: The runnable demo datasets ship with the repository (not the wheel), so a source checkout is the easiest way to try the bundled examples.

Optional extras

CanonIQ keeps the core dependency footprint small. Enterprise connectors are placeholders in v0.1 (they raise a clear NotImplementedError naming the target version and required extra) but the dependency groups are wired so future releases install cleanly.

| Extra | Install | Adds |
| --- | --- | --- |
| files | pip install "canoniq[files]" | Parquet, Excel readers (planned) |
| databases | pip install "canoniq[databases]" | Postgres, MySQL, SQLite (planned) |
| bigquery | pip install "canoniq[bigquery]" | BigQuery (planned) |
| snowflake | pip install "canoniq[snowflake]" | Snowflake (planned) |
| aws | pip install "canoniq[aws]" | S3, Redshift (planned) |
| gcp | pip install "canoniq[gcp]" | GCS, BigQuery (planned) |
| azure | pip install "canoniq[azure]" | Azure Blob (planned) |
| ai | pip install "canoniq[ai]" | Local semantic matching adapter (sentence-transformers, off by default) |
| all | pip install "canoniq[all]" | Everything above |
| dev | pip install "canoniq[dev]" | pytest, mypy, ruff, coverage |
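
The placeholder behaviour described above (a clear NotImplementedError naming the target version and the required extra) could look roughly like the sketch below. `PlaceholderConnector`, its constructor arguments, and the message wording are hypothetical and are not taken from the CanonIQ source.

```python
class PlaceholderConnector:
    """Stand-in for a connector that is planned but not implemented yet.

    Hypothetical class for illustration; not part of CanonIQ's API.
    """

    def __init__(self, name, extra, target_version):
        self.name = name
        self.extra = extra
        self.target_version = target_version

    def read(self, *args, **kwargs):
        # Fail loudly and tell the user what to install once it ships.
        raise NotImplementedError(
            f"The {self.name} connector is planned for {self.target_version}. "
            f'When it ships, install it with: pip install "canoniq[{self.extra}]"'
        )


# The target version here is a made-up placeholder string.
snowflake = PlaceholderConnector(
    "Snowflake", extra="snowflake", target_version="a future release"
)
```

Failing at call time rather than at import time keeps `import` cheap and lets the dependency groups exist before the connectors do.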

Quickstart (CLI)

# Run the full pipeline end-to-end against a bundled example
canoniq demo higher-ed
canoniq demo retail
canoniq demo healthcare
canoniq demo finance
canoniq demo logistics

# Profile a file directly...
canoniq profile --source examples/higher_ed/source_students.csv --out profile.json

# ...or via a source-config (secrets resolved from ${ENV})
canoniq profile --source-config examples/sources/local_csv_students.yml --out profile.json

# Suggest → rules → apply → drift
canoniq suggest    --profile profile.json --canonical examples/higher_ed/canonical_student.yml --out suggestions.json
canoniq rules      --suggestions suggestions.json --canonical examples/higher_ed/canonical_student.yml --out rules.yml
canoniq apply      --source examples/higher_ed/source_students.csv --mapping suggestions.json \
                   --canonical examples/higher_ed/canonical_student.yml --out canonical.csv --include-review
canoniq drift-check --source examples/higher_ed/new_source_students.csv --mapping suggestions.json \
                   --canonical examples/higher_ed/canonical_student.yml --out drift.json

# Auto-onboard a provider (config-driven) and score deployment readiness
canoniq onboard       --config examples/retail_vendor_onboarding/onboarding_configs/brightmart_distribution.yml
canoniq onboard-batch --config-dir examples/retail_vendor_onboarding/onboarding_configs \
                   --combined-out examples/retail_vendor_onboarding/output/combined_readiness.json
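
The `${ENV}` secret resolution mentioned for source-configs can be sketched as a recursive substitution over the parsed config. This is an illustration only: `resolve_env` and `resolve_config` are hypothetical helpers, and raising on a missing variable is an assumed policy rather than documented CanonIQ behaviour.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Replace every ${NAME} in a string with the value of NAME."""
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            # Assumption: an unset secret is an error, not an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(_lookup, value)


def resolve_config(node, env=None):
    """Walk a parsed config (dicts, lists, scalars) and resolve all strings."""
    if isinstance(node, dict):
        return {key: resolve_config(val, env) for key, val in node.items()}
    if isinstance(node, list):
        return [resolve_config(item, env) for item in node]
    if isinstance(node, str):
        return resolve_env(node, env)
    return node
```

Because only the variable name appears in the file, the config itself can be committed without exposing credentials.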

Auto-onboarding (config-driven)

Beyond the per-file pipeline, CanonIQ can auto-onboard whole providers from a YAML config: it profiles every source, maps it onto your canonical models, validates, drift-checks, and emits a single deployment-readiness score with a clear next action. CanonIQ itself deploys nothing; you get a verdict plus the canonical artifacts.

A provider is whatever supplies data in your domain (a school, a vendor, a hospital, a SaaS tenant). Two complete examples ship — same engine, different industry: higher education and retail vendors.

from canoniq.onboarding import onboard_provider

report = onboard_provider("path/to/provider.yml")
if report.auto_deploy_allowed:
    deploy(report)            # your deploy step
else:
    notify_reviewer(report)   # route to a human; see report.next_action

See the domain-neutral Auto-Onboarding Guide to build a pipeline for any field.

Quickstart (SDK)

from canoniq import CanonIQ

engine = CanonIQ()  # local-first defaults

profile = engine.profile_source("examples/retail/source_products.csv")
mapping = engine.suggest_mappings(profile, "examples/retail/canonical_product.yml")

for m in mapping.mappings:
    print(f"{m.source_field:>16} -> {m.canonical_field or '(none)':<16} "
          f"{m.confidence:.2f} {m.status}  {', '.join(m.reasons)}")

rules = engine.generate_validation_rules(mapping, "examples/retail/canonical_product.yml", profile)
result = engine.apply_mapping(
    "examples/retail/source_products.csv", mapping,
    "examples/retail/canonical_product.yml", include_review=True,
)
report = engine.detect_drift(
    "examples/retail/new_source_products.csv", mapping,
    "examples/retail/canonical_product.yml",
)
print(report.status)  # "ok" or "drift_detected"
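
To illustrate what a drift check compares, the sketch below diffs two field-to-type snapshots. It is a simplified stand-in, not `detect_drift` itself: `diff_schemas` is a hypothetical helper, and a rename shows up here as one removed plus one added field, since recognizing renames would need name or value similarity that this sketch does not attempt. Only the status strings come from the example above.

```python
def diff_schemas(baseline, incoming):
    """Compare two {field_name: type_name} snapshots of a source.

    Hypothetical helper for illustration only.
    """
    shared = set(baseline) & set(incoming)
    changes = {
        "added": sorted(set(incoming) - set(baseline)),
        "removed": sorted(set(baseline) - set(incoming)),
        "retyped": sorted(f for f in shared if baseline[f] != incoming[f]),
    }
    # Any non-empty change list means the source no longer matches the baseline.
    status = "drift_detected" if any(changes.values()) else "ok"
    return {"status": status, **changes}


old = {"sku": "string", "price": "float", "stock": "integer"}
new = {"sku": "string", "price": "string", "qty": "integer"}
report = diff_schemas(old, new)
```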

Configuration

Every tunable lives in one YAML file — thresholds, scoring weights, sampling, PII masking, and the optional AI adapter. Pass it to any CLI command with --config, or load it in the SDK. Nothing is required; unspecified keys fall back to local-first defaults. A fully commented example lives at examples/config/canoniq.yml.

canoniq demo retail --config examples/config/canoniq.yml
canoniq suggest --profile profile.json --canonical canonical.yml --config examples/config/canoniq.yml

from canoniq import CanonIQ
from canoniq.config import CanonIQConfig

engine = CanonIQ(CanonIQConfig.from_yaml("examples/config/canoniq.yml"))
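
As a rough picture of what the thresholds control, the sketch below gates a confidence score into the three statuses named earlier. The function `gate` and the cut-off values 0.85 and 0.60 are assumptions for illustration; the real defaults live in the configuration file and may differ.

```python
def gate(confidence, auto_approve=0.85, review=0.60):
    """Map a confidence in [0, 1] to a review status.

    The default thresholds are hypothetical, not CanonIQ's defaults.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= auto_approve:
        return "auto_approved"
    if confidence >= review:
        return "requires_review"
    return "low_confidence"


statuses = [gate(c) for c in (0.97, 0.72, 0.31)]
```

Raising `auto_approve` sends more mappings to a human reviewer; lowering `review` shrinks the low-confidence band.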

What AI model powers the mapping?

By default, none — and that's intentional. Core matching is a deterministic ensemble of five signals (alias, name, type, pattern, range) with weighted, explained confidence scores. It runs fully offline with zero network calls — the local-first guarantee.
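
A weighted ensemble of this kind can be sketched as a normalized weighted sum that also records which signals contributed. The weights below are invented for illustration, and `score_mapping` is a hypothetical name; only the five signal names come from the description above.

```python
# Hypothetical weights; the real values are configurable and may differ.
WEIGHTS = {"alias": 0.35, "name": 0.25, "type": 0.20, "pattern": 0.10, "range": 0.10}


def score_mapping(signals, weights=WEIGHTS):
    """Combine per-signal scores in [0, 1] into one confidence plus reasons."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    confidence = round(weighted / total_weight, 4)
    # Keep a human-readable trace of every signal that fired.
    reasons = [
        f"{name}={signals[name]:.2f}"
        for name in weights
        if signals.get(name, 0.0) > 0
    ]
    return confidence, reasons


confidence, reasons = score_mapping({"name": 0.8, "type": 1.0})
```

Because every input is a plain number and the combination is fixed arithmetic, the same inputs always give the same score, which is what makes the result auditable.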

An optional sixth "semantic" signal is pluggable and off by default. Choose a provider declaratively — no code. Three embedding backends ship today:

| Provider | Runs | Egress | Default model | API key |
| --- | --- | --- | --- | --- |
| sentence-transformers (aliases sbert, local) | locally, on-device | none | all-MiniLM-L6-v2 | |
| openai | OpenAI API | field names only | text-embedding-3-small | OPENAI_API_KEY |
| gemini (aliases google) | Gemini API | field names only | text-embedding-004 | GEMINI_API_KEY |

# canoniq.yml — local, zero egress
ai:
  provider: sentence-transformers
  model: all-MiniLM-L6-v2     # any model version; omit to use the provider default
  weight: 0.15                # semantic contribution (auto-applied when enabled)

# canoniq.yml — hosted (opt-in; sends field names to the provider)
ai:
  provider: openai            # or: gemini
  model: text-embedding-3-large
  api_key_env: OPENAI_API_KEY # env var name only — keys are never stored in config
  weight: 0.15

# local adapter needs the extra; hosted adapters need only an API key (stdlib HTTP)
pip install "canoniq[ai]"   # only for sentence-transformers
export OPENAI_API_KEY=sk-...
canoniq suggest --profile profile.json --canonical canonical.yml --config canoniq.yml

Privacy guarantees for hosted providers: only source field names and canonical schema text are sent, never sample values, so masked PII/PHI never leaves your machine. Keys come from an environment variable, never the config file. If the machine is offline or the key is missing, you get a clear error; the deterministic pipeline always works without any adapter.

Anthropic / Claude has no first-party text-embeddings API, so it can't power the embedding signal — configuring it fails fast with guidance. Claude is intended for a future optional LLM reasoning stage (resolving the requires_review band), not embeddings.

Plug in your own adapter (any provider, or a private model) by implementing BaseAIMatcher and registering it:

from canoniq.ai import BaseAIMatcher, register_ai_provider

class MyMatcher(BaseAIMatcher):
    def semantic_score(self, source_field, canonical_field) -> float:
        ...  # return a similarity in [0, 1]

register_ai_provider("my-matcher", lambda cfg: MyMatcher())
# then set ai.provider: my-matcher in your config
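
The body of `semantic_score` is left open above. One very simple stand-in that honours the [0, 1] contract is token overlap between the two field names, sketched below as a standalone function. It assumes both arguments are plain strings (the README does not say what types the adapter receives), and it is lexical rather than truly semantic, so treat it as a placeholder for a real model.

```python
import re


def _tokens(name):
    """Split snake_case, kebab-case, spaces, and camelCase into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if tok}


def token_jaccard(source_field, canonical_field):
    """Jaccard similarity of the two token sets, always within [0, 1]."""
    a, b = _tokens(source_field), _tokens(canonical_field)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Inside a custom matcher, `semantic_score` could simply return `token_jaccard(source_field, canonical_field)` until a stronger model is available.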

Use cases

CanonIQ fits anywhere you repeatedly ingest external data into a trusted model:

  • Higher education — map a new SIS/LMS export onto your OneRoster/Ed-Fi/CEDS student model.
  • Retail — normalize supplier catalogs to a GS1 GTIN / schema.org product model.
  • Healthcare — align partner feeds to an HL7 FHIR R4 Patient model with PHI masking.
  • Finance — reconcile bank/payment files to an ISO 20022 transaction model with IBAN checks.
  • Logistics — unify carrier feeds to a GS1 SSCC / SCAC shipment model.
  • SaaS onboarding — turn every customer's CSV upload into your internal schema automatically.
  • AI agent platforms — give an agent a deterministic, explainable tool for schema mapping.
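
The format checksums mentioned for the finance and retail cases follow public algorithms. The sketch below implements the IBAN mod-97 check and the GTIN check digit in plain Python. The function names are hypothetical and this is not CanonIQ's validator; in particular, the IBAN check does not enforce country-specific lengths.

```python
def iban_is_valid(iban):
    """Validate an IBAN with the mod-97 check (no per-country length rules)."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def gtin_is_valid(gtin):
    """Validate the check digit of a GTIN-8, GTIN-12, GTIN-13, or GTIN-14."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    *payload, check = (int(ch) for ch in gtin)
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(payload)))
    return (10 - total % 10) % 10 == check
```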

See the per-domain walkthroughs in docs/ and the runnable sample use-cases in examples/ (python examples/<domain>/demo.py).


Documentation

New here? Start with the Education & Onboarding Guide — a plain-English walkthrough of what CanonIQ is, why it's built this way, how to demo it, and how to extend it.

| Doc | What it covers |
| --- | --- |
| docs/education/ | Plain-English guide: approach, architecture, onboarding, demos, use cases |
| docs/quickstart.md | Install, first pipeline, CLI + SDK |
| docs/onboarding.md | Config-driven auto-onboarding: readiness scoring, build-your-own, enterprise adoption |
| docs/concepts.md | Profiles, schemas, scoring, gating, drift |
| docs/architecture.md | Module layout and data flow |
| docs/connectors.md | How to add a connector |
| docs/sources.md | Source-config format, secrets, sampling |
| docs/domain_packs.md | How to add a new domain |
| docs/standards_mapping.md | Canonical fields ↔ industry standards |
| docs/roadmap.md | Release plan |

Per-domain use cases: higher-ed · retail · healthcare · finance · logistics


Privacy & security

  • Local-first: no source data, schemas, or mappings leave your machine in the core package.
  • No telemetry in the MVP. No external API calls from the core.
  • High-PII/PHI sample values are masked by default before they leave the profiler.
  • All bundled example data is synthetic. No secrets live in the repo — source configs reference ${ENV} variables only.
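
Masking sample values before they leave the profiler could be as simple as the sketch below. The helper names, the choice to keep the last two characters, and the `*` mask character are assumptions for illustration; CanonIQ's actual masking rules are set in its configuration.

```python
def mask_value(value, keep=2, mask_char="*"):
    """Hide all but the last `keep` characters of a sample value."""
    if len(value) <= keep:
        return mask_char * len(value)  # too short to reveal anything safely
    visible_from = len(value) - keep
    return mask_char * visible_from + value[visible_from:]


def mask_samples(samples, is_pii):
    """Mask every sample when the column is flagged as PII/PHI."""
    if not is_pii:
        return list(samples)
    return [mask_value(sample) for sample in samples]


masked = mask_samples(["123-45-6789", "ab"], is_pii=True)
```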

Report vulnerabilities per SECURITY.md.


Development

pip install -e ".[dev]"
pytest --cov=canoniq --cov-report=term-missing   # ≥80% coverage, zero network calls
ruff check canoniq
mypy canoniq

See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

License

Apache-2.0 © The CanonIQ contributors.

Download files

Release 0.2.0 ships two distributions. Both were uploaded using Trusted Publishing (twine/6.1.0, CPython/3.13.12), with attestation bundles published by release.yml on Buchiexplores/canoniq.

Source distribution: canoniq-0.2.0.tar.gz (79.1 kB)

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 1544fbe98cd215647118ae946d6608d6dbec54a415ba6385ece8b4eb1dc3283d |
| MD5 | abd3f237f9a0c03c717dfc1758b1c3d5 |
| BLAKE2b-256 | b39bc757b30bbb11ffb44b95f99ab4360cb28a65708b1ebdca65f22eacbf2e30 |

Built distribution: canoniq-0.2.0-py3-none-any.whl (82.1 kB, Python 3)

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 11aa349fcd23ef1beb9f22138aa0f64ff10d4f92d721a36111786890b34cf035 |
| MD5 | fd598dcbaababcd502f94b0bbac1093f |
| BLAKE2b-256 | 62857faa71f59fda84d2cec49ec0d83e7255931e3e1317019151ae212e1d7f61 |
