Dictionary-first extraction, alias canonicalization, and lightweight reranking utilities for technical documents.
skeinrank-core
skeinrank-core is the lightweight Python SDK and CLI for deterministic local terminology canonicalization.
It is the zero-friction entrypoint for SkeinRank: no Governance API, Elasticsearch, RabbitMQ, Celery, Docker, OpenRouter token, or ML dependencies are required.
30-second demo
import skeinrank
print(skeinrank.canonicalize("k8s pg timeout"))
# kubernetes postgresql timeout
print(skeinrank.extract("sev1 on kube after deploy"))
# ['critical incident', 'kubernetes', 'deployment']
The module-level helpers use a built-in platform_ops_demo dictionary so the first call works without a file. The demo dictionary is small enough to inspect, but expressive enough to show infrastructure, incidents, CI/CD, search, RAG, and context-shaped company language.
The same built-in dictionary also demonstrates why context matters:
import skeinrank
print(skeinrank.canonicalize("pg timeout"))
# postgresql timeout
print(skeinrank.canonicalize("pg layout"))
# page layout
print(skeinrank.canonicalize("pg dashboard"))
# product group
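One way to picture context-shaped resolution is a small rule table keyed by neighboring words. The sketch below is not the SkeinRank implementation: `CONTEXT_RULES`, `DEFAULT_SENSE`, `resolve_alias`, and the cue words are hypothetical, chosen only to mirror the three `pg` senses shown above.

```python
# Hypothetical illustration of context-shaped alias resolution.
# Not the SkeinRank implementation: rule names and cue words are assumptions.
CONTEXT_RULES = {
    "pg": [
        ({"layout", "margin", "footer"}, "page"),
        ({"dashboard", "roadmap"}, "product group"),
    ],
}
DEFAULT_SENSE = {"pg": "postgresql"}


def resolve_alias(alias: str, context_tokens: list[str]) -> str:
    """Pick a canonical value for `alias` based on the surrounding tokens."""
    context = {token.lower() for token in context_tokens}
    for cues, canonical in CONTEXT_RULES.get(alias.lower(), []):
        if cues & context:  # any cue word present selects this sense
            return canonical
    # No cue matched: fall back to the default sense, or keep the alias as-is.
    return DEFAULT_SENSE.get(alias.lower(), alias)


print(resolve_alias("pg", ["timeout"]))
print(resolve_alias("pg", ["layout"]))
print(resolve_alias("pg", ["dashboard"]))
```

The rules are evaluated in a fixed order, so the same input always resolves the same way.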
CLI from a source checkout:
poetry run skeinrank canonicalize "k8s pg timeout" --text
poetry run skeinrank extract "sev1 on kube after deploy" --text --compact
Install from a checkout
cd packages/skeinrank-core
poetry install
poetry run pytest -q
Legacy reranking modules remain in the source tree for compatibility, but the package no longer exposes heavyweight ML install extras. The local SDK facade, demo dictionary, CLI, and document helpers do not require ML dependencies.
Public Python facade
Use SkeinRank when you want to pass a dictionary in code:
from skeinrank import SkeinRank
sr = SkeinRank({
    "kubernetes": ["k8s", "kube", "kuber"],
    "postgresql": ["pg", "postgres", "psql"],
})
print(sr.canonicalize("kuber timeout on pg"))
# kubernetes timeout on postgresql
print(sr.extract("kuber timeout on pg"))
# ['kubernetes', 'postgresql']
Use explain=True when you need offsets, slots, and highlighted evidence:
result = sr.extract("k8s rollout uses pg", explain=True)
print(result.canonical_values)
print(result.matches[0].alias)
print(result.matches[0].highlighted_fragment)
The same facade can load a full SkeinRank dictionary JSON/YAML file:
from skeinrank import SkeinRank
sr = SkeinRank.from_file("company.dictionary.yaml")
print(sr.canonicalize("k8s rollout uses pg database"))
Built-in demo dictionary and examples
The built-in platform_ops_demo dictionary contains more than 30 canonical terms and more than 80 aliases across platform operations, incidents, CI/CD, search, RAG, and SkeinRank concepts. It is intentionally not a production vocabulary; it is a compact first-touch dictionary for demos, tests, tutorials, and screenshots.
Useful demo phrases:
| Input | Output |
|---|---|
| k8s pg timeout | kubernetes postgresql timeout |
| sev1 on kube after pg migration | critical incident on kubernetes after postgresql database migration |
| gha deploy hit rmq latency spike | github actions, deployment, message queue, latency |
| pg layout | page layout |
| pg dashboard | product group |
Examples live in ../../examples/sdk:
- zero_friction_demo.py runs the facade from Python.
- platform_ops_demo.dictionary.json exports the built-in dictionary in the public dictionary shape.
Dictionary-first SDK
The lower-level dictionary SDK remains available for callers that already use the governance export or skeinrank-migrate dictionary shape.
from skeinrank import load_dictionary, extract_terms, canonicalize_text
dictionary = load_dictionary("../../examples/migration/console_dictionary.example.json")
result = extract_terms(
    "This instruction helps deploy 500 k8s servers backed by Postgres.",
    dictionary=dictionary,
)
print(result.canonical_values) # ['kubernetes', 'postgresql']
canonicalized = canonicalize_text(
    "k8s rollout uses pg database",
    dictionary=dictionary,
)
print(canonicalized.text) # kubernetes rollout uses postgresql database
Stable dictionary exports include:
- SkeinRank, canonicalize(...), extract(...), demo_dictionary(...), demo_dictionary_payload(...)
- Dictionary, DictionaryTerm, DictionaryAlias, DictionaryStopListEntry
- load_dictionary(...), validate_dictionary(...)
- extract_terms(...), canonicalize_text(...)
- ExtractionResult, TermMatch, CanonicalizedText
- DictionaryDraft, DraftCandidate, DraftFinding, EvidenceSnippet
- DictionarySuggestionConfig, DictionarySuggestionResult, suggest_dictionary(...), suggest_dictionary_from_documents(...)
- TerminologyDriftReport, DriftFinding, DriftEvidence, DriftSeverity, DriftFindingType
The matcher is deterministic and local. It honors active/deprecated term and alias statuses as well as profile/global stop lists, returns character offsets for every match, and includes evidence snippets with <mark>...</mark> highlights.
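To make the offsets and highlights concrete, here is a minimal stdlib sketch of a deterministic alias matcher. It is an illustration, not the package's matcher: `SketchMatch`, `find_matches`, the `{canonical: [aliases]}` input shape, and the `context_chars` default are assumptions, and term/alias statuses are left out.

```python
import re
from dataclasses import dataclass


@dataclass
class SketchMatch:
    alias: str
    canonical: str
    start: int
    end: int
    highlighted_fragment: str


def find_matches(text, aliases, stop_list=(), context_chars=20):
    """Return whole-word alias matches with offsets and <mark> highlights."""
    stopped = {entry.lower() for entry in stop_list}
    alias_to_canonical = {}
    for canonical, names in aliases.items():
        for name in (canonical, *names):
            if name.lower() not in stopped:  # stop-listed forms never match
                alias_to_canonical[name.lower()] = canonical
    if not alias_to_canonical:
        return []
    # Try longer aliases first so a short alias cannot shadow a longer one.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    matches = []
    for found in pattern.finditer(text):
        start, end = found.span()
        left = text[max(0, start - context_chars):start]
        right = text[end:end + context_chars]
        fragment = f"{left}<mark>{found.group(0)}</mark>{right}"
        canonical = alias_to_canonical[found.group(0).lower()]
        matches.append(SketchMatch(found.group(0), canonical, start, end, fragment))
    return matches


demo_aliases = {"kubernetes": ["k8s", "kube"], "postgresql": ["pg", "postgres"]}
for match in find_matches("k8s rollout uses pg", demo_aliases):
    print(match.canonical, match.start, match.end, match.highlighted_fragment)
```

Because the pattern is built from a sorted alias list and scanned left to right, the same text and dictionary always yield the same matches in the same order.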
Document text extraction utilities
Local document helpers can extract text before running the SDK matcher. They do not require the Governance API, Elasticsearch, Celery, or a database.
from skeinrank import load_document_text, extract_terms_from_document
text = load_document_text("incident-runbook.md")
result = extract_terms_from_document(
    "incident-runbook.md",
    dictionary="../../examples/migration/console_dictionary.example.json",
)
print(result.document.file_name)
print(result.extraction.canonical_values)
Supported formats without extra dependencies:
- text-like files: .txt, .md, .rst, .log, .csv, .tsv, .json, .jsonl, .yaml, .yml
- .html/.htm with scripts/styles ignored
- .docx via a small stdlib ZIP/XML reader
PDF extraction is supported when the caller installs pypdf in the environment. The core package does not require it by default.
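For readers curious what a stdlib-only .docx reader can look like, the following sketch opens the archive with `zipfile` and walks `word/document.xml` with `xml.etree.ElementTree`. It is an assumption about the general technique, not the package's reader; `docx_to_text` is a hypothetical helper and it ignores tables, headers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source) -> str:
    """Extract paragraph text from a .docx path or binary file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; join their text nodes in order.
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Usage would be `docx_to_text("incident-runbook.docx")`, with the result passed to the matcher like any other text.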
Local terminology drift reports
Compare a dictionary with local documents to see which significant terms are not covered yet. This is a report-only workflow: it does not create proposals, publish snapshots, change bindings, or mutate runtime state.
poetry run skeinrank drift scan \
--dictionary ../../examples/drift-scan/company.dictionary.json \
--docs ../../examples/drift-scan/docs \
--out ../../examples/drift-scan/drift-report.json \
--markdown ../../examples/drift-scan/drift-report.md
The report uses the versioned TerminologyDriftReport schema and includes alias_drift findings for uncovered terminology, stale_term findings for dictionary entries that no longer appear in the scanned corpus, optional binding_lag findings for pinned-vs-latest snapshot metadata, and conservative ambiguity_signal findings when an existing short alias appears in unfamiliar contexts. It also includes evidence snippets and unknown_alias_rate. It is intentionally a local terminology drift report, not a real-time monitor or search observability system.
Ambiguity signals are review hints, not automatic meaning changes. Disable them with --no-ambiguity-signals when you only want uncovered aliases, stale terms, and binding lag.
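As a rough illustration of one finding type, a stale-term check can be sketched as "no canonical name or alias of the term occurs in the scanned corpus". The helper below is hypothetical (`find_stale_terms` and the `{canonical: [aliases]}` shape are assumptions) and does not reproduce the scanner's actual rules or thresholds.

```python
import re


def find_stale_terms(dictionary, documents):
    """Return canonical terms whose names and aliases never occur in `documents`."""
    corpus = "\n".join(documents).lower()
    stale = []
    for canonical, aliases in dictionary.items():
        forms = (canonical, *aliases)
        # Whole-word search so "pg" is not counted inside "pgbouncer".
        seen = any(
            re.search(r"(?<!\w)" + re.escape(form.lower()) + r"(?!\w)", corpus)
            for form in forms
        )
        if not seen:
            stale.append(canonical)
    return stale


demo = {"kubernetes": ["k8s"], "rabbitmq": ["rmq"]}
print(find_stale_terms(demo, ["k8s timeout after deploy"]))
```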
Add optional binding metadata when you want the report to show snapshot lag without connecting to the Governance API:
poetry run skeinrank drift scan \
--dictionary ../../examples/drift-scan/company.dictionary.json \
--docs ../../examples/drift-scan/docs \
--binding-metadata ../../examples/drift-scan/binding-metadata.json
from skeinrank import DriftScanConfig, scan_dictionary_drift
report = scan_dictionary_drift(
    dictionary="company.dictionary.json",
    docs=["./docs"],
    config=DriftScanConfig(
        binding_id="infra_incidents_prod",
        pinned_snapshot_version="S42",
        latest_snapshot_version="S47",
        discovery={"min_frequency": 2},
    ),
)
print(report.to_markdown())
After review, turn alias-drift findings into a local dictionary draft without mutating production state:
poetry run skeinrank drift export-draft ../../examples/drift-scan/drift-report.json \
--out ../../examples/drift-scan/drift.dictionary-draft.json \
--review ../../examples/drift-scan/drift.dictionary-draft.md
from skeinrank import drift_report_to_dictionary_draft
result = drift_report_to_dictionary_draft("drift-report.json")
print(result.review_markdown())
result.save("drift.dictionary-draft.json")
Only alias_drift findings become draft candidates. Stale terms, binding lag, and ambiguity signals are preserved as review findings so a human can decide whether to create dictionary proposals, context rules, or rollout tasks later.
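The split described above can be pictured as a simple partition by finding type. This sketch works on plain dicts with a `finding_type` key, which is an assumed shape for illustration; `split_findings` is a hypothetical helper, not part of the package.

```python
def split_findings(findings):
    """Separate draft candidates (alias_drift) from review-only findings."""
    candidates = [f for f in findings if f["finding_type"] == "alias_drift"]
    review_only = [f for f in findings if f["finding_type"] != "alias_drift"]
    return candidates, review_only


findings = [
    {"finding_type": "alias_drift", "value": "kubelet oom"},
    {"finding_type": "stale_term", "value": "rabbitmq"},
    {"finding_type": "binding_lag", "value": "S42 -> S47"},
]
candidates, review_only = split_findings(findings)
print([f["value"] for f in candidates])
print([f["value"] for f in review_only])
```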
See ../../docs/guides/terminology-drift-report.md and ../../examples/drift-scan for the complete local workflow, Python examples, report fields, and safety boundary.
Local CLI
Validate a dictionary exported from the governance API or used by skeinrank-migrate:
poetry run skeinrank validate-dictionary ../../examples/migration/console_dictionary.example.json
poetry run skeinrank validate-dictionary ../../examples/migration/console_dictionary.example.yaml --json
Run zero-config demo extraction/canonicalization:
poetry run skeinrank extract "k8s rollout uses pg database" --text --compact
poetry run skeinrank canonicalize "k8s rollout uses pg database" --text
Print or export the built-in demo dictionary:
poetry run skeinrank demo-dictionary --compact
poetry run skeinrank demo-dictionary --output ../../examples/sdk/platform_ops_demo.dictionary.json
Convert existing term lists into a SkeinRank dictionary candidate:
poetry run skeinrank import-dictionary ../../examples/import-dictionary/company_terms.csv \
--name platform_ops_import \
--out ../../examples/import-dictionary/company_terms.dictionary.json
poetry run skeinrank import-dictionary ../../examples/import-dictionary/es_synonyms.txt \
--format es-synonyms \
--name platform_ops_import \
--out ../../examples/import-dictionary/es_synonyms.dictionary.json
The import path accepts simple JSON dictionaries, CSV files with canonical/alias columns, and Elasticsearch/OpenSearch synonym-list files. It writes a local candidate dictionary and prints a review report; it does not mutate governance state, snapshots, bindings, or runtime search.
The review report also runs the imported candidate through the same lightweight dictionary validator used by validate-dictionary. Validator findings are surfaced in the import report so risky aliases, runtime collisions, and short ambiguous forms can be reviewed before the candidate is used. Use --no-validate when you only want a raw conversion report, or --strict-validate when validator errors should block the generated file.
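For orientation, a synonym-list conversion can be sketched as below. It handles the two common line shapes in Solr/Elasticsearch synonym files: explicit mappings (`a, b => c`) and comma-separated equivalents. Treating the first entry of an equivalents line as the canonical value is an assumption of this sketch, and `parse_synonym_lines` is a hypothetical helper rather than the importer's code.

```python
def parse_synonym_lines(lines):
    """Convert synonym-list lines into a {canonical: [aliases]} mapping."""
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        if "=>" in line:
            left, right = line.split("=>", 1)
            aliases = [part.strip() for part in left.split(",") if part.strip()]
            targets = [part.strip() for part in right.split(",") if part.strip()]
            if len(targets) != 1:
                continue  # zero or several targets: leave for manual review
            canonical = targets[0]
        else:
            parts = [part.strip() for part in line.split(",") if part.strip()]
            if len(parts) < 2:
                continue
            # Assumption: the first entry of an equivalents line is canonical.
            canonical, aliases = parts[0], parts[1:]
        bucket = result.setdefault(canonical, [])
        for alias in aliases:
            if alias != canonical and alias not in bucket:
                bucket.append(alias)
    return result


print(parse_synonym_lines(["k8s, kube => kubernetes", "postgresql, pg, postgres"]))
```

Lines that map to several targets are skipped rather than guessed, which keeps the candidate conservative until a reviewer decides.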
Write a reviewable draft when the imported file should go through an explicit human review step before becoming a runtime dictionary:
poetry run skeinrank import-dictionary ../../examples/import-dictionary/es_synonyms.txt \
--format es-synonyms \
--name platform_ops_import \
--draft-out ../../examples/import-dictionary/es_synonyms.dictionary-draft.json
Drafts keep imported candidates in proposed status. In Python, reviewers can inspect the draft, accept candidates, and only then explicitly export a runtime dictionary:
from skeinrank import DictionaryDraft
draft = DictionaryDraft.from_file("company.dictionary-draft.json")
print(draft.review_markdown())
runtime_dictionary = draft.accept_all().to_dictionary()
Suggest a reviewable draft directly from local documents when there is no dictionary yet:
poetry run skeinrank suggest-dictionary ../../examples/suggest-dictionary/docs \
--profile-name platform_candidates \
--min-frequency 2 \
--out ../../examples/suggest-dictionary/platform_candidates.dictionary-draft.json \
--review ../../examples/suggest-dictionary/platform_candidates.review.md
The suggestion path is deterministic and local. It uses the same candidate discovery engine described below, filters known dictionary terms when --dictionary is provided, and keeps all suggestions in proposed status for review.
Optionally ask OpenRouter to group and name the deterministic candidates. The assistant receives only evidence-backed candidate summaries, not production credentials or runtime state, and returns a reviewable draft. Runtime canonicalization remains deterministic after review:
export OPENROUTER_API_KEY="..."
export OPENROUTER_MODEL="provider/model"
poetry run skeinrank assist-dictionary ../../examples/agent-dictionary-assistant/docs \
--model "$OPENROUTER_MODEL" \
--profile-name platform_assisted_terms \
--out ../../examples/agent-dictionary-assistant/platform_assisted.dictionary-draft.json \
--review ../../examples/agent-dictionary-assistant/platform_assisted.review.md
The OpenRouter-assisted path does not publish snapshots, mutate bindings, or write runtime dictionaries automatically. It only improves a local draft for human review.
Detailed guides and runnable examples:
- Import existing dictionaries and examples/import-dictionary for CSV, JSON, and Elasticsearch/OpenSearch synonym lists.
- Agent dictionary assistant, examples/suggest-dictionary, and examples/agent-dictionary-assistant for deterministic and optional OpenRouter-assisted draft creation.
Run the example script:
poetry run python ../../examples/sdk/zero_friction_demo.py
Run against a specific dictionary file:
poetry run skeinrank extract "k8s rollout uses pg database" \
--text \
--dictionary ../../examples/migration/console_dictionary.example.json
poetry run skeinrank canonicalize incident-runbook.md \
--dictionary ../../examples/migration/console_dictionary.example.json \
--output incident-runbook.canonicalized.txt
Extract plain text from a document before matching:
poetry run skeinrank document-text incident-runbook.docx --output incident-runbook.txt
The CLI returns JSON for extract, raw text by default for canonicalize and document-text, and supports --output, --compact, --max-matches, and --context-chars where relevant.
Candidate discovery engine
The core package also includes a deterministic candidate discovery engine for cold-start dictionary suggestions and future terminology drift reports. It scans local text, filters known dictionary terms, ranks unmatched technical candidates, and returns evidence snippets for review.
from skeinrank import CandidateDiscoveryConfig, discover_candidates, demo_dictionary
report = discover_candidates(
    [
        {"source": "incident-1.md", "text": "Kubelet OOM after pg migration"},
        {"source": "incident-2.md", "text": "Kubelet OOM returned during deploy"},
    ],
    dictionary=demo_dictionary(),
    config=CandidateDiscoveryConfig(min_frequency=2),
)
for candidate in report.top_candidates(5):
    print(candidate.value, candidate.mention_count, candidate.evidence[0].text)
Candidate discovery does not create runtime terminology, mutate snapshots, or publish bindings. It is a shared local engine that later workflows can use to build reviewable drafts, import reports, and drift scans.
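The general shape of such a discovery pass (filter known terms, count what is left, apply a frequency floor, keep evidence) can be sketched with the standard library. This is a simplified single-token illustration under assumed names (`rank_unknown_tokens`, `min_length`), not the package's engine, which also surfaces multi-word candidates.

```python
import re
from collections import Counter


def rank_unknown_tokens(documents, known_terms, min_frequency=2, min_length=3):
    """Rank tokens that are not known dictionary forms, with first evidence."""
    known = {term.lower() for term in known_terms}
    counts = Counter()
    first_evidence = {}
    for doc in documents:
        for token in re.findall(r"[a-z0-9]+", doc["text"].lower()):
            if token in known or len(token) < min_length:
                continue
            counts[token] += 1
            # Remember where the token was first seen for reviewer context.
            first_evidence.setdefault(
                token, {"source": doc["source"], "text": doc["text"]}
            )
    # Most frequent first; ties broken alphabetically to stay deterministic.
    ranked = sorted(
        (token for token, n in counts.items() if n >= min_frequency),
        key=lambda token: (-counts[token], token),
    )
    return [
        {"value": t, "mention_count": counts[t], "evidence": first_evidence[t]}
        for t in ranked
    ]


docs = [
    {"source": "incident-1.md", "text": "Kubelet OOM after pg migration"},
    {"source": "incident-2.md", "text": "Kubelet OOM returned during deploy"},
]
for item in rank_unknown_tokens(docs, known_terms=["pg", "migration", "deploy"]):
    print(item["value"], item["mention_count"], item["evidence"]["source"])
```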
Build a reviewable draft from documents in Python:
from skeinrank import suggest_dictionary_from_documents
result = suggest_dictionary_from_documents(
    ["../../examples/suggest-dictionary/docs"],
    config={
        "profile_name": "platform_candidates",
        "discovery": {"min_frequency": 2},
    },
)
result.save("platform_candidates.dictionary-draft.json")
print(result.review_markdown())
The draft is a local review artifact, not a production dictionary. Reviewers can accept or reject candidates and explicitly convert accepted candidates for preview when needed.
Use OpenRouter as an optional grouping layer after deterministic discovery:
import os
from skeinrank import build_dictionary_from_docs
result = build_dictionary_from_docs(
    ["../../examples/agent-dictionary-assistant/docs"],
    model=os.environ["OPENROUTER_MODEL"],
)
result.save("platform_assisted.dictionary-draft.json")
print(result.review_markdown())
Every assistant candidate must map back to deterministic local evidence. Aliases without evidence are dropped, and candidates without evidence are ignored.
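The evidence rule can be read as a filter applied to whatever the assistant returns. The sketch below assumes candidates are dicts with `canonical` and `aliases` keys and that `evidence_index` maps a surface form to the snippets found for it during deterministic discovery; both shapes and the helper name `keep_evidence_backed` are hypothetical.

```python
def keep_evidence_backed(candidates, evidence_index):
    """Drop aliases without evidence, then candidates left with no evidence."""
    kept = []
    for candidate in candidates:
        aliases = [a for a in candidate["aliases"] if evidence_index.get(a)]
        has_canonical_evidence = bool(evidence_index.get(candidate["canonical"]))
        if not aliases and not has_canonical_evidence:
            continue  # nothing in the local documents supports this candidate
        kept.append({"canonical": candidate["canonical"], "aliases": aliases})
    return kept


evidence_index = {"kubelet oom": ["incident-1.md: Kubelet OOM after pg migration"]}
proposed = [
    {"canonical": "kubelet out of memory", "aliases": ["kubelet oom", "kubelet crash"]},
    {"canonical": "service mesh", "aliases": ["istio"]},
]
print(keep_evidence_backed(proposed, evidence_index))
```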
Terminology drift report schema
The core package exposes a versioned terminology drift report schema for future drift scans and governance review flows. It is intentionally data-only: creating or saving a report does not scan documents, create proposals, publish snapshots, update bindings, or mutate production runtime state.
from skeinrank import (
    DriftEvidence,
    DriftFinding,
    DriftFindingType,
    DriftSeverity,
    TerminologyDriftReport,
)
report = TerminologyDriftReport(
    profile_name="infra_incidents",
    binding_id="infra_incidents_prod",
    pinned_snapshot_version="S42",
    latest_snapshot_version="S47",
    metrics={"unknown_alias_rate": 0.118},
    findings=[
        DriftFinding(
            finding_type=DriftFindingType.ALIAS_DRIFT,
            severity=DriftSeverity.WARN,
            title="New candidate alias detected",
            value="kubelet oom",
            evidence=[
                DriftEvidence(
                    source="incident-1.md",
                    line=7,
                    text="Kubelet OOM after the node pool upgrade.",
                )
            ],
        )
    ],
)
print(report.summary().unknown_alias_rate)
print(report.to_markdown())
report.save("terminology-drift-report.json")
The schema currently covers review signals for new unmatched aliases, stale terms, binding snapshot lag, and ambiguity signals. Later scanner commands can emit this report shape while keeping the same review-first principle: detect automatically, approve manually, serve deterministically.
Attribute extraction and enrichment
The older attribute/profile API is still available for advanced local enrichment workflows.
from skeinrank import build_attribute_profile, enrich_texts
profile = build_attribute_profile(
    profile_id="company_terms",
    aliases={
        "kubernetes": ["k8s", "kube", "kuber"],
        "postgresql": ["pg", "postgres", "psql"],
    },
    slots={
        "kubernetes": "TOOL",
        "postgresql": "DB",
    },
    snapshot_version="company_terms@v1",
)
rows = enrich_texts(
    [
        {"id": "doc-1", "text": "k8s timeout after upgrade"},
        {"id": "doc-2", "text": "pg latency spike"},
    ],
    profile=profile,
)
print(rows[0]["canonical_values"])
Use this layer when you need profile templates, fuzzy alias fallback, richer passport/debug traces, or JSONL enrichment helpers.
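To give a feel for what a fuzzy alias fallback does, the sketch below tries an exact lookup first and only then falls back to the closest known alias via `difflib.get_close_matches`. The similarity measure, the `cutoff` value, and the helper name `resolve_with_fuzzy_fallback` are assumptions for illustration; the package's own fallback may work differently.

```python
from difflib import get_close_matches


def resolve_with_fuzzy_fallback(token, alias_to_canonical, cutoff=0.8):
    """Exact alias lookup first, then the closest alias above `cutoff`."""
    key = token.lower()
    if key in alias_to_canonical:
        return alias_to_canonical[key]
    close = get_close_matches(key, list(alias_to_canonical), n=1, cutoff=cutoff)
    # Return None when nothing is similar enough, so callers can skip the token.
    return alias_to_canonical[close[0]] if close else None


lookup = {"k8s": "kubernetes", "kube": "kubernetes", "pg": "postgresql", "postgres": "postgresql"}
print(resolve_with_fuzzy_fallback("PG", lookup))
print(resolve_with_fuzzy_fallback("postgers", lookup))
print(resolve_with_fuzzy_fallback("xyz", lookup))
```

A high cutoff keeps the fallback conservative, which matters for short aliases where a one-character change can point at a different term.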
Publishing checklist
The package is published through the manual publish-skeinrank-core GitHub Actions workflow. The recommended flow is:
- Build and test locally.
- Publish to TestPyPI.
- Install from TestPyPI in a clean environment.
- Publish to PyPI only after the TestPyPI smoke test passes.
Local packaging checks:
poetry install
poetry run pytest -q
poetry build
poetry run python -m pip install --upgrade twine
poetry run twine check dist/*
See docs/PUBLISHING.md for the full release checklist.
Public API policy
Only symbols re-exported from skeinrank.__init__ should be treated as stable public API. Internal modules may change without notice.
File details
Details for the file skeinrank-0.11.0.tar.gz.
File metadata
- Download URL: skeinrank-0.11.0.tar.gz
- Upload date:
- Size: 108.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7ff2fe9c76a288b146ba896912b2238f14bf1e56d0f404bf8b639ebb6dee55d |
| MD5 | e824bf5f136e00c77a9dd2c096e932ac |
| BLAKE2b-256 | a988e06ad3212286e17afb22493ef86b30f4774c090989e1bb7393c206fb1ec5 |
Provenance
The following attestation bundles were made for skeinrank-0.11.0.tar.gz:
Publisher: publish-skeinrank-core.yml on SkeinRank/skeinrank
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skeinrank-0.11.0.tar.gz
- Subject digest: f7ff2fe9c76a288b146ba896912b2238f14bf1e56d0f404bf8b639ebb6dee55d
- Sigstore transparency entry: 1754793431
- Sigstore integration time:
- Permalink: SkeinRank/skeinrank@d47566c2df3c42a009c231e060b7e3950efa6d95
- Branch / Tag: refs/heads/main
- Owner: https://github.com/SkeinRank
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-skeinrank-core.yml@d47566c2df3c42a009c231e060b7e3950efa6d95
- Trigger Event: workflow_dispatch
File details
Details for the file skeinrank-0.11.0-py3-none-any.whl.
File metadata
- Download URL: skeinrank-0.11.0-py3-none-any.whl
- Upload date:
- Size: 126.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6043e577b466665dd6e147b7e5ca56d058801ec8c9d3b28ef53e3d48f96a17f |
| MD5 | 597f5bca78af39dcc8dfd334092b712f |
| BLAKE2b-256 | cf79115536a924b6b53b173f725e1d288c687970e5eda33e6f4939aa17ddcd91 |
Provenance
The following attestation bundles were made for skeinrank-0.11.0-py3-none-any.whl:
Publisher: publish-skeinrank-core.yml on SkeinRank/skeinrank
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skeinrank-0.11.0-py3-none-any.whl
- Subject digest: e6043e577b466665dd6e147b7e5ca56d058801ec8c9d3b28ef53e3d48f96a17f
- Sigstore transparency entry: 1754793460
- Sigstore integration time:
- Permalink: SkeinRank/skeinrank@d47566c2df3c42a009c231e060b7e3950efa6d95
- Branch / Tag: refs/heads/main
- Owner: https://github.com/SkeinRank
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-skeinrank-core.yml@d47566c2df3c42a009c231e060b7e3950efa6d95
- Trigger Event: workflow_dispatch