Dictionary-first extraction, alias canonicalization, and lightweight reranking utilities for technical documents.

skeinrank-core

skeinrank-core is the stable Python core for SkeinRank.

It provides two practical capabilities:

profile-based reranking with passports and strict contracts
rule-based attribute extraction and normalization for technical text

Install

From a source checkout, install the package in editable mode:

python -m pip install -e .

Optional extras are available, but they are not required for the current MVP.

Public Python SDK API

Patch 41 adds a lightweight dictionary-first SDK API that can be used without running the governance API, Elasticsearch, Celery, or the UI. It accepts the same dictionary JSON shape exported by the User Console API and used by skeinrank-migrate.

from skeinrank import load_dictionary, extract_terms, canonicalize_text

dictionary = load_dictionary("../../examples/migration/console_dictionary.example.json")

result = extract_terms(
    "This instruction helps deploy 500 k8s servers backed by Postgres.",
    dictionary=dictionary,
)

print(result.canonical_values)  # ["kubernetes", "postgresql"]
print(result.matches[0].highlighted_fragment)

canonicalized = canonicalize_text(
    "k8s rollout uses pg database",
    dictionary=dictionary,
)
print(canonicalized.text)  # "kubernetes rollout uses postgresql database"
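
To make the canonicalization step concrete, here is a minimal standard-library sketch of whole-word alias replacement. It is not the SDK's implementation: the `ALIASES` table and the `canonicalize` helper are hypothetical names used only for illustration, and the real `canonicalize_text(...)` works from a loaded dictionary rather than a plain dict.

```python
import re

# Hypothetical alias table; the real SDK loads this from a dictionary JSON file.
ALIASES = {"k8s": "kubernetes", "pg": "postgresql", "postgres": "postgresql"}


def canonicalize(text: str, aliases: dict[str, str]) -> str:
    """Replace whole-word aliases with their canonical values."""
    # Try longer aliases first so a long alias is preferred over a shorter
    # alias that happens to share its prefix.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Alias keys are lowercase, so lowercase the matched text before lookup.
    return pattern.sub(lambda m: aliases[m.group(1).lower()], text)


print(canonicalize("k8s rollout uses pg database", ALIASES))
```

The word-boundary lookarounds keep an alias such as pg from rewriting part of a longer token such as pgbouncer.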

The stable SDK exports:

Dictionary, DictionaryTerm, DictionaryAlias, DictionaryStopListEntry
load_dictionary(...)
validate_dictionary(...)
extract_terms(...)
canonicalize_text(...)
ExtractionResult, TermMatch, CanonicalizedText

The SDK matcher is deterministic and local. It honors active/deprecated term and alias statuses as well as profile/global stop lists, returns character offsets for each match, and includes evidence snippets with <mark>...</mark> highlights.
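
The behaviour described above can be pictured with a small standard-library sketch. Everything in it is an assumption made for illustration: the `Match` dataclass, the `TERMS` and `STOP_LIST` tables, the `find_matches` helper, and in particular the rule that only active aliases produce matches. The SDK's own types (`TermMatch`, `ExtractionResult`) and status handling may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    alias: str
    canonical: str
    start: int
    end: int
    snippet: str


# Hypothetical in-memory dictionary: alias -> (canonical value, status).
TERMS = {
    "k8s": ("kubernetes", "active"),
    "postgres": ("postgresql", "active"),
    "kube": ("kubernetes", "deprecated"),
}
STOP_LIST = {"servers"}  # hypothetical stop-list entries that never match


def find_matches(text, terms, stop_list, context_chars=20):
    """Return matches with offsets and <mark>-highlighted evidence snippets."""
    matches = []
    for m in re.finditer(r"\w+", text):
        token = m.group(0).lower()
        if token in stop_list or token not in terms:
            continue
        canonical, status = terms[token]
        if status != "active":  # assumption: only active aliases match
            continue
        start, end = m.span()
        left = text[max(0, start - context_chars):start]
        right = text[end:end + context_chars]
        snippet = f"{left}<mark>{text[start:end]}</mark>{right}"
        matches.append(Match(token, canonical, start, end, snippet))
    return matches


text = "This instruction helps deploy 500 k8s servers backed by Postgres."
for match in find_matches(text, TERMS, STOP_LIST):
    print(match.canonical, match.start, match.end, match.snippet)
```

Scanning tokens left to right and looking each one up in a fixed table gives the same output for the same input, which is the sense in which such a matcher is deterministic.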

Document text extraction utilities

Patch 42 adds lightweight local helpers for extracting text from common document files before running the public SDK matcher. These helpers do not require the governance API, Elasticsearch, Celery, or a database.

from skeinrank import load_dictionary, load_document_text, extract_terms_from_document

dictionary = load_dictionary("../../examples/migration/console_dictionary.example.json")

# Extract the document text on its own, for example to inspect it before matching.
text = load_document_text("incident-runbook.md")

# Or extract the text and match dictionary terms in a single call.
result = extract_terms_from_document(
    "incident-runbook.md",
    dictionary=dictionary,
)

print(result.document.file_name)
print(result.extraction.canonical_values)

Supported formats without extra dependencies:

text-like files: .txt, .md, .rst, .log, .csv, .tsv, .json, .jsonl, .yaml, .yml
.html / .htm with scripts/styles ignored
.docx via a small stdlib ZIP/XML reader
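
As a rough illustration of how the HTML and .docx cases in the list above can be handled with the standard library alone, consider the sketch below. The helper names `html_to_text` and `docx_to_text` are hypothetical and are not part of the skeinrank API; the package's own readers may treat whitespace, tables, and headers differently.

```python
import xml.etree.ElementTree as ET
import zipfile
from html.parser import HTMLParser


class _TextOnlyParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path: str) -> str:
    """Read paragraph text from word/document.xml inside the .docx ZIP."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


print(html_to_text("<p>k8s rollout</p><script>var x = 1;</script>"))
```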

PDF extraction is supported when the caller installs pypdf in the environment. The core package does not require it by default so the SDK stays lightweight.
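
One common way to keep a dependency such as pypdf optional is to import it only when a PDF is actually read. The sketch below shows that pattern; the `extract_pdf_text` helper and its error message are hypothetical and do not describe how skeinrank itself wires in pypdf. The pypdf calls used (`PdfReader`, `reader.pages`, `page.extract_text()`) are part of pypdf's public API.

```python
def extract_pdf_text(path: str) -> str:
    """Extract text from a PDF, importing pypdf only when it is needed."""
    try:
        from pypdf import PdfReader  # optional dependency
    except ImportError as exc:
        raise RuntimeError(
            "PDF extraction needs the optional 'pypdf' package: pip install pypdf"
        ) from exc
    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

With this layout, importing the module never fails on a machine without pypdf; only the PDF code path reports the missing package.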

Stable public exports include:

DocumentText, DocumentExtractionResult, DocumentExtractionError
load_document_text(...)
extract_document_text(...)
extract_terms_from_document(...)

Local dictionary extraction CLI

Patch 43 adds a lightweight skeinrank CLI for local dictionary validation, text/document extraction, and canonicalization. It uses only the public SDK/document helpers and does not require the governance API, Elasticsearch, Celery, RabbitMQ, or a database.

Validate a dictionary exported from the Console API or used by skeinrank-migrate:

poetry run skeinrank validate-dictionary ../../examples/migration/console_dictionary.example.json
poetry run skeinrank validate-dictionary ../../examples/migration/console_dictionary.example.json --json

Extract canonical terms from raw text:

poetry run skeinrank extract "k8s rollout uses pg database" \
  --text \
  --dictionary ../../examples/migration/console_dictionary.example.json

Extract canonical terms from a supported local document:

poetry run skeinrank extract incident-runbook.md \
  --dictionary ../../examples/migration/console_dictionary.example.json

Canonicalize raw text or document text:

poetry run skeinrank canonicalize "k8s rollout uses pg database" \
  --text \
  --dictionary ../../examples/migration/console_dictionary.example.json

poetry run skeinrank canonicalize incident-runbook.md \
  --dictionary ../../examples/migration/console_dictionary.example.json \
  --output incident-runbook.canonicalized.txt

Extract plain text from a document before matching:

poetry run skeinrank document-text incident-runbook.docx --output incident-runbook.txt

The CLI emits JSON for extract and raw text by default for canonicalize and document-text. Where relevant, commands support --output, --compact, --max-matches, and --context-chars.

PyPI/TestPyPI publishing

Patch 44 adds publishing polish for the lightweight skeinrank package. The recommended flow is:

Build and test locally.
Publish to TestPyPI.
Install from TestPyPI in a clean environment.
Publish to PyPI only after the TestPyPI smoke test passes.

Local packaging checks:

poetry install
poetry run pytest -q
poetry build
poetry run python -m pip install --upgrade twine
poetry run twine check dist/*

The manual GitHub Actions workflow is publish-skeinrank-core. It defaults to dry_run=true, supports testpypi and pypi targets, and uses PyPI Trusted Publishing for the actual upload step.

PDF extraction support stays optional. Install pypdf separately when needed:

pip install pypdf

See docs/PUBLISHING.md for the full release checklist.

Minimal attribute extraction example

from skeinrank import extract_attributes

pack = extract_attributes(
    "K8s api-server crashloop on version 1.28",
    profile="default_it",
    debug=True,
)

for item in pack.attributes:
    print(item.slot, item.value)

print(pack.passport)

Custom terminology profile

You can build a profile directly in Python:

from skeinrank import build_attribute_profile, extract_attributes

profile = build_attribute_profile(
    profile_id="company_terms",
    aliases={
        "kubernetes": ["k8s", "kube", "kuber"],
        "postgresql": ["pg", "postgres"],
    },
    slots={
        "kubernetes": "TOOL",
        "postgresql": "DB",
    },
    snapshot_version="company_terms@v1",
)

pack = extract_attributes("kuber timeout on pg", profile=profile)

Or create a starter profile and load it as a JSON snapshot:

poetry run skeinrank-init-profile company_terms.json
poetry run skeinrank-validate-profile company_terms.json

from skeinrank import extract_attributes, load_attribute_profile

profile = load_attribute_profile("company_terms.json")
pack = extract_attributes("kuber timeout on pg", profile=profile)

A profile file can use the compact grouped alias format:

{
  "profile_id": "company_terms",
  "snapshot": {
    "version": "company_terms@v1",
    "source": "file"
  },
  "aliases": [
    {
      "slot": "TOOL",
      "canonical": "kubernetes",
      "aliases": ["k8s", "kube", "kuber"]
    },
    {
      "slot": "DB",
      "canonical": "postgresql",
      "aliases": ["pg", "postgres", "psql"]
    }
  ],
  "rules": []
}

The CLI accepts the same file with --profile-file.

Validate a profile before using it for extraction or enrichment:

poetry run skeinrank-validate-profile company_terms.json
poetry run skeinrank-validate-profile company_terms.json --json
poetry run skeinrank-validate-profile company_terms.json --strict
# Optional: customize short-alias warning threshold
poetry run skeinrank-validate-profile company_terms.json --min-short-alias-length 4

The validation report catches fatal alias collisions and warns about aliases that are likely to hurt retrieval quality, such as overly generic terms (api, service, app) or very short aliases (pg, go, js). It also validates governance statuses (active, deprecated, pending, ambiguous, disabled, rejected) and can elevate warnings to errors in --strict mode before publishing a snapshot.
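
The checks described above can be sketched against the grouped alias format in a few lines. This is an illustration under stated assumptions, not the validator shipped with the package: `validate_profile` and `GENERIC_TERMS` are hypothetical names, the default threshold of 3 is a guess, and governance status validation is left out.

```python
GENERIC_TERMS = {"api", "service", "app"}  # the generic examples named above


def validate_profile(profile: dict, min_short_alias_length: int = 3, strict: bool = False):
    """Return (errors, warnings) for a grouped-alias profile dict."""
    errors, warnings = [], []
    owner = {}  # alias -> canonical value that first claimed it
    for group in profile.get("aliases", []):
        canonical = group["canonical"]
        for alias in group.get("aliases", []):
            key = alias.lower()
            if key in owner and owner[key] != canonical:
                errors.append(
                    f"alias collision: {alias!r} maps to {owner[key]!r} and {canonical!r}"
                )
            owner.setdefault(key, canonical)
            if key in GENERIC_TERMS:
                warnings.append(f"generic alias: {alias!r}")
            if len(key) < min_short_alias_length:
                warnings.append(f"short alias: {alias!r}")
    if strict:  # strict mode: every warning becomes an error
        errors, warnings = errors + warnings, []
    return errors, warnings


profile = {
    "profile_id": "company_terms",
    "aliases": [
        {"slot": "TOOL", "canonical": "kubernetes", "aliases": ["k8s", "kube"]},
        {"slot": "DB", "canonical": "postgresql", "aliases": ["pg", "kube", "api"]},
    ],
}
errors, warnings = validate_profile(profile)
print(errors)
print(warnings)
```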

Optional fuzzy alias fallback

Exact alias matching is the default. Enable fuzzy fallback only when you want to catch typo-like terms:

from skeinrank import extract_attributes, load_attribute_profile

profile = load_attribute_profile("company_terms.json")
pack = extract_attributes(
    "kubernets timeout on postgress",
    profile=profile,
    enable_fuzzy=True,
    fuzzy_threshold=0.88,
)

The same option is available in CLI commands:

poetry run skeinrank-extract \
  --text "kubernets timeout on postgress" \
  --profile-file company_terms.json \
  --enable-fuzzy \
  --fuzzy-threshold 0.88

Fuzzy matching is intentionally conservative: it is disabled by default, ignores short aliases by default, and marks matches as fuzzy_alias in attributes/passport output.
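
A conservative fuzzy fallback of this kind can be approximated with `difflib.SequenceMatcher` from the standard library. The similarity measure skeinrank actually uses is not stated here, so treat the scoring, the `fuzzy_lookup` helper, and the `min_alias_length` default as assumptions; only the 0.88 threshold comes from the example above.

```python
from difflib import SequenceMatcher

# Hypothetical lookup table: alias or canonical spelling -> canonical value.
ALIAS_TO_CANONICAL = {
    "kubernetes": "kubernetes",
    "k8s": "kubernetes",
    "kube": "kubernetes",
    "postgresql": "postgresql",
    "postgres": "postgresql",
    "pg": "postgresql",
}


def fuzzy_lookup(token, alias_to_canonical, threshold=0.88, min_alias_length=5):
    """Return the canonical value of the closest alias, or None."""
    token = token.lower()
    best_score, best_canonical = 0.0, None
    for alias, canonical in alias_to_canonical.items():
        if len(alias) < min_alias_length:
            continue  # short aliases are too easy to hit by accident
        score = SequenceMatcher(None, token, alias).ratio()
        if score >= threshold and score > best_score:
            best_score, best_canonical = score, canonical
    return best_canonical


for word in "kubernets timeout on postgress".split():
    print(word, fuzzy_lookup(word, ALIAS_TO_CANONICAL))
```

Skipping short aliases matters because a two- or three-character alias is within one edit of many unrelated words.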

High-level Python enrichment

Use enrich_texts(...) when you want to process a small in-memory corpus without writing the extraction loop yourself.

from skeinrank import build_attribute_profile, enrich_texts

profile = build_attribute_profile(
    profile_id="company_terms",
    aliases={
        "kubernetes": ["k8s", "kube", "kuber"],
        "postgresql": ["pg", "postgres", "psql"],
    },
    slots={
        "kubernetes": "TOOL",
        "postgresql": "DB",
    },
    snapshot_version="company_terms@v1",
)

rows = enrich_texts(
    [
        {"id": "doc-1", "text": "k8s timeout after upgrade"},
        {"id": "doc-2", "text": "pg latency spike"},
    ],
    profile=profile,
)

print(rows[0]["canonical_values"])  # ["kubernetes"]

By default, each result row is compact and search-friendly, containing canonical_values, slots, snapshot_version, and alias_matcher_backend. Pass include_attributes=True or include_passport=True when you need explainability or debug output.
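
For readers who want to see what such an enrichment loop involves, here is a standard-library sketch. It is not `enrich_texts(...)` itself: the `enrich_rows` helper is hypothetical, the shape of the `slots` field (slot name mapped to canonical values) is an assumption, and `alias_matcher_backend` is omitted because its values are not documented here.

```python
import re

# Hypothetical lookup tables mirroring the profile built above.
ALIASES = {"k8s": "kubernetes", "kube": "kubernetes", "kuber": "kubernetes",
           "pg": "postgresql", "postgres": "postgresql", "psql": "postgresql"}
SLOTS = {"kubernetes": "TOOL", "postgresql": "DB"}


def enrich_rows(rows, aliases, slots, snapshot_version):
    """Attach canonical values and slot groupings to each input row."""
    enriched = []
    for row in rows:
        canonical_values = []
        for token in re.findall(r"\w+", row["text"].lower()):
            canonical = aliases.get(token)
            if canonical and canonical not in canonical_values:
                canonical_values.append(canonical)
        by_slot = {}
        for value in canonical_values:
            by_slot.setdefault(slots[value], []).append(value)
        enriched.append({
            "id": row["id"],
            "canonical_values": canonical_values,
            "slots": by_slot,  # assumed shape: slot name -> canonical values
            "snapshot_version": snapshot_version,
        })
    return enriched


demo_rows = enrich_rows(
    [
        {"id": "doc-1", "text": "k8s timeout after upgrade"},
        {"id": "doc-2", "text": "pg latency spike"},
    ],
    ALIASES,
    SLOTS,
    "company_terms@v1",
)
print(demo_rows[0]["canonical_values"])
```

Carrying the snapshot version on every row makes it possible to tell later which terminology snapshot produced a given enrichment.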

Batch enrichment and demo eval

The core package also ships helper functions and product-friendly CLI entrypoints for demo workflows.

From a source checkout:

poetry run skeinrank-extract --text "kube api timeout" --debug

poetry run skeinrank-enrich-jsonl \
  ../../examples/demo/demo_documents.jsonl \
  ../../examples/demo/demo_enriched_documents.jsonl

poetry run skeinrank-eval-demo \
  ../../examples/demo/demo_queries.jsonl \
  ../../examples/demo/demo_enriched_documents.jsonl \
  --out ../../examples/demo/demo_eval_results.json

What the default attribute profile does

The default file-based profile lives under skeinrank/attributes/config/default_it.json and currently implements:

canonical alias mapping (for example k8s -> kubernetes)
versioned snapshot metadata for repeatable enrichment runs
Aho-Corasick alias matching for fast in-memory runtime lookup
regex/rule extraction for selected slots
slot limits and total limits
explainable passport/debug traces
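
Aho-Corasick matching, mentioned in the list above, finds every occurrence of many patterns in a single pass over the text. The following is a compact pure-Python sketch of the textbook algorithm, not the package's own matcher; `build_automaton` and `find_all` are illustrative names, and the input is assumed to be text whose length does not change when lowercased.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of lowercase patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit patterns that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Yield (start, end, pattern) for every pattern occurrence in text."""
    goto, fail, output = automaton
    state = 0
    for index, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            yield index - len(pattern) + 1, index + 1, pattern


automaton = build_automaton(["k8s", "kube", "kubernetes", "pg"])
print(list(find_all("kube and k8s on pg", automaton)))
```

Note that the automaton reports substring hits; an alias matcher built on it would still need a word-boundary check so that pg inside a longer token is not counted.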

Run tests from a source checkout

poetry run pytest -q

Public API

Only symbols re-exported from skeinrank.__init__ should be treated as stable public API.

Release history

0.11.0 (Jun 8, 2026)
0.10.0 (Jun 7, 2026)
0.0.16 (May 10, 2026), this version
0.0.1 (Jan 5, 2026)

Download files

Source distribution: skeinrank-0.0.16.tar.gz (63.9 kB), uploaded May 10, 2026
Built distribution: skeinrank-0.0.16-py3-none-any.whl (75.2 kB), uploaded May 10, 2026, Python 3

File details

Details for the file skeinrank-0.0.16.tar.gz.

File metadata

Download URL: skeinrank-0.0.16.tar.gz
Upload date: May 10, 2026
Size: 63.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for skeinrank-0.0.16.tar.gz
Algorithm	Hash digest
SHA256	`913e61580ad78be72599b6a33fc628a128b54a41ad456ef6f15347f01ec1b897`
MD5	`9ca747f93553e3a5cac0634be856c833`
BLAKE2b-256	`0462f35d7239b8ddb0ff247b0b3fd3842835a390c64f3b401f5fd574fb8c6361`
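
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch: `sha256_of` is a hypothetical helper name, and the commented-out check assumes the archive has been saved in the current directory.

```python
import hashlib

# SHA256 digest published above for skeinrank-0.0.16.tar.gz.
EXPECTED_SHA256 = "913e61580ad78be72599b6a33fc628a128b54a41ad456ef6f15347f01ec1b897"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example check, assuming the archive is in the current directory:
# assert sha256_of("skeinrank-0.0.16.tar.gz") == EXPECTED_SHA256
```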


Provenance

The following attestation bundles were made for skeinrank-0.0.16.tar.gz:

Publisher: publish-skeinrank-core.yml on SkeinRank/skeinrank

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skeinrank-0.0.16.tar.gz
- Subject digest: 913e61580ad78be72599b6a33fc628a128b54a41ad456ef6f15347f01ec1b897
- Sigstore transparency entry: 1493387372
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: SkeinRank/skeinrank@6948a9131648c9ca1b142acb6261ff57d6985026
- Branch / Tag: refs/heads/main
- Owner: https://github.com/SkeinRank
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-skeinrank-core.yml@6948a9131648c9ca1b142acb6261ff57d6985026
- Trigger Event: workflow_dispatch

File details

Details for the file skeinrank-0.0.16-py3-none-any.whl.

File metadata

Download URL: skeinrank-0.0.16-py3-none-any.whl
Upload date: May 10, 2026
Size: 75.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for skeinrank-0.0.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e7dbc70e64d05d78a041032e5c3dd9145e57e0711f22cc625a1b511c6c20791`
MD5	`3921161c2122c38b9aeee4e09746d6ec`
BLAKE2b-256	`c01d141cab0103fd0140c2e1b3719d1c4c1b4ecfe65d2ae247248cad0e557227`


Provenance

The following attestation bundles were made for skeinrank-0.0.16-py3-none-any.whl:

Publisher: publish-skeinrank-core.yml on SkeinRank/skeinrank

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skeinrank-0.0.16-py3-none-any.whl
- Subject digest: 3e7dbc70e64d05d78a041032e5c3dd9145e57e0711f22cc625a1b511c6c20791
- Sigstore transparency entry: 1493387441
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: SkeinRank/skeinrank@6948a9131648c9ca1b142acb6261ff57d6985026
- Branch / Tag: refs/heads/main
- Owner: https://github.com/SkeinRank
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-skeinrank-core.yml@6948a9131648c9ca1b142acb6261ff57d6985026
- Trigger Event: workflow_dispatch
