termlint

Terminology linter for projects — extracts terms from code/docs and verifies coverage against your glossary/ontology.

What Is termlint?

termlint is a CLI tool for terminology quality checks in text/documentation workflows.

  • extracts term candidates from text
  • verifies terms against your glossary (exact/fuzzy)
  • generates JSON reports (verification, ontology_update, quality_gate, extraction)
  • helps bootstrap and evolve glossaries (glossary from-report, glossary merge)

Concept

Raw Text → Parallel Extractors → Async Pipeline → Glossary Match → Quality Report
  ↓        (rules,cvalue,keybert)   (norm,filter,rank)     ↓
TextEntityStream ────────────────────────→ Coverage 90%

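An async functional pipeline with composable stages and a universal TextEntity model. The diagram shows the target design; the stages that are implemented today are listed under Alpha Status.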
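
To make the idea of composable async stages concrete, here is a minimal sketch in plain Python. It is not termlint's actual code: the TextEntity fields, the stage names (normalize, drop_short) and the compose helper are all assumptions made for illustration, and the extractor is a toy regex rather than the spaCy-based RuleExtractor.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import AsyncIterator, Callable


@dataclass(frozen=True)
class TextEntity:
    """Hypothetical stand-in for termlint's TextEntity model."""
    text: str
    start: int
    end: int


# A stage takes a stream of entities and returns a new stream.
Stage = Callable[[AsyncIterator[TextEntity]], AsyncIterator[TextEntity]]


async def extract(text: str) -> AsyncIterator[TextEntity]:
    """Toy extractor: yields every alphabetic word as a candidate."""
    for match in re.finditer(r"[A-Za-z]+", text):
        yield TextEntity(match.group(), match.start(), match.end())


async def normalize(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
    """Lowercase each candidate, keeping its original offsets."""
    async for entity in stream:
        yield TextEntity(entity.text.lower(), entity.start, entity.end)


async def drop_short(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
    """Discard candidates of three characters or fewer."""
    async for entity in stream:
        if len(entity.text) > 3:
            yield entity


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single stage."""
    def run(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


async def main() -> list[str]:
    pipeline = compose(normalize, drop_short)
    return [e.text async for e in pipeline(extract("Machine learning is fun"))]


print(asyncio.run(main()))  # ['machine', 'learning']
```

Because every stage has the same shape (stream in, stream out), stages can be added, removed or reordered without touching the others.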

Alpha Status

termlint is currently in alpha.

Implemented and supported now:

  • rule-based extraction (RuleExtractor / spaCy)
  • verification: exact, fuzzy
  • report export: JSON (extraction, verification, ontology_update, quality_gate)
  • glossary tooling: glossary from-report, glossary merge

Planned / not implemented yet:

  • extractors: CValue, KeyBERT
  • processing stages: filter, rank
  • verification stages: semantic, ensemble
  • exporters: HTML, JUnit

Compatibility Matrix

Dimension               | Current support
------------------------|----------------------------------------------------------------------------
OS                      | Linux, macOS, Windows (CLI, JSON workflows)
Python                  | 3.12.x
Required extras         | termlint[base]
Core deps from extras   | spacy, rapidfuzz
Default spaCy model     | ru_core_news_sm
Console/output language | English-only CLI and report metadata
Tested text languages   | Russian (ru_core_news_sm), English (en_core_web_sm)
Other languages         | Possible via rules.model, but not yet validated in the alpha test matrix

Language Support Policy

  • The termlint pipeline is language-agnostic by design, but extraction quality depends on the selected spaCy model.
  • Officially tested in alpha:
    • Russian with ru_core_news_sm
    • English with en_core_web_sm
  • Other spaCy language models can be used via [tool.termlint.extraction.rules].model, but should be treated as experimental until formally tested.
  • CLI messages and generated report metadata are in English.

Quick Start

  1. Install:
# Recommended for CLI usage (isolated global tool)
pipx install "termlint[base]"

# Alternative: project/venv install
pip install --pre "termlint[base]"

# Install spaCy model into the same environment
python -m spacy download en_core_web_sm

For pipx, install model inside the pipx environment:

pipx runpip termlint install en-core-web-sm
# or for Russian
pipx runpip termlint install ru-core-news-sm
  2. Create a minimal glossary (glossary.json):
[
  { "id": "ml:001", "label": "machine learning", "synonyms": ["ML"] },
  { "id": "ml:002", "label": "artificial intelligence", "synonyms": ["AI"] }
]
  3. Create an input text file (input.txt):
Artificial intelligence and machine learning are used in data analytics.
  4. Run verification:
termlint verify input.txt --source glossary.json --verifier fuzzy --threshold 85
  5. Expected output (example):
Files     ... 100%
✅ input.txt ... 100%
📊 Coverage: 33.3% (2/6)
⚠️  Quality Gate would FAIL in CI mode

Generated reports:

  • reports/verification.json
  • reports/ontology_update.json
  • reports/quality_gate.json

Exit behavior:

  • verify typically exits 0 on a successful run (even if the quality gate would fail in CI mode)
  • ci exits 1 when quality gates fail
  • full contract is listed in Exit Codes
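
The Coverage line in the example output is the share of extracted candidates that matched a glossary label or synonym. The sketch below shows one way such a number can come about. It is an illustration only: it uses the standard library's difflib as a stand-in for rapidfuzz (scores will differ from the real verifier), and the six candidate terms are hypothetical, since the actual extraction output for input.txt is not shown above.

```python
from difflib import SequenceMatcher


def best_score(term: str, names: list[str]) -> float:
    """Best similarity between a term and any glossary name, on a 0-100 scale."""
    lowered = term.lower()
    return max(100 * SequenceMatcher(None, lowered, n.lower()).ratio() for n in names)


def coverage(candidates: list[str], glossary: list[dict], threshold: float = 85.0):
    """Return (matched candidates, coverage in percent)."""
    # Labels and synonyms are both valid match targets.
    names = [n for entry in glossary for n in [entry["label"], *entry.get("synonyms", [])]]
    matched = [c for c in candidates if best_score(c, names) >= threshold]
    percent = 100 * len(matched) / len(candidates) if candidates else 0.0
    return matched, percent


glossary = [
    {"id": "ml:001", "label": "machine learning", "synonyms": ["ML"]},
    {"id": "ml:002", "label": "artificial intelligence", "synonyms": ["AI"]},
]

# Hypothetical candidate list; the real extractor may produce different terms.
candidates = [
    "Artificial intelligence", "machine learning", "data analytics",
    "data", "analytics", "intelligence",
]

matched, percent = coverage(candidates, glossary)
print(f"Coverage: {percent:.1f}% ({len(matched)}/{len(candidates)})")
# Coverage: 33.3% (2/6)
```

Lowering the threshold admits looser matches and raises coverage, at the cost of more false positives.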

Glossary JSON Schema

termlint expects a glossary file as a JSON array of objects.

Required fields per entity:

  • id (string)
  • label (string)

Optional fields:

  • synonyms (string[], default [])
  • relations (object<string, string[]>, default {})
  • definition (string | null)
  • source (string | null)

Minimal valid example:

[
  {
    "id": "ml:001",
    "label": "machine learning"
  }
]

Extended example:

[
  {
    "id": "ml:001",
    "label": "machine learning",
    "synonyms": ["ML"],
    "relations": {
      "related_to": ["ml:002"]
    },
    "definition": "Field focused on learning patterns from data.",
    "source": "internal-glossary"
  }
]
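
The schema above can be mirrored in a few lines of Python, which is handy for checking a glossary file before passing it to termlint. This is a sketch based only on the field list in this section: GlossaryEntry and parse_glossary are hypothetical names, and the error messages are not the ones termlint itself prints.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlossaryEntry:
    """Hypothetical model mirroring the documented glossary fields."""
    id: str
    label: str
    synonyms: list[str] = field(default_factory=list)
    relations: dict[str, list[str]] = field(default_factory=dict)
    definition: Optional[str] = None
    source: Optional[str] = None


def parse_glossary(raw: str) -> list[GlossaryEntry]:
    """Parse a glossary JSON string and check the required fields."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("glossary must be a JSON array of objects")
    entries = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"entry {index} is not an object")
        for key in ("id", "label"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"entry {index}: '{key}' must be a string")
        try:
            entries.append(GlossaryEntry(**item))
        except TypeError as exc:  # unknown field name
            raise ValueError(f"entry {index}: {exc}") from exc
    return entries


entries = parse_glossary('[{"id": "ml:001", "label": "machine learning"}]')
print(entries[0].synonyms, entries[0].relations)  # [] {}
```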

Common validation/runtime errors:

  • File not found: Glossary file not found: <path>
  • Invalid JSON syntax: Invalid JSON in <path>: ...
  • Invalid entity shape/type: Failed to initialize glossary source '<path>': ...

Glossary Tooling

Create a glossary from an ontology_update report:

termlint glossary from-report \
  --report reports/ontology_update.json \
  --out glossary.generated.json \
  --min-score 0.7 \
  --min-frequency 1 \
  --namespace auto

Merge the generated glossary into an existing glossary:

termlint glossary merge \
  --base glossary.json \
  --updates glossary.generated.json \
  --out glossary.merged.json \
  --on-match merge-synonyms \
  --on-conflict report \
  --conflicts-out merge.conflicts.json \
  --summary-out merge.summary.json
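
The exact semantics of --on-match merge-synonyms and --on-conflict report are not spelled out here, so the following is only a guess at the general idea, written as a sketch. It assumes entries are matched by id, that a match with the same label unions the synonym lists, and that a match with a different label is reported as a conflict while the base entry is kept. The function name merge_glossaries and the shape of the conflict records are hypothetical.

```python
def merge_glossaries(base: list[dict], updates: list[dict]):
    """Return (merged entries, conflicts) without mutating the inputs."""
    merged = {entry["id"]: dict(entry) for entry in base}
    conflicts = []
    for update in updates:
        current = merged.get(update["id"])
        if current is None:
            # New id: take the update as-is.
            merged[update["id"]] = dict(update)
        elif current["label"].lower() == update["label"].lower():
            # Same entry: union synonyms, preserving order.
            synonyms = list(current.get("synonyms", []))
            for synonym in update.get("synonyms", []):
                if synonym not in synonyms:
                    synonyms.append(synonym)
            current["synonyms"] = synonyms
        else:
            # Same id, different label: keep base, report the clash.
            conflicts.append({
                "id": update["id"],
                "base_label": current["label"],
                "update_label": update["label"],
            })
    return list(merged.values()), conflicts


base = [{"id": "ml:001", "label": "machine learning", "synonyms": ["ML"]}]
updates = [
    {"id": "ml:001", "label": "machine learning", "synonyms": ["ML", "statistical learning"]},
    {"id": "ml:002", "label": "artificial intelligence"},
]
merged, conflicts = merge_glossaries(base, updates)
print(len(merged), merged[0]["synonyms"], conflicts)
# 2 ['ML', 'statistical learning'] []
```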

Development

poetry config virtualenvs.in-project true --local
poetry env use python3.12
poetry install --with dev --extras "base"

Logging

termlint follows common linter-style verbosity controls:

termlint -v verify <file>        # INFO logs
termlint -vv verify <file>       # DEBUG logs
termlint -q verify <file>        # ERROR only
termlint --log-level DEBUG verify <file>
termlint --log-file reports/termlint.log verify <file>
termlint --config ./pyproject.toml verify <file> --source ./glossary.json

You can also set defaults in pyproject.toml:

[tool.termlint.logging]
level = "WARNING"
log_file = "reports/termlint.log"
fmt = "%(asctime)s [%(name)s] %(levelname)-8s %(message)s"
datefmt = "%Y-%m-%d %H:%M:%S"
max_bytes = 10485760
backup_count = 5
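
The max_bytes and backup_count keys suggest size-based log rotation (10485760 bytes is 10 MiB). As a sketch of how such settings map onto Python's standard logging module, the snippet below builds a RotatingFileHandler from an equivalent dict. Whether termlint wires its logging exactly this way is an assumption; LOGGING_DEFAULTS and build_file_handler are hypothetical names.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Mirrors the [tool.termlint.logging] table shown above.
LOGGING_DEFAULTS = {
    "level": "WARNING",
    "log_file": "reports/termlint.log",
    "fmt": "%(asctime)s [%(name)s] %(levelname)-8s %(message)s",
    "datefmt": "%Y-%m-%d %H:%M:%S",
    "max_bytes": 10485760,
    "backup_count": 5,
}


def build_file_handler(cfg: dict) -> RotatingFileHandler:
    """Create a rotating file handler from a logging config dict."""
    path = Path(cfg["log_file"])
    path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        path,
        maxBytes=cfg["max_bytes"],        # rotate once the file reaches this size
        backupCount=cfg["backup_count"],  # keep this many rotated files
        encoding="utf-8",
    )
    handler.setLevel(cfg["level"])
    handler.setFormatter(logging.Formatter(cfg["fmt"], cfg["datefmt"]))
    return handler
```

With these values, at most six files exist at any time: the active log plus five rotated backups.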

Automatic spaCy model download is disabled by default during lint runs. Configure extraction as follows:

[tool.termlint.extraction]
extractors = ["rule"]
rules = { model = "en_core_web_sm", auto_download_model = false }

Set auto_download_model = true only if you explicitly want runtime model download (not recommended for CI).

Config Discovery

Config lookup order:

  1. --config <PATH>
  2. nearest pyproject.toml (searching upward from current directory), section [tool.termlint]
  3. user-level config:
    • $XDG_CONFIG_HOME/termlint/config.toml (if set)
    • ~/.config/termlint/config.toml
    • %APPDATA%/termlint/config.toml (Windows)
    • ~/.termlint/config.toml
  4. built-in defaults

User-level config may use either:

  • [tool.termlint] (same as project config)
  • [termlint] (short form for standalone user config files)
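
The lookup order above can be sketched as a small resolver. This is an illustration of the documented order, not termlint's implementation: find_config and user_config_candidates are hypothetical names, and the sketch only checks that a file exists, without inspecting whether it actually contains a [tool.termlint] or [termlint] section.

```python
from pathlib import Path
from typing import Mapping, Optional


def user_config_candidates(env: Mapping[str, str], home: Path) -> list[Path]:
    """User-level config locations, in the documented order."""
    candidates = []
    if env.get("XDG_CONFIG_HOME"):
        candidates.append(Path(env["XDG_CONFIG_HOME"]) / "termlint" / "config.toml")
    candidates.append(home / ".config" / "termlint" / "config.toml")
    if env.get("APPDATA"):  # Windows
        candidates.append(Path(env["APPDATA"]) / "termlint" / "config.toml")
    candidates.append(home / ".termlint" / "config.toml")
    return candidates


def find_config(
    explicit: Optional[str],
    start: Path,
    env: Mapping[str, str],
    home: Path,
) -> Optional[Path]:
    """Return the config path to use, or None for built-in defaults."""
    # 1. --config <PATH> always wins.
    if explicit is not None:
        return Path(explicit)
    # 2. Nearest pyproject.toml, searching upward from the start directory.
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / "pyproject.toml"
        if candidate.is_file():
            return candidate
    # 3. First user-level config that exists.
    for candidate in user_config_candidates(env, home):
        if candidate.is_file():
            return candidate
    # 4. Nothing found: fall back to built-in defaults.
    return None
```

A typical call would be find_config(None, Path.cwd(), os.environ, Path.home()), after importing os.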

Exit Codes

termlint uses a stable exit code contract:

  • 0: successful run
  • 1: quality gate failed (ci command)
  • 2: usage/configuration error (invalid options/config/source)
  • 3: internal pipeline/runtime error
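
A CI wrapper script can use this contract to tell a failed quality gate apart from a broken setup. The sketch below encodes the four documented codes; the ExitCode enum and the classify helper are hypothetical names, not part of termlint's API.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """The documented termlint exit codes."""
    OK = 0
    GATE_FAILED = 1
    USAGE_ERROR = 2
    INTERNAL_ERROR = 3


def classify(returncode: int) -> str:
    """Turn a process return code into a short human-readable verdict."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return f"unexpected exit code {returncode}"
    return {
        ExitCode.OK: "passed",
        ExitCode.GATE_FAILED: "quality gate failed: fix terminology or update the glossary",
        ExitCode.USAGE_ERROR: "usage or configuration error: check options, config and source",
        ExitCode.INTERNAL_ERROR: "internal error: rerun with -vv and inspect the logs",
    }[code]


for code in (0, 1, 2, 3):
    print(code, "->", classify(code))
```

In a wrapper, the return code would come from something like subprocess.run(cmd).returncode, where cmd is the termlint command line used in the CI job.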
