Terminology linter for domain-specific text

termlint

Terminology linter for projects — extracts terms from code/docs and verifies coverage against your glossary/ontology.

What Is termlint?

termlint is a CLI tool for terminology quality checks in text/documentation workflows.

  • extracts term candidates from text
  • verifies terms against your glossary (exact/fuzzy)
  • generates JSON reports (verification, ontology_update, quality_gate, extraction)
  • helps bootstrap and evolve glossaries (glossary from-report, glossary merge)

Concept

Raw Text → Parallel Extractors → Async Pipeline → Glossary Match → Quality Report
  ↓        (rules,cvalue,keybert)   (norm,filter,rank)     ↓
TextEntityStream ────────────────────────→ Coverage 90%

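termlint is designed as an asynchronous functional pipeline: composable stages pass a stream of entities to one another, and every stage works on the same universal TextEntity model.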
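To make the idea of composable async stages concrete, here is a minimal sketch in plain Python. It is not termlint's implementation: the TextEntity fields, the stage names (normalize, drop_duplicates) and the compose helper are all hypothetical and only illustrate how stages over a shared entity type can be chained.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Iterable


@dataclass(frozen=True)
class TextEntity:
    """Hypothetical stand-in for the shared entity model."""
    text: str
    score: float = 1.0


# A stage takes an async stream of entities and returns another one.
Stage = Callable[[AsyncIterator[TextEntity]], AsyncIterator[TextEntity]]


async def source(terms: Iterable[str]) -> AsyncIterator[TextEntity]:
    """Turn raw strings into a stream of entities."""
    for term in terms:
        yield TextEntity(term)


async def normalize(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
    """Collapse whitespace and case-fold each entity."""
    async for entity in stream:
        yield TextEntity(" ".join(entity.text.split()).casefold(), entity.score)


async def drop_duplicates(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
    """Let each distinct text through only once, keeping first-seen order."""
    seen: set[str] = set()
    async for entity in stream:
        if entity.text not in seen:
            seen.add(entity.text)
            yield entity


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single stage."""
    def pipeline(stream: AsyncIterator[TextEntity]) -> AsyncIterator[TextEntity]:
        for stage in stages:
            stream = stage(stream)
        return stream
    return pipeline


async def collect(stream: AsyncIterator[TextEntity]) -> list[TextEntity]:
    return [entity async for entity in stream]


def run(terms: Iterable[str]) -> list[TextEntity]:
    pipeline = compose(normalize, drop_duplicates)
    return asyncio.run(collect(pipeline(source(terms))))


if __name__ == "__main__":
    for entity in run(["Machine  Learning", "machine learning", "AI"]):
        print(entity.text)
```

Because every stage has the same signature, adding or reordering stages only changes the arguments passed to compose.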

Alpha Status

termlint is currently alpha.

Implemented and supported now:

  • rule-based extraction (RuleExtractor / spaCy)
  • verification: exact, fuzzy
  • report export: JSON (extraction, verification, ontology_update, quality_gate)
  • glossary tooling: glossary from-report, glossary merge

Planned / not implemented yet:

  • extractors: CValue, KeyBERT
  • processing stages: filter, rank
  • verification stages: semantic, ensemble
  • exporters: HTML, JUnit

Compatibility Matrix

Dimension               | Current support
OS                      | Linux, macOS, Windows (CLI, JSON workflows)
Python                  | 3.12.x
Required extras         | termlint[base]
Core deps from extras   | spacy, rapidfuzz
Default spaCy model     | ru_core_news_sm
Console/output language | English-only CLI and report metadata
Tested text languages   | Russian (ru_core_news_sm), English (en_core_web_sm)
Other languages         | Possible via rules.model, but not yet validated in the alpha test matrix

Language Support Policy

  • The termlint pipeline is language-agnostic by design, but extraction quality depends on the selected spaCy model.
  • Officially tested in alpha:
    • Russian with ru_core_news_sm
    • English with en_core_web_sm
  • Other spaCy language models can be used via [tool.termlint.extraction.rules].model, but should be treated as experimental until formally tested.
  • CLI messages and generated report metadata are in English.

Quick Start

  1. Install:
# Recommended for CLI usage (isolated global tool)
pipx install "termlint[base]"

# Alternative: project/venv install
pip install --pre "termlint[base]"

# Install spaCy model into the same environment
python -m spacy download en_core_web_sm

For pipx, install the model inside the pipx environment:

pipx runpip termlint install en-core-web-sm
# or for Russian
pipx runpip termlint install ru-core-news-sm
  2. Create a minimal glossary (glossary.json):
[
  { "id": "ml:001", "label": "machine learning", "synonyms": ["ML"] },
  { "id": "ml:002", "label": "artificial intelligence", "synonyms": ["AI"] }
]
  3. Create an input text file (input.txt):
Artificial intelligence and machine learning are used in data analytics.
  4. Run verification:
termlint verify input.txt --source glossary.json --verifier fuzzy --threshold 85
  5. Expected output (example):
Files     ... 100%
✅ input.txt ... 100%
📊 Coverage: 33.3% (2/6)
⚠️  Quality Gate would FAIL in CI mode

Generated reports:

  • reports/verification.json
  • reports/ontology_update.json
  • reports/quality_gate.json

Exit behavior:

  • verify typically exits 0 on a successful run (even if the quality gate would fail in CI mode)
  • ci exits 1 when quality gates fail
  • the full contract is listed under Exit Codes
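The coverage figure in the example output is the share of extracted term candidates that match a glossary entry at or above the threshold. The sketch below illustrates that arithmetic only. It is not termlint's verifier: termlint lists rapidfuzz as a dependency, whereas this sketch swaps in difflib.SequenceMatcher from the standard library so it runs without extra packages, and its scores will differ from rapidfuzz's. The six candidate terms are made up for illustration and are not necessarily what the extractor produces.

```python
from difflib import SequenceMatcher

GLOSSARY = [
    {"id": "ml:001", "label": "machine learning", "synonyms": ["ML"]},
    {"id": "ml:002", "label": "artificial intelligence", "synonyms": ["AI"]},
]


def best_match(term: str, glossary: list[dict], threshold: float = 85.0):
    """Return (entry id or None, best score on a 0-100 scale).

    Compares the term with every label and synonym, case-insensitively.
    """
    needle = term.casefold()
    best_id, best_score = None, 0.0
    for entry in glossary:
        for name in [entry["label"], *entry.get("synonyms", [])]:
            score = 100.0 * SequenceMatcher(None, needle, name.casefold()).ratio()
            if score > best_score:
                best_id, best_score = entry["id"], score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


def coverage(terms: list[str], glossary: list[dict], threshold: float = 85.0):
    """Return (matched, total, percentage of terms found in the glossary)."""
    matched = sum(1 for t in terms if best_match(t, glossary, threshold)[0] is not None)
    total = len(terms)
    return matched, total, (100.0 * matched / total if total else 0.0)


if __name__ == "__main__":
    # Hypothetical candidate list, not real termlint extractor output.
    candidates = [
        "Artificial intelligence", "machine learning", "data analytics",
        "data", "analytics", "intelligence",
    ]
    matched, total, pct = coverage(candidates, GLOSSARY)
    print(f"Coverage: {pct:.1f}% ({matched}/{total})")
```

Lowering the threshold lets looser matches count as covered, which raises coverage at the cost of more false positives.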

Glossary JSON Schema

termlint expects a glossary file as a JSON array of objects.

Required fields per entity:

  • id (string)
  • label (string)

Optional fields:

  • synonyms (string[], default [])
  • relations (object<string, string[]>, default {})
  • definition (string | null)
  • source (string | null)

Minimal valid example:

[
  {
    "id": "ml:001",
    "label": "machine learning"
  }
]

Extended example:

[
  {
    "id": "ml:001",
    "label": "machine learning",
    "synonyms": ["ML"],
    "relations": {
      "related_to": ["ml:002"]
    },
    "definition": "Field focused on learning patterns from data.",
    "source": "internal-glossary"
  }
]
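If you want to sanity-check a glossary file before handing it to termlint, the field rules above can be expressed as a short stand-alone script. This is a sketch based only on the schema listed in this section; it is not termlint's own validator, and the function name and error messages are invented here.

```python
import json
from typing import Any


def validate_glossary(data: Any) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    if not isinstance(data, list):
        return ["glossary must be a JSON array of objects"]
    errors: list[str] = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"[{i}] must be an object")
            continue
        # Required string fields.
        for key in ("id", "label"):
            if not isinstance(item.get(key), str):
                errors.append(f"[{i}].{key} is required and must be a string")
        # synonyms: string[], default [].
        synonyms = item.get("synonyms", [])
        if not (isinstance(synonyms, list) and all(isinstance(s, str) for s in synonyms)):
            errors.append(f"[{i}].synonyms must be a list of strings")
        # relations: object<string, string[]>, default {}.
        relations = item.get("relations", {})
        relations_ok = isinstance(relations, dict) and all(
            isinstance(k, str)
            and isinstance(v, list)
            and all(isinstance(x, str) for x in v)
            for k, v in relations.items()
        )
        if not relations_ok:
            errors.append(f"[{i}].relations must map strings to lists of strings")
        # definition / source: string | null.
        for key in ("definition", "source"):
            if item.get(key) is not None and not isinstance(item[key], str):
                errors.append(f"[{i}].{key} must be a string or null")
    return errors


if __name__ == "__main__":
    with open("glossary.json", encoding="utf-8") as handle:
        problems = validate_glossary(json.load(handle))
    print("\n".join(problems) if problems else "glossary shape looks valid")
```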

Common validation/runtime errors:

  • File not found: Glossary file not found: <path>
  • Invalid JSON syntax: Invalid JSON in <path>: ...
  • Invalid entity shape/type: Failed to initialize glossary source '<path>': ...

Glossary Tooling

Create a glossary from an ontology_update report:

termlint glossary from-report \
  --report reports/ontology_update.json \
  --out glossary.generated.json \
  --min-score 0.7 \
  --min-frequency 1 \
  --namespace auto

Merge the generated glossary into an existing glossary:

termlint glossary merge \
  --base glossary.json \
  --updates glossary.generated.json \
  --out glossary.merged.json \
  --on-match merge-synonyms \
  --on-conflict report \
  --conflicts-out merge.conflicts.json \
  --summary-out merge.summary.json
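The exact semantics of --on-match merge-synonyms and --on-conflict report are not spelled out here, so the following is only one plausible reading, written as a sketch. It assumes that entries are paired by id, that a "match" means the same id with the same label (case-insensitive), and that a "conflict" means the same id with a different label. The function name and the conflict record layout are hypothetical and may not correspond to what termlint writes to merge.conflicts.json.

```python
def merge_glossaries(base: list[dict], updates: list[dict]) -> tuple[list[dict], list[dict]]:
    """Merge updates into base under the assumptions stated above.

    Returns (merged entries, conflict records). Base entries are never
    overwritten; conflicting updates are only reported.
    """
    # Copy entries so the inputs are left untouched.
    merged = {e["id"]: {**e, "synonyms": list(e.get("synonyms", []))} for e in base}
    conflicts: list[dict] = []
    for update in updates:
        current = merged.get(update["id"])
        if current is None:
            # New id: append the entry as-is.
            merged[update["id"]] = {**update, "synonyms": list(update.get("synonyms", []))}
        elif current["label"].casefold() == update["label"].casefold():
            # Assumed "match": union the synonyms, preserving order.
            for synonym in update.get("synonyms", []):
                if synonym not in current["synonyms"]:
                    current["synonyms"].append(synonym)
        else:
            # Assumed "conflict": keep the base entry and record the clash.
            conflicts.append({
                "id": update["id"],
                "base_label": current["label"],
                "update_label": update["label"],
            })
    return list(merged.values()), conflicts


if __name__ == "__main__":
    base = [{"id": "ml:001", "label": "machine learning", "synonyms": ["ML"]}]
    updates = [
        {"id": "ml:001", "label": "Machine Learning", "synonyms": ["statistical learning"]},
        {"id": "ml:003", "label": "data analytics"},
    ]
    merged, conflicts = merge_glossaries(base, updates)
    print(merged)
    print(conflicts)
```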

Development

poetry config virtualenvs.in-project true --local
poetry env use python3.12
poetry install --with dev --extras "base"

Logging

termlint follows common linter-style verbosity controls:

termlint -v verify <file>        # INFO logs
termlint -vv verify <file>       # DEBUG logs
termlint -q verify <file>        # ERROR only
termlint --log-level DEBUG verify <file>
termlint --log-file reports/termlint.log verify <file>
termlint --config ./pyproject.toml verify <file> --source ./glossary.json
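For readers unfamiliar with this style of verbosity flag, the mapping from -v, -vv and -q to log levels can be sketched with the standard library's argparse. This is an illustration only, not termlint's CLI code. Two details are assumptions: the level used when no flag is given (WARNING here) and the rule that -q wins over -v.

```python
import argparse
import logging


def level_from_flags(verbose: int, quiet: bool, default: int = logging.WARNING) -> int:
    """Map flag counts to a logging level (-q takes precedence by assumption)."""
    if quiet:
        return logging.ERROR
    if verbose >= 2:
        return logging.DEBUG
    if verbose == 1:
        return logging.INFO
    return default


def parse_verbosity(argv: list[str]) -> int:
    """Read only -v/-q from argv and ignore everything else."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-v", dest="verbose", action="count", default=0)
    parser.add_argument("-q", dest="quiet", action="store_true")
    args, _rest = parser.parse_known_args(argv)
    return level_from_flags(args.verbose, args.quiet)


if __name__ == "__main__":
    level = parse_verbosity(["-vv", "verify", "input.txt"])
    print(logging.getLevelName(level))
```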

You can also set defaults in pyproject.toml:

[tool.termlint.logging]
level = "WARNING"
log_file = "reports/termlint.log"
fmt = "%(asctime)s [%(name)s] %(levelname)-8s %(message)s"
datefmt = "%Y-%m-%d %H:%M:%S"
max_bytes = 10485760
backup_count = 5
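The keys in this section line up closely with parameters of Python's standard logging module: fmt and datefmt with logging.Formatter, max_bytes and backup_count with logging.handlers.RotatingFileHandler. The sketch below builds an equivalent handler from such a mapping. It assumes termlint uses a rotating file handler, which this README does not confirm; the build_file_handler function is hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler

# Same values as the [tool.termlint.logging] example above.
CONFIG = {
    "level": "WARNING",
    "log_file": "reports/termlint.log",
    "fmt": "%(asctime)s [%(name)s] %(levelname)-8s %(message)s",
    "datefmt": "%Y-%m-%d %H:%M:%S",
    "max_bytes": 10485760,  # 10 MiB per file before rotation
    "backup_count": 5,      # number of rotated files to keep
}


def build_file_handler(cfg: dict) -> RotatingFileHandler:
    """Create a rotating file handler from a logging config mapping."""
    handler = RotatingFileHandler(
        cfg["log_file"],
        maxBytes=cfg.get("max_bytes", 0),
        backupCount=cfg.get("backup_count", 0),
        encoding="utf-8",
        delay=True,  # do not open the file until the first record is written
    )
    handler.setLevel(cfg.get("level", "WARNING"))
    handler.setFormatter(logging.Formatter(fmt=cfg.get("fmt"), datefmt=cfg.get("datefmt")))
    return handler


if __name__ == "__main__":
    handler = build_file_handler(CONFIG)
    print(handler.maxBytes, handler.backupCount, logging.getLevelName(handler.level))
    handler.close()
```

With these values, the log rotates once the file reaches about 10 MiB, and at most five older files are kept alongside the current one.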

Automatic spaCy model download is disabled by default during lint runs. Configure extraction as follows:

[tool.termlint.extraction]
extractors = ["rule"]
rules = { model = "en_core_web_sm", auto_download_model = false }

Set auto_download_model = true only if you explicitly want runtime model download (not recommended for CI).

Config Discovery

Config lookup order:

  1. --config <PATH>
  2. nearest pyproject.toml (searching upward from current directory), section [tool.termlint]
  3. user-level config:
    • $XDG_CONFIG_HOME/termlint/config.toml (if set)
    • ~/.config/termlint/config.toml
    • %APPDATA%/termlint/config.toml (Windows)
    • ~/.termlint/config.toml
  4. built-in defaults

User-level config may use either:

  • [tool.termlint] (same as project config)
  • [termlint] (short form for standalone user config files)

Exit Codes

termlint uses a stable exit code contract:

  • 0: successful run
  • 1: quality gate failed (ci command)
  • 2: usage/configuration error (invalid options/config/source)
  • 3: internal pipeline/runtime error
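A wrapper script can branch on this contract. The sketch below maps the four documented codes to their meanings and runs an arbitrary command through subprocess. The helper names are invented for illustration, and the command in the usage example is the verify invocation from Quick Start.

```python
import subprocess
import sys

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "successful run",
    1: "quality gate failed",
    2: "usage/configuration error",
    3: "internal pipeline/runtime error",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_and_describe(cmd: list[str]) -> tuple[int, str]:
    """Run a command without raising on failure and describe its exit code."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, describe_exit_code(completed.returncode)


if __name__ == "__main__":
    code, meaning = run_and_describe(
        ["termlint", "verify", "input.txt", "--source", "glossary.json"]
    )
    print(f"termlint exited with {code}: {meaning}")
    sys.exit(code)
```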
