Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.

extract-cli

Part of the contract-ops CLI suite. extract-cli is the suite's passport control: the open-loop front door. The rest of the suite is a closed loop that only handles documents it authored from its own templates. extract-cli ingests any document (your own or a counterparty's foreign paper) and emits a structured representation the pipeline can consume. That pipeline runs template-vault-cli (storage) → draft-cli (fill placeholders) → nda-review-cli (review, redline, negotiate) → docx2pdf-cli (DOCX → PDF) → sign-cli (signing + audit), with cross-version drift detection via compare-cli.

extract-cli sits upstream of review: it turns foreign paper into the suite's canonical, structured vocabulary. Its output is a cross-CLI data contract — see docs/INTEROP.md and docs/spec/extract-output.schema.json.

ingest (extract) → review → diff → convert → sign
   ^you are here

Run this

pipx run extract-cli demo        # zero-config: extract a bundled NDA → structured JSON
# or, installed:  pip install extract-cli && extract demo

That prints the full output contract — parties, dates, term, governing law, and a clause map normalized onto the suite's canonical vocabulary — for a bundled fixture, with no setup and no network. Point it at your own file with extract path/to/contract.docx.

What it does

Give it a contract in .md / .txt / .html (native), .docx, or .pdf, and it returns structured JSON: the parties, dates, term, governing law, a clause map normalized onto the suite's canonical clause vocabulary, a defined-term inventory, and a headline value. Every field carries a confidence and a source so downstream tools verify, don't trust.

It is stdlib-only, single-file, terminal-first, and composable. No DB, no daemon, no network in the default path.

Install

pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
pip install "extract-cli[docx,pdf]"     # both

The core has zero runtime dependencies and is fully functional on .md/.txt/.html with no extras (HTML is also auto-detected when it hides inside a .txt, e.g. SEC EDGAR filings). .docx and .pdf work out of the box via stdlib readers; the [docx]/[pdf] extras improve fidelity on complex documents (see ARCHITECTURE.md).

The two extraction tiers

extract-cli is explicit about how it knows each field — encoded in every field's source and in _meta.tiers_used.

| Tier | When | Fields | Network? |
| --- | --- | --- | --- |
| deterministic | always on (default) | parties, dates, defined terms, clause map, governing law, best-effort term/notice/value | none |
| llm | opt-in via --llm only | renewal mechanics, obligation phrasing, ambiguous governing law | yes (your provider) |

The deterministic core is fully useful without the LLM. The LLM tier is opt-in, never in a hot path, and gated behind an explicit flag and a config file — if no config is present, --llm degrades gracefully with a warning and you still get the full deterministic output.

Clause-map fallback. Some documents (e.g. .docx that auto-number clauses via Word's numbering with no heading style) carry no signal the deterministic cascade can see, so its clause map comes back empty. When --llm is set and no clauses were detected, the LLM is asked for the section headings; the result is normalized through the same canonical vocabulary and emitted with tier: "llm", source: "llm", and a modest confidence (verify, not trust). When the deterministic cascade already found clauses, the LLM is not consulted for them.
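
The fallback rule above can be sketched in a few lines. This is a minimal illustration, not extract-cli's actual code: the function name resolve_clauses, the ask_llm_for_headings callback, and the 0.5 confidence are all hypothetical, and the normalization step through the canonical vocabulary is left out.

```python
from __future__ import annotations

from typing import Callable, Optional

Clause = dict  # one clause-map entry, shaped like the items under "clauses"


def resolve_clauses(
    deterministic: list[Clause],
    llm_enabled: bool,
    ask_llm_for_headings: Optional[Callable[[], list[str]]] = None,
) -> list[Clause]:
    """Prefer deterministic clauses; consult the LLM only when none were found."""
    if deterministic:
        # The cascade already found clauses, so the LLM is never asked.
        return deterministic
    if not llm_enabled or ask_llm_for_headings is None:
        # No --llm (or no usable config): the clause map stays empty.
        return []
    # Hypothetical "modest" confidence; the real value is not documented here.
    return [
        {"detected_title": heading, "tier": "llm", "source": "llm", "confidence": 0.5}
        for heading in ask_llm_for_headings()
    ]
```

The point of the design is that the LLM can only fill a gap, never override a deterministic result, and anything it contributes is labeled so downstream tools can treat it with more suspicion.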

Commands

extract <path>            # parse a document → structured JSON on stdout (default)
extract --catalog json    # machine-readable catalog of commands/flags (agents call at startup)
extract schema            # print the output JSON Schema (the cross-CLI contract)
extract fields            # list extractable fields and their tier
extract demo              # run on a bundled fixture and show the narrative
extract completion bash   # emit a shell-completion script (bash|zsh)

Flags

| Flag | Meaning |
| --- | --- |
| --catalog json | Print the machine-readable command/flag catalog and exit (the suite discovery contract; agents call this at startup) |
| --llm | Opt-in LLM enrichment of fuzzy fields (off by default) |
| --fields a,b,c | Emit only a subset of top-level fields (e.g. parties,clauses) |
| --format json\|table | Output format (default json) |
| --no-confidence | Omit confidence/source markers (reduced convenience view) |
| --json | Force JSON to stdout (the default) |
| --why | Rationale block on stderr |
| -q, --silent, --quiet | Suppress non-error diagnostics |
| --no-color | Disable ANSI color (also honors NO_COLOR / FORCE_COLOR) |
| -V, --version | Print extract-cli X.Y.Z |

Streams follow the suite convention: stdout is the machine payload (JSON), stderr is for humans (--why, warnings, errors). Exit codes: 0 success, 1 low-signal document (e.g. a scanned/empty PDF), 2 bad usage.
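
A caller that shells out to extract might map those exit codes as follows. This is a sketch under assumptions: the helper names interpret and run_extract are hypothetical, and because the text above does not say whether stdout still carries JSON on exit 1, the sketch returns None for a low-signal document instead of parsing it.

```python
from __future__ import annotations

import json
import subprocess
from typing import Any, Optional


def interpret(returncode: int, stdout: str) -> Optional[dict[str, Any]]:
    """Map the documented exit codes onto Python results."""
    if returncode == 0:
        return json.loads(stdout)  # success: stdout is the JSON payload
    if returncode == 1:
        return None  # low-signal document (e.g. a scanned or empty PDF)
    if returncode == 2:
        raise ValueError("extract rejected the command line (exit 2)")
    raise RuntimeError(f"unexpected exit code {returncode}")


def run_extract(path: str) -> Optional[dict[str, Any]]:
    """Run `extract <path>`; stderr is captured separately because it is for humans."""
    proc = subprocess.run(["extract", path], capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout)
```

Keeping the interpretation separate from the subprocess call makes the exit-code handling testable without the CLI installed.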

Output shape (abridged)

{
  "document":   { "title": "...", "format": "markdown", "sha256": "…", "source_path": "nda.md" },
  "parties":    [ { "name": "Acme Robotics, Inc.", "role": "Disclosing Party", "confidence": 0.9, "source": "deterministic" } ],
  "dates":      { "effective": { "value": "2024-03-01", "confidence": 0.85, "source": "deterministic" }, "expiration": { "value": null, "confidence": 0.0, "source": "none" } },
  "term":       { "length": { "value": "3 years", ... }, "auto_renew": { "value": true, ... }, "notice_period_days": { "value": 60, ... } },
  "governing_law": { "value": "State of Delaware", "confidence": 0.85, "source": "deterministic" },
  "jurisdiction": { "value": "US-DE", "confidence": 0.8, "source": "deterministic" },
  "clauses":    [ { "canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations", "tier": "h2", "span": {"start": 0, "end": 120}, "confidence": 0.95, "source": "deterministic", "mapped": true } ],
  "defined_terms": [ { "term": "Confidential Information", "confidence": 0.6, "source": "deterministic" } ],
  "value":      { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
  "amounts":    [ { "value": "$50,000", "confidence": 0.6, "source": "deterministic" } ],
  "signatories": [ { "name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic" } ],
  "_meta":      { "extractor_version": "0.1.11", "tiers_used": ["deterministic"], "llm_used": false }
}
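
Because every field carries a confidence, a consumer can walk the payload and list whatever falls below its own threshold. The sketch below assumes only the shape shown above; the function name low_confidence and the 0.7 threshold are illustrative choices, not part of extract-cli.

```python
from __future__ import annotations

from typing import Any, Iterator


def low_confidence(node: Any, threshold: float, path: str = "") -> Iterator[tuple[str, float]]:
    """Yield (json_path, confidence) for every object whose confidence is below threshold."""
    if isinstance(node, dict):
        conf = node.get("confidence")
        if isinstance(conf, (int, float)) and conf < threshold:
            yield path, float(conf)
        for key, child in node.items():
            yield from low_confidence(child, threshold, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from low_confidence(child, threshold, f"{path}[{index}]")


# A fragment of the abridged output above.
sample = {
    "governing_law": {"value": "State of Delaware", "confidence": 0.85, "source": "deterministic"},
    "dates": {"expiration": {"value": None, "confidence": 0.0, "source": "none"}},
    "signatories": [{"name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic"}],
}
flagged = list(low_confidence(sample, 0.7))
# flagged == [("dates.expiration", 0.0), ("signatories[0]", 0.55)]
```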

The clause map (the differentiator)

A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's "## Confidentiality" are the same clause. extract-cli extends template-vault-cli's clause-detection cascade: ## H2 headings → bold-numbered **1. …** → plain numbered (1. Term, Section 3. …, two-line ARTICLE N) → ALL-CAPS lines, with an opt-in --llm fallback. On top of the cascade it adds a built-in canonical alias vocabulary to normalize foreign clause titles onto the names the rest of the suite already speaks. Clauses it can't map are kept with mapped: false (and a * in the table view) so nothing is silently dropped.

extract counterparty.pdf | jq '.clauses[] | {canonical_title, detected_title, mapped}'
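
To make the normalization idea concrete, here is a toy version of alias mapping. Everything in it is an assumption for illustration: the ALIASES table, the prefix regex, and the choice to echo the detected title as canonical_title for unmapped clauses are hypothetical and much smaller than the real vocabulary.

```python
import re

# Hypothetical alias table: lower-cased bare title -> canonical clause name.
ALIASES = {
    "confidentiality": "Confidentiality",
    "confidentiality obligations": "Confidentiality",
    "non-disclosure": "Confidentiality",
    "governing law": "Governing Law",
    "choice of law": "Governing Law",
}

# Strips heading markup and numbering such as "## ", "**1. ", "SECTION 7. ".
_PREFIX = re.compile(
    r"^\s*(?:#+\s*|\*\*)?(?:(?:section|article)\s+)?(?:\d+[.)]?\s*)?",
    re.IGNORECASE,
)


def normalize(detected_title: str) -> dict:
    """Map a foreign heading onto a canonical title, keeping unmapped ones."""
    bare = _PREFIX.sub("", detected_title, count=1).strip(" *.:").lower()
    canonical = ALIASES.get(bare)
    return {
        "detected_title": detected_title,
        # Assumption: unmapped clauses keep their detected title.
        "canonical_title": canonical if canonical is not None else detected_title,
        "mapped": canonical is not None,
    }
```

With this table, "SECTION 7. NON-DISCLOSURE" and "## Confidentiality" both normalize to Confidentiality, while a heading like "Exhibit B" comes back with mapped set to False instead of being dropped.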

Composability — piping into the rest of the suite

extract-cli is built to be the first stage of a Unix pipe. The glue is its stdout JSON + standard tools (jq, comm) and the shared clause vocabulary: extract's canonical_title values are the same names template-vault-cli detects and nda-review-cli keys policy on, so a foreign document's clauses line up with the suite's with no bespoke adapter. Every example below is runnable today (verified against the real sibling CLIs).

# 1) Inspect any contract's structure (.md/.txt/.html/.docx/.pdf, one tool).
extract counterparty.docx | jq '{parties: [.parties[].name],
  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'

# 2) Clause-coverage gap vs your canonical template in template-vault-cli.
#    extract normalizes the counterparty's *foreign* headings onto the same
#    clause vocabulary template-vault detects, so a plain `comm` diffs them.
template-vault info nda/mutual-standard --json | jq -r '.clauses[].title' | sort > ours.txt
extract counterparty_nda.docx | jq -r '.clauses[].canonical_title' | sort -u > theirs.txt
comm -23 ours.txt theirs.txt    # clauses in OUR standard that THEY are missing
comm -13 ours.txt theirs.txt    # clauses THEY added that we don't have

# 3) Intake: extract for structure, nda-review-cli for a policy verdict on the
#    same foreign doc; merge both views with jq.
extract counterparty_nda.docx > extract.json
nda-review review --file counterparty_nda.docx --playbook output/nda_playbook.json \
  --out-json review.json
jq -n --slurpfile e extract.json --slurpfile r review.json \
  '{parties: [$e[0].parties[].name], governing_law: $e[0].governing_law.value,
    clauses: ($e[0].clauses | length), decision: $r[0].decision, risk: $r[0].risk_score}'

# 4) Triage a folder of inbound contracts: governing law + parties per file.
for f in inbox/*; do
  extract "$f" --fields parties,governing_law --no-confidence \
    | jq -c --arg f "$f" '{file: $f, gov: .governing_law, parties: [.parties[].name]}'
done

# 5) Gate a workflow on extraction confidence (non-zero exit if any clause is shaky).
extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)' && echo "ok to review"

The integration contract is the output schema and the canonical clause vocabulary, not per-tool flags. See docs/INTEROP.md for the shared conventions and the schema's versioning commitment.
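
Because the contract is the JSON shape, the same clause-coverage comparison can be done in any language without the CLIs' flags. A minimal Python sketch, assuming you already have your template's clause titles as a list and the clauses array from extract's output (the function name coverage_gap and its return keys are hypothetical):

```python
from __future__ import annotations


def coverage_gap(ours: list[str], theirs: list[dict]) -> dict[str, list[str]]:
    """Compare our canonical clause titles with the clauses extracted from their document."""
    ours_set = set(ours)
    theirs_set = {clause["canonical_title"] for clause in theirs}
    return {
        # Clauses in our standard that their document lacks (like `comm -23`).
        "missing_from_theirs": sorted(ours_set - theirs_set),
        # Clauses their document adds that we don't have (like `comm -13`).
        "added_by_theirs": sorted(theirs_set - ours_set),
    }


gap = coverage_gap(
    ["Confidentiality", "Term", "Governing Law"],
    [
        {"canonical_title": "Confidentiality"},
        {"canonical_title": "Non-Solicitation"},
        {"canonical_title": "Confidentiality"},
    ],
)
# gap == {"missing_from_theirs": ["Governing Law", "Term"],
#         "added_by_theirs": ["Non-Solicitation"]}
```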

LLM configuration (opt-in)

--llm reads a shared suite config, in this order:

  1. ~/.config/contract-ops/llm.json (suite-wide — preferred)
  2. ./config/llm.json (repo-local override)

Copy config/llm.json.example to one of those paths. Configure it once and every suite tool that adopts the same lookup gets LLM features for free. Without it, --llm just warns and returns the deterministic output.
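
A tool adopting the same lookup could resolve the config roughly like this. The sketch assumes "in this order" means the first existing path in the list wins; the function name find_llm_config is hypothetical, and the real precedence between the suite-wide file and the repo-local override may differ.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional, Sequence


def find_llm_config(candidates: Optional[Sequence[Path]] = None) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None if none does."""
    if candidates is None:
        candidates = (
            Path.home() / ".config" / "contract-ops" / "llm.json",  # suite-wide
            Path("config") / "llm.json",  # repo-local
        )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Caller should warn and continue with the deterministic output only.
    return None
```

Returning None instead of raising mirrors the documented behavior: a missing config is a warning, not a failure.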

Accuracy

Line coverage tells you the code runs; it doesn't tell you the extraction is correct. make eval scores the deterministic tier against a small corpus of real, executed contracts (SEC EDGAR filings) with hand-verified ground truth (tests/eval/), reporting precision/recall per field:

| Field | Score |
| --- | --- |
| parties | P 1.00 · R 0.92 · F1 0.96 |
| effective date | accuracy 1.00 |
| governing law | accuracy 1.00 |
| jurisdiction (normalized) | accuracy 1.00 |
| clauses (recall on verified sections) | 0.45 |

Clause recall is the honest weak spot — heading detection on dense HTML exhibits still misses sections. A test (tests/test_eval.py) gates these so accuracy can't silently regress.
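
For readers checking the numbers, precision, recall, and F1 relate as in the sketch below. It is a generic set-based computation, not the code behind make eval; how the benchmark aggregates across documents is not stated here, and the party names in the toy example are made up.

```python
from __future__ import annotations


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted set against ground truth."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_from(precision, recall)


# The parties row: P 1.00 and R 0.92 give F1 = 1.84 / 1.92, which rounds to 0.96.
parties_f1 = round(f1_from(1.00, 0.92), 2)

# Toy example: one of two true parties found, nothing spurious.
toy = prf1({"Acme Robotics, Inc."}, {"Acme Robotics, Inc.", "Beta LLC"})
```

A precision of 1.00 with recall below 1.00 means the extractor reported no wrong parties but missed some, which is the safer failure mode for a tool whose output is meant to be verified.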

Development

make install      # editable install with the [dev] extra
make test         # full suite
make coverage     # suite + coverage report (installs extras; fails under 100%)
make typecheck    # mypy --strict
make eval         # accuracy benchmark vs the labeled corpus
make build        # wheel + sdist
make smoke        # build, install the wheel in a clean venv, run it
make spec-check   # assert docs/spec schema == `extract schema`
make release VERSION=X.Y.Z

See ARCHITECTURE.md and CONTRIBUTING.md.

License

MIT — see LICENSE.
