Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.html/.docx/.pdf) and emit structured JSON.

extract-cli

Part of the contract-ops CLI suite. extract-cli is the suite's passport control: the open-loop front door. The rest of the suite is a closed loop that only handles documents it authored from its own templates: template-vault-cli (storage) → draft-cli (fill placeholders) → nda-review-cli (review, redline, negotiate) → docx2pdf-cli (DOCX → PDF) → sign-cli (signing + audit), with cross-version drift detection via compare-cli. extract-cli ingests any document (yours or a counterparty's foreign paper) and emits a structured representation that pipeline can consume.

extract-cli sits upstream of review: it turns foreign paper into the suite's canonical, structured vocabulary. Its output is a cross-CLI data contract — see docs/INTEROP.md and docs/spec/extract-output.schema.json.

ingest (extract) → review → diff → convert → sign
   ^you are here

Run this

pipx run extract-cli demo        # zero-config: extract a bundled NDA → structured JSON
# or, installed:  pip install extract-cli && extract demo

That prints the full output contract — parties, dates, term, governing law, and a clause map normalized onto the suite's canonical vocabulary — for a bundled fixture, with no setup and no network. Point it at your own file with extract path/to/contract.docx.

What it does

Give it a contract in .md / .txt / .html (native), .docx, or .pdf, and it returns structured JSON: the parties, dates, term, governing law, a clause map normalized onto the suite's canonical clause vocabulary, a defined-term inventory, and a headline value. Every field carries a confidence and a source so downstream tools verify, don't trust.

It is stdlib-only, single-file, terminal-first, and composable. No DB, no daemon, no network in the default path.

Install

pip install extract-cli                 # core: .md/.txt/.html + best-effort .docx/.pdf
pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
pip install "extract-cli[docx,pdf]"     # both

The core has zero runtime dependencies and is fully functional on .md/.txt/.html with no extras (HTML is also auto-detected when it hides inside a .txt, e.g. SEC EDGAR filings). .docx and .pdf work out of the box via stdlib readers; the [docx]/[pdf] extras improve fidelity on complex documents (see ARCHITECTURE.md).

The two extraction tiers

extract-cli is explicit about how it knows each field — encoded in every field's source and in _meta.tiers_used.

| Tier | When | Fields | Network? |
| --- | --- | --- | --- |
| deterministic | always on (default) | parties, dates, defined terms, clause map, governing law, best-effort term/notice/value | none |
| llm | opt-in via --llm only | renewal mechanics, obligation phrasing, ambiguous governing law | yes (your provider) |

The deterministic core is fully useful without the LLM. The LLM tier is opt-in, never in a hot path, and gated behind an explicit flag and a config file — if no config is present, --llm degrades gracefully with a warning and you still get the full deterministic output.

Clause-map fallback. Some documents (e.g. .docx that auto-number clauses via Word's numbering with no heading style) carry no signal the deterministic cascade can see, so its clause map comes back empty. When --llm is set and no clauses were detected, the LLM is asked for the section headings; the result is normalized through the same canonical vocabulary and emitted with tier: "llm", source: "llm", and a modest confidence (verify, not trust). When the deterministic cascade already found clauses, the LLM is not consulted for them.
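A consumer can check the tier bookkeeping described above for itself. The following is a minimal sketch, not part of extract-cli: it walks a parsed output document, collects every `source` value, and reports any that `_meta.tiers_used` does not declare. The sample document is hypothetical, and treating `"none"` as "no value found" is an assumption inferred from the abridged output shape.

```python
import json

# Hypothetical sample shaped like the abridged output in this README;
# the values are illustrative, not real extractor output.
SAMPLE = json.loads("""
{
  "governing_law": {"value": "State of Delaware", "confidence": 0.85, "source": "deterministic"},
  "dates": {
    "effective": {"value": "2024-03-01", "confidence": 0.85, "source": "deterministic"},
    "expiration": {"value": null, "confidence": 0.0, "source": "none"}
  },
  "clauses": [
    {"canonical_title": "Confidentiality", "tier": "llm", "confidence": 0.5, "source": "llm"}
  ],
  "_meta": {"tiers_used": ["deterministic", "llm"], "llm_used": true}
}
""")


def sources_seen(node):
    """Recursively collect every "source" string found in the document."""
    found = set()
    if isinstance(node, dict):
        src = node.get("source")
        if isinstance(src, str):
            found.add(src)
        for value in node.values():
            found |= sources_seen(value)
    elif isinstance(node, list):
        for item in node:
            found |= sources_seen(item)
    return found


def undeclared_tiers(doc):
    """Sources used by fields but absent from _meta.tiers_used.

    Assumption: "none" marks a field with no value, so it is not a tier.
    """
    declared = set(doc.get("_meta", {}).get("tiers_used", []))
    return sources_seen(doc) - declared - {"none"}


print(sorted(sources_seen(SAMPLE)))  # ['deterministic', 'llm', 'none']
print(undeclared_tiers(SAMPLE))      # set()
```

An empty result means every field's provenance is accounted for in `_meta.tiers_used`; anything else is worth a closer look before trusting the document downstream.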

Commands

extract <path>            # parse a document → structured JSON on stdout (default)
extract --catalog json    # machine-readable catalog of commands/flags (agents call at startup)
extract schema            # print the output JSON Schema (the cross-CLI contract)
extract fields            # list extractable fields and their tier
extract demo              # run on a bundled fixture and show the narrative
extract completion bash   # emit a shell-completion script (bash|zsh)

Flags

| Flag | Meaning |
| --- | --- |
| --catalog json | Print the machine-readable command/flag catalog and exit (the suite discovery contract; agents call this at startup) |
| --llm | Opt-in LLM enrichment of fuzzy fields (off by default) |
| --fields a,b,c | Emit only a subset of top-level fields (e.g. parties,clauses) |
| --format json\|table | Output format (default json) |
| --no-confidence | Omit confidence/source markers (reduced convenience view) |
| --json | Force JSON to stdout (the default) |
| --why | Rationale block on stderr |
| -q, --silent, --quiet | Suppress non-error diagnostics |
| --no-color | Disable ANSI color (also honors NO_COLOR / FORCE_COLOR) |
| -V, --version | Print extract-cli X.Y.Z |

Streams follow the suite convention: stdout is the machine payload (JSON), stderr is for humans (--why, warnings, errors). Exit codes: 0 success, 1 low-signal document (e.g. a scanned/empty PDF), 2 bad usage.
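For callers that drive the CLI from Python rather than a shell, the stream and exit-code convention above might be handled roughly as follows. This is a sketch under assumptions: `interpret` and `run_extract` are hypothetical helper names, and whether a low-signal run (exit 1) still prints JSON is not stated here, so the code tolerates both cases.

```python
import json
import subprocess

# Exit codes as documented above: 0 success, 1 low-signal document, 2 bad usage.
EXIT_MEANING = {0: "ok", 1: "low-signal", 2: "bad-usage"}


def interpret(returncode, stdout):
    """Map an exit code and captured stdout to (status, parsed JSON or None)."""
    status = EXIT_MEANING.get(returncode, "unknown")
    payload = None
    if stdout.strip():
        try:
            payload = json.loads(stdout)
        except json.JSONDecodeError:
            payload = None  # stdout was not JSON (e.g. --format table)
    return status, payload


def run_extract(path):
    """Run `extract <path>`, capturing only stdout.

    stderr is left attached to the terminal so warnings and --why output
    stay visible to the human, matching the stream convention.
    """
    proc = subprocess.run(["extract", path], stdout=subprocess.PIPE, text=True)
    return interpret(proc.returncode, proc.stdout)
```

Keeping the interpretation separate from the subprocess call makes the exit-code mapping easy to unit test without the CLI installed.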

Output shape (abridged)

{
  "document":   { "title": "...", "format": "markdown", "sha256": "…", "source_path": "nda.md" },
  "parties":    [ { "name": "Acme Robotics, Inc.", "role": "Disclosing Party", "confidence": 0.9, "source": "deterministic" } ],
  "dates":      { "effective": { "value": "2024-03-01", "confidence": 0.85, "source": "deterministic" }, "expiration": { "value": null, "confidence": 0.0, "source": "none" } },
  "term":       { "length": { "value": "3 years", ... }, "auto_renew": { "value": true, ... }, "notice_period_days": { "value": 60, ... } },
  "governing_law": { "value": "State of Delaware", "confidence": 0.85, "source": "deterministic" },
  "jurisdiction": { "value": "US-DE", "confidence": 0.8, "source": "deterministic" },
  "clauses":    [ { "canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations", "tier": "h2", "span": {"start": 0, "end": 120}, "confidence": 0.95, "source": "deterministic", "mapped": true } ],
  "defined_terms": [ { "term": "Confidential Information", "confidence": 0.6, "source": "deterministic" } ],
  "value":      { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
  "amounts":    [ { "value": "$50,000", "confidence": 0.6, "source": "deterministic" } ],
  "signatories": [ { "name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic" } ],
  "_meta":      { "extractor_version": "0.1.11", "tiers_used": ["deterministic"], "llm_used": false }
}
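Because every field in the shape above carries its own `confidence`, a downstream tool can build a "verify these first" list without knowing the schema in detail. A minimal sketch, assuming the 0.7 threshold is a reasonable cutoff (it is a choice, not something the tool prescribes) and using a hypothetical sample with values borrowed from the abridged shape:

```python
import json

# Hypothetical sample; field values mirror the abridged shape above.
SAMPLE = json.loads("""
{
  "governing_law": {"value": "State of Delaware", "confidence": 0.85, "source": "deterministic"},
  "defined_terms": [{"term": "Confidential Information", "confidence": 0.6, "source": "deterministic"}],
  "signatories": [{"name": "Jane Doe", "title": "CEO", "confidence": 0.55, "source": "deterministic"}]
}
""")


def fields_to_verify(node, threshold=0.7, path=""):
    """Yield (path, confidence) for every object whose confidence is below threshold."""
    if isinstance(node, dict):
        conf = node.get("confidence")
        if isinstance(conf, (int, float)) and not isinstance(conf, bool) and conf < threshold:
            yield (path or "$", conf)
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from fields_to_verify(value, threshold, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from fields_to_verify(item, threshold, f"{path}[{index}]")


for field_path, confidence in fields_to_verify(SAMPLE):
    print(f"{field_path}: {confidence}")
# defined_terms[0]: 0.6
# signatories[0]: 0.55
```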

The clause map (the differentiator)

A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's "## Confidentiality" are the same clause. extract-cli extends template-vault-cli's clause-detection cascade: ## H2 headings → bold-numbered **1. …** → plain numbered (1. Term, Section 3. …, two-line ARTICLE N) → ALL-CAPS lines (and an opt-in --llm fallback). On top of that it adds a built-in canonical alias vocabulary to normalize foreign clause titles onto the names the rest of the suite already speaks. Clauses it can't map are kept with mapped: false (and a * in the table view) so nothing is silently dropped.

extract counterparty.pdf | jq '.clauses[] | {canonical_title, detected_title, mapped}'
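The idea of normalizing foreign headings through an alias vocabulary can be illustrated with a small sketch. This is not extract-cli's implementation: the alias table, the prefix-stripping regex, and the `normalize` function are all hypothetical, and what the tool puts in `canonical_title` for an unmapped clause is not documented here.

```python
import re

# Hypothetical alias table. The real vocabulary ships inside extract-cli
# and is not reproduced here.
ALIASES = {
    "confidentiality": "Confidentiality",
    "confidentiality obligations": "Confidentiality",
    "non-disclosure": "Confidentiality",
    "governing law": "Governing Law",
    "choice of law": "Governing Law",
}

# Leading decoration to strip: markdown hashes, bold markers, and
# "SECTION 7." / "ARTICLE 3" / "1." style numbering.
_PREFIX = re.compile(
    r"^[#*\s]*(?:(?:section|article)\s+)?(?:\d+[.)]?\s*)?",
    re.IGNORECASE,
)


def normalize(detected_title):
    """Return (canonical_title, mapped) for a detected heading."""
    core = _PREFIX.sub("", detected_title).strip(" *.:").lower()
    if core in ALIASES:
        return ALIASES[core], True
    # Unmapped clauses are kept rather than dropped, mirroring mapped: false.
    return detected_title.strip("#* "), False


print(normalize("SECTION 7. NON-DISCLOSURE"))       # ('Confidentiality', True)
print(normalize("## Confidentiality Obligations"))  # ('Confidentiality', True)
print(normalize("**1. Governing Law**"))            # ('Governing Law', True)
print(normalize("Force Majeure"))                   # ('Force Majeure', False)
```

The point of the design is that matching happens on a decoration-free, case-folded key, so the same alias entry covers markdown, bold-numbered, and ALL-CAPS variants of a heading.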

Composability — piping into the rest of the suite

extract-cli is built to be the first stage of a Unix pipe. The glue is its stdout JSON + standard tools (jq, comm) and the shared clause vocabulary: extract's canonical_title values are the same names template-vault-cli detects and nda-review-cli keys policy on, so a foreign document's clauses line up with the suite's with no bespoke adapter. Every example below is runnable today (verified against the real sibling CLIs).

# 1) Inspect any contract's structure (.md/.txt/.html/.docx/.pdf, one tool).
extract counterparty.docx | jq '{parties: [.parties[].name],
  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'

# 2) Clause-coverage gap vs your canonical template in template-vault-cli.
#    extract normalizes the counterparty's *foreign* headings onto the same
#    clause vocabulary template-vault detects, so a plain `comm` diffs them.
template-vault info nda/mutual-standard --json | jq -r '.clauses[].title' | sort > ours.txt
extract counterparty_nda.docx | jq -r '.clauses[].canonical_title' | sort -u > theirs.txt
comm -23 ours.txt theirs.txt    # clauses in OUR standard that THEY are missing
comm -13 ours.txt theirs.txt    # clauses THEY added that we don't have

# 3) Intake: extract for structure, nda-review-cli for a policy verdict on the
#    same foreign doc; merge both views with jq.
extract counterparty_nda.docx > extract.json
nda-review review --file counterparty_nda.docx --playbook output/nda_playbook.json \
  --out-json review.json
jq -n --slurpfile e extract.json --slurpfile r review.json \
  '{parties: [$e[0].parties[].name], governing_law: $e[0].governing_law.value,
    clauses: ($e[0].clauses | length), decision: $r[0].decision, risk: $r[0].risk_score}'

# 4) Triage a folder of inbound contracts: governing law + parties per file.
for f in inbox/*; do
  extract "$f" --fields parties,governing_law --no-confidence \
    | jq -c --arg f "$f" '{file: $f, gov: .governing_law, parties: [.parties[].name]}'
done

# 5) Gate a workflow on extraction confidence (non-zero exit if any clause is shaky,
#    or if no clauses were detected, since `all` is vacuously true on an empty list).
extract draft.docx | jq -e '.clauses | length > 0 and all(.confidence > 0.7)' && echo "ok to review"

The integration contract is the output schema and the canonical clause vocabulary, not per-tool flags. See docs/INTEROP.md for the shared conventions and the schema's versioning commitment.
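The clause-coverage comparison from example 2 can also be done in-process once both JSON documents are loaded. A minimal sketch using set differences in place of `comm`; the function names and the sample titles are hypothetical, and only the `clauses[].canonical_title` path is taken from the output shape:

```python
def canonical_titles(extract_doc):
    """Pull canonical clause titles from a parsed extract output document."""
    return [clause["canonical_title"] for clause in extract_doc.get("clauses", [])]


def coverage_gap(ours, theirs):
    """Return (missing_from_theirs, added_by_theirs) as sorted lists.

    Equivalent to `comm -23` and `comm -13` on sorted, de-duplicated inputs.
    """
    ours_set, theirs_set = set(ours), set(theirs)
    return sorted(ours_set - theirs_set), sorted(theirs_set - ours_set)


# Hypothetical data: our template's clause titles and a counterparty extract.
OURS = ["Confidentiality", "Governing Law", "Return of Materials", "Term"]
THEIRS_DOC = {
    "clauses": [
        {"canonical_title": "Confidentiality", "mapped": True},
        {"canonical_title": "Term", "mapped": True},
        {"canonical_title": "Non-Solicitation", "mapped": True},
        {"canonical_title": "Confidentiality", "mapped": True},  # duplicate heading
    ]
}

missing, added = coverage_gap(OURS, canonical_titles(THEIRS_DOC))
print("missing from theirs:", missing)  # ['Governing Law', 'Return of Materials']
print("added by theirs:", added)        # ['Non-Solicitation']
```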

LLM configuration (opt-in)

--llm reads a shared suite config, in this order:

  1. ~/.config/contract-ops/llm.json (suite-wide — preferred)
  2. ./config/llm.json (repo-local override)

Copy config/llm.json.example to one of those paths. Configure it once and every suite tool that adopts the same lookup gets LLM features for free. Without it, --llm just warns and returns the deterministic output.
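A tool that wants to adopt the same lookup could resolve the config along these lines. This sketch assumes "in this order" means the first existing file wins; how the repo-local file interacts with the suite-wide one beyond that is not specified here, and `find_llm_config` is a hypothetical name.

```python
from pathlib import Path


def default_candidates():
    """Candidate paths in the documented order: suite-wide, then repo-local."""
    return [
        Path.home() / ".config" / "contract-ops" / "llm.json",
        Path("config") / "llm.json",
    ]


def find_llm_config(candidates=None):
    """Return the first candidate that exists as a file, or None.

    None corresponds to the documented fallback: warn and return the
    deterministic output only.
    """
    if candidates is None:
        candidates = default_candidates()
    for path in candidates:
        if path.is_file():
            return path
    return None


config_path = find_llm_config()
if config_path is None:
    print("no LLM config found; deterministic output only")
else:
    print(f"using LLM config at {config_path}")
```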

Development

make install      # editable install with the [dev] extra
make test         # full suite
make coverage     # suite + coverage report
make typecheck    # mypy --strict
make build        # wheel + sdist
make smoke        # build, install the wheel in a clean venv, run it
make spec-check   # assert docs/spec schema == `extract schema`
make release VERSION=X.Y.Z

See ARCHITECTURE.md and CONTRIBUTING.md.

License

MIT — see LICENSE.
