Open-loop front door of the contract-ops CLI suite: ingest any contract (.md/.txt/.docx/.pdf) and emit structured JSON.

extract-cli

Part of the contract-ops CLI suite. extract-cli is the suite's passport control: the open-loop front door. The rest of the suite is a closed loop that only handles documents it authored from its own templates. extract-cli ingests any document (yours or a counterparty's foreign paper) and emits a structured representation the pipeline can consume. That pipeline runs template-vault-cli (storage) → draft-cli (fill placeholders) → nda-review-cli (review, redline, negotiate) → docx2pdf-cli (DOCX → PDF) → sign-cli (signing + audit), with compare-cli providing cross-version drift detection.

extract-cli sits upstream of review: it turns foreign paper into the suite's canonical, structured vocabulary. Its output is a cross-CLI data contract — see docs/INTEROP.md and docs/spec/extract-output.schema.json.

ingest (extract) → review → diff → convert → sign
   ^you are here

What it does

Give it a contract in .md / .txt (native), .docx, or .pdf, and it returns structured JSON: the parties, dates, term, governing law, a clause map normalized onto the suite's canonical clause vocabulary, a defined-term inventory, and a headline value. Every field carries a confidence and a source so downstream tools verify, don't trust.
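
A minimal sketch of what "verify, don't trust" can look like on the consuming side. The helper name trusted_value and the 0.8 threshold are illustrative assumptions, not part of extract-cli; the field objects mirror the abridged output shape shown further down.

```python
from typing import Any, Optional


def trusted_value(field: dict[str, Any], min_confidence: float = 0.8) -> Optional[Any]:
    """Return field["value"] only when the extractor was confident enough.

    Hypothetical helper: the threshold is a consumer-side policy choice.
    """
    # A source of "none" means the extractor found nothing for this field.
    if field.get("source", "none") == "none":
        return None
    if field.get("confidence", 0.0) < min_confidence:
        return None
    return field.get("value")


# Field objects copied from the abridged output shape.
effective = {"value": "2024-03-01", "confidence": 0.85, "source": "deterministic"}
expiration = {"value": None, "confidence": 0.0, "source": "none"}
headline = {"value": "$50,000", "confidence": 0.6, "source": "deterministic"}

print(trusted_value(effective))   # 2024-03-01
print(trusted_value(expiration))  # None
print(trusted_value(headline))    # None, because 0.6 is below the 0.8 threshold
```

A rejected field is a prompt for human review rather than an error; the raw value is still present in the JSON.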

It is stdlib-only, single-file, terminal-first, and composable. No DB, no daemon, no network in the default path.

Install

pip install extract-cli                 # core: .md/.txt + best-effort .docx/.pdf
pip install "extract-cli[docx]"         # higher-fidelity .docx (python-docx)
pip install "extract-cli[pdf]"          # higher-fidelity .pdf (pypdf)
pip install "extract-cli[docx,pdf]"     # both

The core has zero runtime dependencies and is fully functional on .md/.txt with no extras. .docx and .pdf work out of the box via stdlib readers; the [docx]/[pdf] extras improve fidelity on complex documents (see ARCHITECTURE.md).

The two extraction tiers

extract-cli is explicit about how it knows each field: the tier that produced a value is recorded in that field's source and summarized in _meta.tiers_used.

Tier           When                   Fields                                                                                   Network?
deterministic  always on (default)    parties, dates, defined terms, clause map, governing law, best-effort term/notice/value  none
llm            opt-in via --llm only  renewal mechanics, obligation phrasing, ambiguous governing law                          yes (your provider)

The deterministic core is fully useful without the LLM. The LLM tier is opt-in, never in a hot path, and gated behind an explicit flag and a config file — if no config is present, --llm degrades gracefully with a warning and you still get the full deterministic output.
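
One way a consumer might audit which tier produced which field is to walk the JSON and group every object by its source, then compare the result with _meta.tiers_used. This is a sketch under assumptions: paths_by_source is a hypothetical helper, and it assumes source values use the same names as the tiers (the abridged output only confirms "deterministic" and "none").

```python
from collections import defaultdict
from typing import Any


def paths_by_source(doc: Any) -> dict[str, list[str]]:
    """Group the path of every object that has a "source" key by that source."""
    found: dict[str, list[str]] = defaultdict(list)

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            if "source" in node:
                found[node["source"]].append(path or "$")
            for key, child in node.items():
                if key != "_meta":  # bookkeeping, not an extracted field
                    walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for index, child in enumerate(node):
                walk(child, f"{path}[{index}]")

    walk(doc, "")
    return dict(found)


# Sample data trimmed from the abridged output shape.
sample = {
    "parties": [{"name": "Acme Robotics, Inc.", "role": "Disclosing Party",
                 "confidence": 0.9, "source": "deterministic"}],
    "dates": {
        "effective": {"value": "2024-03-01", "confidence": 0.85, "source": "deterministic"},
        "expiration": {"value": None, "confidence": 0.0, "source": "none"},
    },
    "governing_law": {"value": "State of Delaware", "confidence": 0.85, "source": "deterministic"},
    "_meta": {"extractor_version": "0.1.0", "tiers_used": ["deterministic"], "llm_used": False},
}

groups = paths_by_source(sample)
print(groups)

# Every source other than "none" should be a tier the run reports having used.
undeclared = set(groups) - {"none"} - set(sample["_meta"]["tiers_used"])
print("undeclared sources:", sorted(undeclared))
```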

Commands

extract <path>            # parse a document → structured JSON on stdout (default)
extract schema            # print the output JSON Schema (the cross-CLI contract)
extract fields            # list extractable fields and their tier
extract demo              # run on a bundled fixture and show the narrative
extract completion bash   # emit a shell-completion script (bash|zsh)

Flags

Flag                   Meaning
--llm                  Opt-in LLM enrichment of fuzzy fields (off by default)
--fields a,b,c         Emit only a subset of top-level fields (e.g. parties,clauses)
--format json|table    Output format (default json)
--no-confidence        Omit confidence/source markers (reduced convenience view)
--json                 Force JSON to stdout (the default)
--why                  Rationale block on stderr
-q, --silent, --quiet  Suppress non-error diagnostics
--no-color             Disable ANSI color (also honors NO_COLOR / FORCE_COLOR)
-V, --version          Print extract-cli X.Y.Z

Streams follow the suite convention: stdout is the machine payload (JSON), stderr is for humans (--why, warnings, errors). Exit codes: 0 success, 1 low-signal document (e.g. a scanned/empty PDF), 2 bad usage.
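
The stream and exit-code convention above could be consumed from Python roughly as follows. This is a sketch, not an official wrapper: run_extract is a hypothetical name, and because the text does not say whether a payload is still written to stdout on exit code 1, the wrapper tolerates both cases. The demo uses a stand-in command so it runs without extract-cli installed; in practice argv would be something like ["extract", "nda.md"].

```python
import json
import subprocess
import sys
from typing import Any, Optional

# Exit codes as documented above.
EXIT_MEANINGS = {0: "success", 1: "low-signal document", 2: "bad usage"}


def run_extract(argv: list[str]) -> tuple[int, Optional[dict[str, Any]], str]:
    """Run a command and return (exit code, parsed stdout JSON or None, stderr text)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    payload: Optional[dict[str, Any]] = None
    if proc.stdout.strip():
        try:
            payload = json.loads(proc.stdout)
        except json.JSONDecodeError:
            payload = None  # stdout was not the expected machine payload
    return proc.returncode, payload, proc.stderr


# Stand-in for the real CLI: JSON on stdout, a warning on stderr, exit code 1.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print('{\"parties\": []}'); print('low signal', file=sys.stderr); sys.exit(1)",
]

code, payload, diagnostics = run_extract(stand_in)
print(code, EXIT_MEANINGS.get(code, "unknown"))
print(payload)
```

Keeping stderr separate means warnings and --why output never corrupt the JSON a downstream parser reads.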

Output shape (abridged)

{
  "document":   { "title": "...", "format": "markdown", "sha256": "…", "source_path": "nda.md" },
  "parties":    [ { "name": "Acme Robotics, Inc.", "role": "Disclosing Party", "confidence": 0.9, "source": "deterministic" } ],
  "dates":      { "effective": { "value": "2024-03-01", "confidence": 0.85, "source": "deterministic" }, "expiration": { "value": null, "confidence": 0.0, "source": "none" } },
  "term":       { "length": { "value": "3 years", ... }, "auto_renew": { "value": true, ... }, "notice_period_days": { "value": 60, ... } },
  "governing_law": { "value": "State of Delaware", "confidence": 0.85, "source": "deterministic" },
  "clauses":    [ { "canonical_title": "Confidentiality", "detected_title": "## Confidentiality Obligations", "tier": "h2", "span": {"start": 0, "end": 120}, "confidence": 0.95, "source": "deterministic", "mapped": true } ],
  "defined_terms": [ { "term": "Confidential Information", "confidence": 0.6, "source": "deterministic" } ],
  "value":      { "value": "$50,000", "confidence": 0.6, "source": "deterministic" },
  "_meta":      { "extractor_version": "0.1.0", "tiers_used": ["deterministic"], "llm_used": false }
}
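
To show how the shape above might be read, here is a small sketch that flattens a payload into a summary. The summarize helper is hypothetical, the second clause in the sample is invented for illustration, and every lookup uses .get so the function also tolerates payloads narrowed with --fields.

```python
from typing import Any


def summarize(doc: dict[str, Any]) -> dict[str, Any]:
    """Flatten an extract payload into a few headline facts."""
    clauses = doc.get("clauses", [])
    return {
        "title": doc.get("document", {}).get("title"),
        "parties": [party["name"] for party in doc.get("parties", [])],
        "effective": doc.get("dates", {}).get("effective", {}).get("value"),
        "governing_law": doc.get("governing_law", {}).get("value"),
        "clauses_mapped": sum(1 for clause in clauses if clause.get("mapped")),
        "clauses_total": len(clauses),
    }


payload = {
    "document": {"title": "Mutual NDA", "format": "markdown", "source_path": "nda.md"},
    "parties": [{"name": "Acme Robotics, Inc.", "role": "Disclosing Party",
                 "confidence": 0.9, "source": "deterministic"}],
    "dates": {"effective": {"value": "2024-03-01", "confidence": 0.85, "source": "deterministic"}},
    "governing_law": {"value": "State of Delaware", "confidence": 0.85, "source": "deterministic"},
    "clauses": [
        {"canonical_title": "Confidentiality", "mapped": True, "confidence": 0.95},
        # Invented example of a clause the alias vocabulary could not map.
        {"canonical_title": "Data Escrow", "mapped": False, "confidence": 0.4},
    ],
}

print(summarize(payload))
print(summarize({"parties": payload["parties"]}))  # a --fields parties style subset
```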

The clause map (the differentiator)

A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's "## Confidentiality" are the same clause. extract-cli reuses template-vault-cli's clause-detection cascade (Tier 1 ## H2 headings → Tier 2 bold-numbered **1. …** → Tier 3 ALL-CAPS lines) and a built-in canonical alias vocabulary to normalize foreign clause titles onto the names the rest of the suite already speaks. Clauses it can't map are kept with mapped: false (and a * in the table view) so nothing is silently dropped.

extract counterparty.pdf | jq '.clauses[] | {canonical_title, detected_title, mapped}'
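
The cascade and alias lookup described above can be illustrated with a short sketch. It is not the code extract-cli or template-vault-cli actually runs: the regular expressions, the tiny ALIASES table, and the tier labels "bold" and "caps" are assumptions made for the example (only "h2" appears in the documented output).

```python
import re
from typing import Optional

# Hypothetical alias vocabulary; the real one ships inside the tool.
ALIASES = {
    "non-disclosure": "Confidentiality",
    "confidentiality": "Confidentiality",
    "confidentiality obligations": "Confidentiality",
    "choice of law": "Governing Law",
    "governing law": "Governing Law",
}

H2 = re.compile(r"^##\s+(?P<title>.+?)\s*$")
BOLD_NUMBERED = re.compile(r"^\*\*\d+\.\s+(?P<title>.+?)\*\*\s*$")
SECTION_PREFIX = re.compile(r"^(?:section|article)\s+\d+[.:]?\s*", re.IGNORECASE)


def detect_title(line: str) -> Optional[tuple[str, str]]:
    """Return (tier, title) for a heading-like line, trying the tiers in order."""
    stripped = line.strip()
    match = H2.match(stripped)
    if match:
        return "h2", match.group("title")
    match = BOLD_NUMBERED.match(stripped)
    if match:
        return "bold", match.group("title")
    # Tier 3: the line contains letters and all of them are upper-case.
    if any(ch.isalpha() for ch in stripped) and stripped == stripped.upper():
        return "caps", stripped
    return None


def normalize(title: str) -> tuple[str, bool]:
    """Map a detected title onto a canonical name; returns (title, mapped)."""
    key = SECTION_PREFIX.sub("", title).strip(" .:").lower()
    if key in ALIASES:
        return ALIASES[key], True
    return title, False  # unmapped titles are kept, never dropped


for line in ["SECTION 7. NON-DISCLOSURE", "## Confidentiality",
             "**3. Choice of Law**", "## Data Escrow", "The parties agree as follows."]:
    detected = detect_title(line)
    print(line, "->", detected and (detected[0], *normalize(detected[1])))
```

Trying the most explicit markup first keeps an ALL-CAPS sentence inside a properly headed document from being mistaken for a clause boundary.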

Composability — piping into the rest of the suite

extract-cli is built to be the first stage of a Unix pipe. Its JSON is the contract every downstream tool reads.

# 1) Foreign NDA → review. extract normalizes clauses; nda-review runs policy.
extract counterparty_nda.pdf | nda-review review --from-extract -

# 2) Pull just the clause map and feed compare-cli to diff a foreign doc
#    against your canonical template's structure.
extract their_msa.docx --fields clauses | compare-cli align --stdin \
  --against msa/standard

# 3) Archive structured metadata for any inbound paper into the post-signature
#    vault, keyed by content hash.
extract signed_contract.pdf | contract-vault put --from-extract - \
  --id "$(extract signed_contract.pdf | jq -r .document.sha256)"

# 4) Triage a folder of inbound contracts: list governing law + parties.
#    jq reads stdin here, so input_filename would be null; pass the name via --arg.
for f in inbox/*.pdf; do
  extract "$f" --fields parties,governing_law --no-confidence \
    | jq -c --arg file "$f" '{file: $file, gov: .governing_law, parties: [.parties[].name]}'
done

# 5) Gate a workflow on extraction confidence.
extract draft.docx | jq -e '.clauses | all(.confidence > 0.7)' && echo "ok to review"

The --from-extract/--stdin flags above are the consumption points the sibling CLIs expose (or are adopting) for this contract; see docs/INTEROP.md for the shared conventions and the versioning commitment on the schema.
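
The confidence gate in example 5 can also be written in Python. One detail worth knowing: an "all" test over an empty list is vacuously true, so a document with no detected clauses would pass such a gate. The sketch below, with the hypothetical name ok_to_review, treats an empty clause list as a failure instead; whether that is the right policy depends on your workflow.

```python
from typing import Any


def ok_to_review(doc: dict[str, Any], threshold: float = 0.7) -> bool:
    """True only if at least one clause exists and every clause clears the threshold."""
    clauses = doc.get("clauses", [])
    if not clauses:
        return False  # nothing detected is not the same as everything confident
    return all(clause.get("confidence", 0.0) > threshold for clause in clauses)


confident = {"clauses": [{"confidence": 0.95}, {"confidence": 0.8}]}
shaky = {"clauses": [{"confidence": 0.95}, {"confidence": 0.6}]}
empty = {"clauses": []}

print(ok_to_review(confident))  # True
print(ok_to_review(shaky))      # False
print(ok_to_review(empty))      # False
```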

LLM configuration (opt-in)

--llm reads a shared suite config, in this order:

  1. ~/.config/contract-ops/llm.json (suite-wide — preferred)
  2. ./config/llm.json (repo-local override)

Copy config/llm.json.example to one of those paths. Configure it once and every suite tool that adopts the same lookup gets LLM features for free. Without it, --llm just warns and returns the deterministic output.
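
A sibling tool adopting the same lookup might implement it along these lines. This is a sketch under one stated assumption: the first existing file in the listed order wins. The text above also calls the repo-local file an "override", so check docs/INTEROP.md for the actual precedence. load_llm_config and default_candidates are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Optional, Sequence


def default_candidates() -> list[Path]:
    """The two documented locations, in the order listed above."""
    return [
        Path.home() / ".config" / "contract-ops" / "llm.json",  # suite-wide
        Path("config") / "llm.json",                             # repo-local
    ]


def load_llm_config(candidates: Optional[Sequence[Path]] = None) -> Optional[dict[str, Any]]:
    """Return the first config file found, or None when no file exists.

    Assumption: first match wins; the real precedence may differ.
    """
    paths = default_candidates() if candidates is None else candidates
    for path in paths:
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                return json.load(handle)
    return None


config = load_llm_config()
if config is None:
    # Mirrors the documented fallback: warn and continue with deterministic output.
    print("no LLM config found; continuing with deterministic output only")
else:
    print("LLM config loaded with keys:", sorted(config))
```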

Development

make install      # editable install with the [dev] extra
make test         # full suite
make coverage     # suite + coverage report
make typecheck    # mypy --strict
make build        # wheel + sdist
make smoke        # build, install the wheel in a clean venv, run it
make spec-check   # assert docs/spec schema == `extract schema`
make release VERSION=X.Y.Z

See ARCHITECTURE.md and CONTRIBUTING.md.

License

MIT — see LICENSE.
