Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page.
📄 BidReader
Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.
Every line item carries its page, the exact source text it came from, and an arithmetic check (qty × unit_price == amount) — verification on top of extraction, not just an LLM guess.
"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font." — r/Construction, 498 upvotes (source)
The construction-AI gold rush is all chasing the same crowded, resisted thing — autonomous takeoff. The loudest unmet pain of estimators is upstream and downstream of it: wrangling crime-scene PDFs into clean data, and catching what subcontractors quietly excluded before it costs six figures on the job.
No permissively-licensed library did this. BidReader is that primitive — MIT, pip install, runs on free LLMs, and callable from any AI agent over MCP.
Quickstart (copy-paste, ~30 seconds)
pip install bidreader
# Use any one — a FREE key works (see docs/FREE_MODELS.md):
export GEMINI_API_KEY=... # free at aistudio.google.com
# or export OPENROUTER_API_KEY=... (has :free models)
# or export REQUESTY_API_KEY=...
bidreader your_sub_quote.pdf
from bidreader import read
doc = read("sub_quote.pdf")
doc.line_items # [{section, description, qty, unit, amount, page}, ...]
doc.exclusions # [{item, quote, page, risk}, ...] <- the buried stuff
doc.scope_gaps # trade-standard scope NOT in the doc — confirm before bidding
doc.to_json()
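Once the line items are plain dicts, ordinary Python is enough to summarize them. The sketch below subtotals amounts per section and keeps the page citations together. It assumes only the keys shown in the comment above (section, amount, page); the sample rows and the helper name `subtotal_by_section` are hypothetical and not part of BidReader.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows that mirror the documented line-item keys.
line_items = [
    {"section": "Drywall", "description": "5/8in gypsum board", "qty": 100,
     "unit": "SF", "amount": 250.00, "page": 2},
    {"section": "Drywall", "description": "Metal stud framing", "qty": 50,
     "unit": "LF", "amount": 150.50, "page": 3},
    {"section": "Doors", "description": "Door W/ Frame", "qty": 2,
     "unit": "EA", "amount": 1200.00, "page": 3},
]


def subtotal_by_section(items):
    """Sum amounts per section and collect the pages each section cites."""
    totals = defaultdict(lambda: {"amount": Decimal("0"), "pages": set()})
    for item in items:
        bucket = totals[item["section"]]
        # Go through str() so float inputs do not leak binary rounding noise.
        bucket["amount"] += Decimal(str(item["amount"]))
        bucket["pages"].add(item["page"])
    return {
        section: {"amount": b["amount"], "pages": sorted(b["pages"])}
        for section, b in totals.items()
    }


summary = subtotal_by_section(line_items)
for section, info in summary.items():
    print(f"{section}: {info['amount']} (pages {info['pages']})")
```

Comparing the sum of these subtotals with the total printed on the quote is a quick sanity check before the numbers go into a bid.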
Real output
On a real $324,240.61 drywall estimate (72 line items, scanned in seconds), BidReader's scope engine caught a genuinely expensive hole:
!! SCOPE GAPS TO CONFIRM:
- Finishing (taping, mudding, sanding) -- the gypsum line items price the BOARD
only, not the finishing labor to reach a paint-ready surface.
- Door hardware -- "Door W/ Frame" lines don't include hinges/locks/closers.
- Firestopping at rated assemblies -- life-safety scope, commonly omitted.
On a real 25-page multi-trade GC estimate, it parsed 959 line items across 16 CSI divisions (demolition → concrete → steel → finishes → plumbing → fire suppression), each page-cited. See docs/RESULTS.md and a full worked example in examples/.
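To make the idea of a scope gap concrete, here is a deliberately naive keyword checklist. It is not BidReader's scope engine, which this README describes as LLM-based; the checklist contents, the name `EXPECTED_SCOPE`, and the function `find_scope_gaps` are all hypothetical.

```python
# Hypothetical checklist: scope a drywall quote would normally be expected
# to cover, with a few keywords that signal each item is present.
EXPECTED_SCOPE = {
    "drywall": {
        "finishing": ["tape", "taping", "mud", "sand", "finish"],
        "firestopping": ["firestop", "fire stop", "fire-stop"],
    }
}


def find_scope_gaps(trade, descriptions):
    """Return expected scope items that no line-item description mentions."""
    text = " ".join(d.lower() for d in descriptions)
    gaps = []
    for scope, keywords in EXPECTED_SCOPE[trade].items():
        if not any(keyword in text for keyword in keywords):
            gaps.append(scope)
    return gaps


descriptions = [
    "5/8in gypsum board, hung",
    "Door W/ Frame",
    "Firestop sealant at rated walls",
]
gaps = find_scope_gaps("drywall", descriptions)
print(gaps)
```

Plain substring matching like this misses paraphrases and cannot tell "firestopping included" from "firestopping excluded", which is one reason a language model is a better fit for the real check.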
Use it from an AI agent (MCP)
pip install "bidreader[mcp]"
{ "mcpServers": { "bidreader": {
"command": "bidreader-mcp",
"env": { "GEMINI_API_KEY": "..." }
}}}
Tools: read_document, catch_exclusions, extract_line_items. Now your agent can answer "which subs excluded fire-stopping across this bid folder?" Full guide: docs/MCP.md.
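The cross-folder question above comes down to filtering exclusion records from several quotes. A minimal sketch, assuming each exclusion is a dict with the item, quote, and page keys shown in the quickstart; the sample data, file names, and the helper `find_exclusions` are hypothetical, and nothing here calls the MCP server itself.

```python
def find_exclusions(quotes, keywords):
    """Search exclusions from many quotes for any of the given keywords.

    quotes maps a file name to that document's list of exclusion dicts.
    Returns (file name, page, quoted source text) for every match.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for name, exclusions in quotes.items():
        for exc in exclusions:
            haystack = f'{exc["item"]} {exc["quote"]}'.lower()
            if any(k in haystack for k in lowered):
                hits.append((name, exc["page"], exc["quote"]))
    return hits


# Hypothetical exclusions, as if collected from two sub-quotes.
quotes = {
    "acme_drywall.pdf": [
        {"item": "Firestopping", "quote": "Fire-stopping by others.", "page": 4},
        {"item": "Trash removal", "quote": "Excludes trash removal.", "page": 4},
    ],
    "best_interiors.pdf": [
        {"item": "Permits", "quote": "Permit fees not included.", "page": 2},
    ],
}

hits = find_exclusions(quotes, ["firestop", "fire-stop"])
for name, page, quote in hits:
    print(f"{name} p.{page}: {quote}")
```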
How it works
PDF (sub-quote / bid package / spec / schedule)
→ page-tagged text extraction (PyMuPDF)
→ chunk by page (scales to 25+ page, 900+ line-item estimates)
→ LLM structured extraction (line items · exclusions · assumptions · alternates · scope gaps)
→ merge + page-cited output (JSON / CLI / MCP)
Text-based, so it runs great on free models — see docs/FREE_MODELS.md.
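The chunk and merge steps of the pipeline above can be sketched in plain Python. This is an illustration under stated assumptions, not BidReader's code: the page text would come from the PyMuPDF step, the function names are hypothetical, and `fake_extract` is a stand-in for the LLM call that only echoes each page so the merge has something to combine.

```python
def chunk_by_page(pages, pages_per_chunk=2):
    """Group (page_number, text) pairs into fixed-size chunks of pages."""
    return [
        pages[i:i + pages_per_chunk]
        for i in range(0, len(pages), pages_per_chunk)
    ]


def render_chunk(chunk):
    """Build prompt text that keeps an explicit page tag ahead of each page."""
    return "\n".join(f"[page {number}]\n{text}" for number, text in chunk)


def fake_extract(chunk):
    """Stand-in for the LLM extraction call (assumption, not the real API)."""
    return {
        "line_items": [{"description": text, "page": n} for n, text in chunk],
        "exclusions": [],
    }


def merge(results):
    """Concatenate per-chunk results into one document-level result."""
    merged = {"line_items": [], "exclusions": []}
    for result in results:
        for key in merged:
            merged[key].extend(result.get(key, []))
    return merged


pages = [(1, "Demolition"), (2, "Concrete"), (3, "Steel")]
chunks = chunk_by_page(pages, pages_per_chunk=2)
document = merge(fake_extract(chunk) for chunk in chunks)
print(render_chunk(chunks[0]))
print([item["page"] for item in document["line_items"]])
```

Keeping the page tag inside each chunk is what lets every extracted value carry a page citation after the merge.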
Benchmark
Reproducible ground-truth benchmark (benchmark/) — synthetic docs we author, so truth is exact and the PDFs ship in-repo:
| metric | score |
|---|---|
| Line-item recall | 100% |
| Exclusion-catch recall (incl. prose-buried) | 100% |
| No-hallucination rate (clean docs) | 100% |
| Bid-total accuracy (±2%) | 100% |
| Arithmetic errors caught | 2/2, 0 false positives |
Honest caveat: synthetic docs are cleaner than real scans — these are an upper bound on well-structured input, not a claim about messy real bids. Uncontrolled real-document results are in docs/RESULTS.md. Reproduce: python benchmark/generate.py && python benchmark/run.py.
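The "arithmetic errors caught" row refers to the qty × unit_price == amount check described at the top of this README. A minimal sketch of such a check follows, assuming each item exposes qty, unit_price, and amount; the one-cent tolerance and the function name `check_arithmetic` are illustrative assumptions, not BidReader's implementation.

```python
from decimal import Decimal


def check_arithmetic(item, tolerance=Decimal("0.01")):
    """Return True when qty * unit_price matches amount within the tolerance."""
    # Convert through str() so float inputs such as 2.35 stay exact decimals.
    qty = Decimal(str(item["qty"]))
    unit_price = Decimal(str(item["unit_price"]))
    amount = Decimal(str(item["amount"]))
    return abs(qty * unit_price - amount) <= tolerance


# Hypothetical rows: the first is consistent, the second has swapped digits.
good = {"qty": 120, "unit_price": 2.35, "amount": 282.00}
bad = {"qty": 10, "unit_price": 45.50, "amount": 545.00}

print(check_arithmetic(good))
print(check_arithmetic(bad))
```

Decimal is used instead of float so that a correct line is not flagged because of binary rounding, which is how a check like this can avoid false positives.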
Why this, and why now — the evidence
A full write-up (problem, market data, prior-art gap, method, results) is in PAPER.md. The short version:
- Loudest, most-shared pain in construction-estimating communities (the 498-upvote thread above; more cited in the paper).
- It works today — document extraction is LLM-native, unlike floor-plan symbol detection (academic SOTA tops out ~83% mAP).
- Empty slot: bidreader, blueprint-parser, and pytakeoff were all unclaimed on PyPI; the only adjacent tools are AGPL/non-commercial or abandoned toys.
- Broadest base: every estimator and every construction-AI builder needs document extraction. The library is the dependency; the MCP server is the agent-era surface.
Roadmap
- Scanned-PDF vision OCR path
- Revision/addendum diff ("what changed between Addendum 3 and 4")
- Excel/CSV BOQ export + multi-quote leveling (compare subs side-by-side)
- Region/trade notation packs (AISC, BS/IS, AUS)
Contributing
PRs welcome — see CONTRIBUTING.md. Good first issues: add a notation parser, a new export format, or a test fixture.
License
MIT © 2026. Cite via CITATION.cff.
File details
Details for the file bidreader-0.5.0.tar.gz.
File metadata
- Download URL: bidreader-0.5.0.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef46cf95689173af52c1f0e0a5d5b61fadab28570ea290ae47a2752c7009354f |
| MD5 | b4be8240debe0a09123b531dfe24a321 |
| BLAKE2b-256 | 9d1fea52a6d5fec36701aae274c64210f91ae49fccf3f269e42a3ca9c81f0adc |
File details
Details for the file bidreader-0.5.0-py3-none-any.whl.
File metadata
- Download URL: bidreader-0.5.0-py3-none-any.whl
- Upload date:
- Size: 11.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 007248eeae84e65bebf3f3ebd47e074358b47df943c51cc3e2a3669ced755b57 |
| MD5 | 68177a792b442f5655e7c7ff7377b1ba |
| BLAKE2b-256 | 332ae2675a35f6ef95ea4776b7b3b508550bbe83a2dc7d468c65ba2071bec5f6 |