Mdtero local CLI for source-first paper acquisition, parsing, Zotero import, and RAG handoff.
Mdtero Private Backend
This repository is the production backend source of truth for Mdtero.
Public beta messaging lives in the public repos. This repository stays focused on:
- FastAPI services and task orchestration
- auth, API keys, billing, and usage accounting
- parsing, translation, artifact packaging, and retrieval adapters
- source-first agent skill documents served from api.mdtero.com
Public / Private Split
- mdtero: website, dashboard, guide, API docs, and public extension source/install guidance
- mdtero-backend: backend-only services, deployment assets, source-routing CLI surfaces, and private operator docs
- Current first-party extension scope is limited to web OAuth login/session handoff, parse, translate, PDF/EPUB upload, artifact download, and install guidance; do not add extension UI/source here.
Deployment Truth
- Production mdtero-api is served from OCI SG AMD-2 behind Caddy at https://api.mdtero.com; pushes to main no longer deploy Cloud Run automatically.
- Use the manual GitHub Actions workflow in .github/workflows/deploy-backend.yml to run tests, build/push the production image to GHCR as ghcr.io/jonbinc/mdtero-api, execute the deploy job on the AMD-2 self-hosted runner, update /opt/mdtero-api/docker-compose.yml, and smoke-test production. See docs/ops/amd2-backend-deploy.md.
- cloudbuild.yaml is retained as a rollback-only Cloud Run definition. Do not attach it to an automatic Cloud Build trigger unless intentionally rolling production back to Cloud Run.
- Local filesystem storage on AMD-2 is the production default artifact backend: STORAGE_BACKEND=local, LOCAL_STORAGE_DIR=/app/storage, host path /opt/mdtero-api/storage, container path /app/storage.
- GCS artifact storage remains rollback-only / legacy compatibility; do not treat GCS_BUCKET_NAME or GOOGLE_APPLICATION_CREDENTIALS as the default AMD-2 production path.
- Live discovery search is production-enabled only after the runtime secret store contains the provider key; bind OPENALEX_API_KEY and set MDTERO_ENABLE_DISCOVERY_SEARCH=true in the same rollout. Do not put the key value in source, and do not reference a missing secret from the default deploy path.
- Standalone GROBID is retired and is not part of the GitHub/Cloud Run production deployment chain; docs/ops/standalone-grobid-cloud-run.md is kept only as an archival note.
- Direct upload-PDF / cloud PDF handling is the maintained PDF path. The longer-term direction is to move the remaining scholarly PDF credentials and routing into MinerU-managed production configuration rather than bringing back a separate GROBID deploy.
- This repository is the only backend code SSOT; do not recreate a shadow backend under mdtero/private/backend or any other parallel path.
- Do not assume a push to mdtero deploys the public backend API.
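The storage and discovery-search settings above can be read as a small configuration rule. The following is a minimal sketch, not the backend's actual settings code: the `RuntimeConfig` class and `load_runtime_config` function are hypothetical names, and the fallback defaults are an assumption drawn from the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical container for the deployment settings named above."""
    storage_backend: str
    local_storage_dir: str
    discovery_search_enabled: bool


def load_runtime_config(env: Optional[Mapping[str, str]] = None) -> RuntimeConfig:
    """Read settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    flag_on = env.get("MDTERO_ENABLE_DISCOVERY_SEARCH", "").strip().lower() == "true"
    has_key = bool(env.get("OPENALEX_API_KEY", "").strip())
    return RuntimeConfig(
        # Assumed defaults: local filesystem storage at the container path.
        storage_backend=env.get("STORAGE_BACKEND", "local"),
        local_storage_dir=env.get("LOCAL_STORAGE_DIR", "/app/storage"),
        # Discovery search is on only when the flag and the provider key are both present.
        discovery_search_enabled=flag_on and has_key,
    )
```

Requiring both values keeps a rollout that sets the flag without binding the secret from silently enabling a search path that cannot authenticate.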
When the public install flow changes, update the matching skill or install guide here in the same round.
Install / Bootstrap Truth
- The formal mdtero CLI release path is PyPI + uv tool install mdtero.
- curl -Ls https://mdtero.com/install.sh | sh -s -- --agent <target> remains a legacy/bootstrap path until the public install surfaces are fully switched over.
- npm install -g mdtero-install@0.1.8 is a legacy/bootstrap package, not the formal mdtero CLI release channel.
- This repository already defines the intended Python package in pyproject.toml: project.name = "mdtero" and project.scripts.mdtero = "mdtero_cli:main".
- npx mdtero-install install <target> installs the matching agent skill bundle; it is not part of the formal mdtero CLI release gate.
- uv tool install mdtero is the formal user install path for the published Python package; that package includes the local CLI surfaces for curl_cffi acquisition, Zotero import, RAG, MCP handoff, parse, translate, status, and download.
- OpenClaw remains separate: use clawhub install mdtero, not the install-script target list.
Maintainer-only Python package smoke before publishing:
python3.12 -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt
./.venv/bin/python -m pip wheel --no-deps -w /tmp/mdtero-wheel .
UV_TOOL_DIR=/tmp/mdtero-tools UV_TOOL_BIN_DIR=/tmp/mdtero-bin \
uv tool install /tmp/mdtero-wheel/mdtero-*.whl --python 3.12
/tmp/mdtero-bin/mdtero --help
The smoke must show one installed mdtero executable and mdtero --help must list parse-files, zotero import, rag serve, and the other local runtime commands. Publishing to PyPI is a separate maintainer action and is the only formal external release step for the CLI.
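The help-output check described above could be automated along these lines. This is a sketch under assumptions: the helper names are hypothetical, only the three commands named above are checked, and a multi-word command such as zotero import is treated as listed when each of its words appears somewhere in the help text, which may not match how the real CLI lays out subcommands.

```python
import subprocess

# Commands the smoke check above expects to see in the help output.
REQUIRED_COMMANDS = ("parse-files", "zotero import", "rag serve")


def missing_commands(help_text: str, required=REQUIRED_COMMANDS) -> list:
    """Return required commands whose words do not all appear in help_text."""
    listed_words = set(help_text.split())
    return [
        command
        for command in required
        if not all(word in listed_words for word in command.split())
    ]


def read_help(executable: str = "/tmp/mdtero-bin/mdtero") -> str:
    """Run `<executable> --help` and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A maintainer script could call `missing_commands(read_help())` after the `uv tool install` step and fail the smoke when the returned list is not empty.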
For maintainers, the repo now separates package verification from external release:
- .github/workflows/package-mdtero-cli.yml builds and smokes the package without publishing.
- .github/workflows/publish-mdtero-cli.yml is the manual Trusted Publishing path for PyPI after the mdtero project on PyPI is linked to this GitHub repository.
- Before the first external publish, create the PyPI account/project binding for mdtero, enable Trusted Publishing for this repository/workflow, and then trigger the publish workflow manually.
Repository Map
- service: app routes, orchestration, auth, billing, and provider adapters
- tests/service: backend test coverage
- service/legacy_parser: quarantined compatibility parser code that no longer owns runtime design
- tests/legacy_parser: regression coverage for compatibility parser code
- skills: install and workflow documents served to agents
- scripts: operator and migration utilities
- docs: internal product and engineering notes
- docs/partner: partner-facing API, capability, and maturity package
- docs/feedback_audit_ssot.md: quality feedback review SSOT for human and AI-assisted auditing
- docs/architecture/backend-ssot-refactor-blueprint.md: backend-first SSOT and readability refactor blueprint
- docs/superpowers/specs/2026-05-06-cloud-parse-sdk-design.md: planned Python import API boundary for cloud parse tasks
- docs/PYTHON_SURFACE_MAP.md: runtime-vs-compatibility ownership map for Python files
- docs/RUNTIME_BOUNDARY_RULES.md: repository boundary rules for runtime owners vs. labs, runs, tmp, and scripts
- docs/SCRIPT_SURFACE_INDEX.md: maintained script-bucket ownership index
Experimental Parsing Progress
The production parse chain is still the release gate.
In parallel, the experimental parser_v2 line under service/parser_v2 is now materially ahead on architecture and source coverage:
- shared AST -> Markdown kernel
- JATS, Elsevier XML, and TEI importers
- server-side curl_cffi HTML/XML/EPUB parsing with publisher adapters and quality gates
- local/OA EPUB parsing
- experimental PDF -> GROBID -> TEI -> AST -> Markdown fallback
- experimental unified upload entrypoint POST /tasks/parse-fulltext-v2 for structured full-text handoff
- authenticated runtime diagnostics entrypoint GET /diagnostics/parser-v2/shadow for shadow-flag visibility
Mainline adoption status on 2026-03-25:
- arxiv_native is already executed through the V2 AST/Markdown kernel on the production parse path
- generic structured XML uploads now route through the V2 structured-XML path for supported families (Elsevier XML, JATS, TEI) even when entering from the legacy /tasks/parse-upload surface
- remote structured-fulltext routes that land on uploaded XML parsing therefore also benefit from the same V2 normalization path
- PDF -> GROBID -> TEI -> AST -> Markdown is now considered the only maintained scholarly PDF fallback path through the V2 upload surfaces, and should remain low-profile rather than a promoted primary route
- non-PDF project attachments are being standardized behind a separate MarkItDown sidecar plan so generic file ingestion does not pollute the scholarly parser runtime
- Playwright, browser extension capture, browser bridge, helper-bundle upload, and helper self-update are retired from the maintained CLI/runtime path
- legacy XML upload wrappers are now thin delegators into service/parser_v2/uploaded_parse.py
- production parse subprocess now enters through python -m service.parse_cli, which keeps the runtime entry stable while the underlying implementation continues migrating
- service.parse_cli now delegates into service/parser_v2/cli.py instead of importing the root legacy parser directly
- the arXiv runtime flow is now owned by service/parser_v2/arxiv_runtime.py; legacy arXiv compatibility code lives under service/legacy_parser/
- AI-markdown sidecar rendering is now owned by service/parser_v2/markdown.py rather than the root parser module
- structured XML figure/table asset localization should now be source-first:
  - importers should preserve native figure references where available
  - missing assets should be fused from publisher HTML/PDF/MinerU sources where lawful and available
- raw full-text cache policy is now short-lived by default:
  - L1 ephemeral execution cache stays at 24h
  - L2 user-private raw cache now defaults to 7 days
  - L4 public-open raw cache now defaults to 7 days
- source-first production rollout should prefer native/API/XML/EPUB/HTML routes before PDF fallback:
  - /tasks/parse should route DOI / URL acquisition through server/native/legal machine-friendly formats when available
  - local-only content should enter through direct upload or Zotero attachment import, not browser automation
- CLI / signed-in frontend can inspect current connector shadow posture through GET /diagnostics/parser-v2/shadow
- local project authority for the helper CLI now starts with mdtero init and can be inspected with mdtero status --json, which reports .mdtero/identity, config-source precedence, and stored diagnostics without implying parse readiness
- legacy parser compatibility code now lives under service/legacy_parser/; repository root should not host parser entrypoints
- the current Python ownership map and archive boundary are tracked in docs/PYTHON_SURFACE_MAP.md
- repository boundary rules for runs/, labs/, tmp/, and scripts/ are tracked in docs/RUNTIME_BOUNDARY_RULES.md
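The raw full-text cache defaults in the list above (L1 at 24h, L2 and L4 at 7 days) amount to a per-tier time-to-live table. A minimal sketch of an expiry check follows; the `RAW_CACHE_TTL` mapping and `is_expired` helper are hypothetical names for illustration, and the backend may store and enforce these values differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default lifetimes per cache tier, taken from the policy listed above.
RAW_CACHE_TTL = {
    "L1": timedelta(hours=24),  # ephemeral execution cache
    "L2": timedelta(days=7),    # user-private raw cache
    "L4": timedelta(days=7),    # public-open raw cache
}


def is_expired(tier: str, stored_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once an entry has lived for at least its tier's lifetime."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at >= RAW_CACHE_TTL[tier]
```

An unknown tier raises `KeyError` here, so a typo in a tier name cannot be mistaken for a never-expiring entry.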
Current experimentally validated connector status:
- promotion-ready now:
  - arxiv_native
  - europe_pmc
  - plos
  - elife_jats_xml
  - biorxiv
  - medrxiv
  - mdpi_epub_asset
  - springer_openaccess_api
  - elsevier_article_retrieval_api
  - springer_subscription_connector
  - wiley_tdm
  - taylor_francis_tdm
Important rollout interpretation:
- promotion-ready means a connector is strong enough for shadow / feature-flag, not that it should immediately become the new global production default
- the current single-source summary for shadow vs default cutover is:
  - labs/vendor_promotion_validation/MATURITY_MATRIX.md
  - runs/vendor_promotion_validation/shadow-rollout.json
- as of 2026-03-27, the practical posture is:
  - already-live V2 behavior: arxiv_native, uploaded structured XML, uploaded PDF fallback, source-first HTML/XML/EPUB routing
  - next shadow-first connectors: springer_subscription_connector, wiley_tdm, taylor_francis_tdm, springer_openaccess_api, elsevier_article_retrieval_api
  - still not valid to describe as default production automatic acquisition: Wiley browser-bridge HTML, MDPI direct server fetch, Taylor & Francis OA EPUB
Practical acquisition interpretation:
- Europe PMC, PLOS, eLife, Springer OA, bioRxiv, medRxiv, and MDPI EPUB are already executable open structured routes
- Elsevier is modeled as api_first under user entitlement or authorized API environment; Mdtero does not imply public access
- Elsevier XML uploaded or fetched through authorized acquisition lands on the V2 structured importer path on the main backend surface
- Wiley, Springer subscription, and Taylor & Francis should prefer source-first HTML/XML/PDF routes where curl_cffi or official APIs can fetch without browser automation
- Wiley has experimental official TDM evidence:
  - Wiley TDM PDF -> GROBID -> Markdown
  - live validation on this machine succeeded for 10.1002/er.7490, 10.1002/er.6487, and 10.1002/sam.11700
  - observed fetch time was about 2.5s - 4.1s, with end-to-end PDF -> Markdown around 8s on a warm local GROBID container
- Browser acquisition is no longer a product route. Historical browser-extension and Playwright validation remains useful as archive evidence only; do not revive it as fallback.
Current architectural direction:
- server should increasingly act as discovery, routing, parsing, normalization, rendering, and structured persistence
- production acquisition should default to source-first server/native routes or explicit user-provided files/attachments
- helper/browser-first is retired; direct server-side DOI/URL fetching with lawful machine-friendly formats is the release target when quality gates pass
- local project ingest now has a project-owned pre-parse ledger: mdtero ingest records DOI/URL/local-file provenance into .mdtero/state/ingest-ledger.sqlite3, and mdtero papers projects honest readiness states such as metadata_only, oa_location_found, fulltext_staged, and manual_action_required without implying parse success
- local project parse now extends that same ledger with append-only parse_attempts, and mdtero parse --json operates on existing ingested project records instead of bypassing project authority
- local project Zotero import now supports both fixture replay and real read-only library access: mdtero zotero import --fixture tests/service/fixtures/zotero/sample-library.json --json replays fixture data, while mdtero zotero import --library-id <id> --library-type user|group --api-key <key> --json (or --local for Zotero local API) records durable Zotero mappings plus attachment discovery into the same ledger without promising sync or write-back
- local project RAG/chat now extends the same ledger with rag_builds and rag_chunks; mdtero rag build --json indexes parsed Markdown into project-owned retrieval state and mdtero chat --json answers from local lexical evidence while preserving the shared grounded-chat response shape
- local project dashboard now projects the same ledger through mdtero dashboard --json and plain-text mdtero dashboard, keeping planned_lane, actual_lane, parser_label, artifact_outcome, reason_code, Zotero attachment evidence, RAG readiness, and derived operator actions visible without introducing dashboard-owned state
- the first local parse path is intentionally narrow: staged local PDFs only, routed through the uploaded-PDF adapter seam so backend-owned lane/parser/failure vocabulary remains authoritative
- mdtero papers --json now exposes latest parse status plus durable parse history so downstream Zotero/RAG/TUI slices can consume the same local truth without reparsing CLI text
- server-side fetch should remain an optional convenience or coverage fallback, not the primary production ingestion posture
- PDF remains fallback, not the primary route
- when PDF fallback is used, GROBID is the only maintained engine on the scholarly parse path
- generic project-file fallback belongs in the isolated MarkItDown sidecar track, not in the default backend runtime dependency set
- canonical route semantics in the experimental line are now source_first, jats_or_structured_xml_first, api_first, html_helper_first, epub_first, pdf_fallback_only, and legacy_parse
- the planned Python import API should be a cloud parse SDK: from mdtero import Mdtero wraps hosted parse tasks and artifact download, while local parser modules remain backend internals
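The append-only parse_attempts idea above, where the latest status and the full history both come from the same rows, can be illustrated with the standard sqlite3 module. This is a sketch only: the table columns, the function names, and the example outcome values are hypothetical, and the real ledger schema in .mdtero/state/ingest-ledger.sqlite3 is not documented here.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open a ledger and create an illustrative parse_attempts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS parse_attempts (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            paper_id TEXT NOT NULL,
            actual_lane TEXT NOT NULL,
            artifact_outcome TEXT NOT NULL,
            reason_code TEXT
        )
        """
    )
    return conn


def record_attempt(conn, paper_id, actual_lane, artifact_outcome, reason_code=None):
    """Append one attempt; existing rows are never updated or deleted."""
    cursor = conn.execute(
        "INSERT INTO parse_attempts (paper_id, actual_lane, artifact_outcome, reason_code)"
        " VALUES (?, ?, ?, ?)",
        (paper_id, actual_lane, artifact_outcome, reason_code),
    )
    conn.commit()
    return cursor.lastrowid


def latest_attempt(conn, paper_id):
    """Return the most recent attempt as a tuple, or None if there is none."""
    return conn.execute(
        "SELECT actual_lane, artifact_outcome, reason_code FROM parse_attempts"
        " WHERE paper_id = ? ORDER BY id DESC LIMIT 1",
        (paper_id,),
    ).fetchone()


def attempt_history(conn, paper_id):
    """Return every attempt for a paper, oldest first."""
    return conn.execute(
        "SELECT actual_lane, artifact_outcome, reason_code FROM parse_attempts"
        " WHERE paper_id = ? ORDER BY id ASC",
        (paper_id,),
    ).fetchall()
```

Because rows are only ever inserted, a failed attempt stays visible after a later success, which is what lets a status view and a history view agree without a second source of state.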
Primary internal references:
- docs/superpowers/specs/2026-03-25-parser-kernel-v2-design.md
- docs/superpowers/specs/2026-03-25-parser-kernel-v2-alignment-audit.md
- docs/superpowers/specs/2026-03-25-helper-extension-browser-bridge-design.md
- labs/publisher_ingestion_probe/README.md
- labs/local_helper_playbook/README.md
- labs/vendor_promotion_validation/README.md
Grounded Chat Contract
The shared workspace Notebook rail depends on POST /threads/{thread_id}/messages as the backend SSOT for grounded project chat.
Current host posture:
- apps/site-next consumes this same thread-message contract through host-local transport layers
- frontend adapters normalize the payload before it reaches shared workspace components, so the backend response shape here remains the SSOT contract rather than a UI-only fork
- grounded chat is a workspace enhancement layer over parsed Mdtero documents, not a replacement for parsing and not a cross-project global assistant
Current V1 contract:
- request body always accepts content
- request body may also carry:
  - scope_type: document, selection, or project
  - document_ids: selected project document ids
  - mode: grounded or synthesis
  - citation_limit: optional citation cap for the returned answer
- backward compatibility is preserved:
  - plain { "content": "..." } still works
  - omitted scope defaults to project
  - omitted mode defaults to grounded
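The request defaults above can be expressed as a small normalization step. The sketch below is an illustration rather than the route's actual validation code: `normalize_message_request` is a hypothetical helper, and treating an empty content string as an error is an assumption.

```python
from typing import Any, Dict

VALID_SCOPES = {"document", "selection", "project"}
VALID_MODES = {"grounded", "synthesis"}


def normalize_message_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in the documented defaults for a thread-message request body."""
    content = body.get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("content must be a non-empty string")  # assumed rule
    scope_type = body.get("scope_type") or "project"  # omitted scope -> project
    mode = body.get("mode") or "grounded"             # omitted mode -> grounded
    if scope_type not in VALID_SCOPES:
        raise ValueError(f"unsupported scope_type: {scope_type}")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return {
        "content": content,
        "scope_type": scope_type,
        "document_ids": list(body.get("document_ids") or []),
        "mode": mode,
        "citation_limit": body.get("citation_limit"),
    }
```

With this shape, a legacy client that sends only `{"content": "..."}` and a newer client that sends every field end up on the same code path.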
Current V1 response shape includes:
- answer
- citations
- used_embeddings
- retrieval_strategy
- scope_summary
Retrieval posture on 2026-04-23:
- default retrieval strategy is lexical_v1
- used_embeddings stays false in the default path
- local project chat now reuses this same contract shape through mdtero chat --json, with evidence restricted to project-owned rag_chunks
- scope-aware retrieval is supported for:
  - single document
  - selected documents inside a project
  - whole-project search
- message_citations remains the persistence SSOT for assistant citation rows on the backend thread route; the local helper path projects compatible citation payloads without inventing a second response contract
- answer generation now flows through a backend adapter so provider-specific behavior can change later without changing the route contract or dropping citation traceability
- future user-supplied LLM keys or embedding providers may sit behind this adapter, but Mdtero-owned parsing, retrieval scope, and citation payloads remain the fixed contract
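To make the scope-aware, embedding-free posture concrete, here is a toy term-overlap retriever that fills the documented response keys. It is not the lexical_v1 implementation, whose scoring is not described here; the chunk dictionaries, the function names, and the plain-string `scope_summary` are all assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, document_ids=None, citation_limit=3):
    """Rank chunks by shared terms, optionally restricted to selected documents."""
    terms = tokenize(question)
    scored = []
    for chunk in chunks:
        # Scope filter: None means whole-project search.
        if document_ids is not None and chunk["document_id"] not in document_ids:
            continue
        overlap = len(terms & tokenize(chunk["text"]))
        if overlap:
            scored.append((overlap, chunk))
    # Stable sort keeps input order among equally scored chunks.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:citation_limit]]


def build_response(answer, citations, scope_summary):
    """Assemble a payload with the V1 response keys listed earlier."""
    return {
        "answer": answer,
        "citations": citations,
        "used_embeddings": False,          # default path uses no embeddings
        "retrieval_strategy": "lexical_v1",
        "scope_summary": scope_summary,
    }
```

Passing a one-element `document_ids` set gives single-document scope, a larger set gives selected-document scope, and `None` searches the whole project.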
Local Development
python3.12 -m venv .venv
source .venv/bin/activate
python3 -m unittest discover -s tests -v
python3 -m uvicorn service.main:app --reload
Python 3.12 is the intended local baseline. Parts of the service test surface now use modern union syntax such as str | None, so older 3.9-only virtualenvs will fail during collection even before the relevant backend code runs.
Install surface guidance:
- pip install -r requirements.txt
  - full developer install; includes both runtime deps and local/shadow tooling
- pip install -r requirements-prod.txt
  - lean deployed/runtime surface only; this is what the production image installs
  - do not add local replay / browser / scraping extras here just because they are useful on a laptop
- pip install -r requirements-local.txt
  - local source-first and shadow add-ons such as curl_cffi, pyzotero, paperscraper, and pubmed-parser
  - these are local Python backend/tooling dependencies, not packages bundled into the npm CLI or browser extension
Typical local developer bootstrap:
pip install -r requirements.txt
Pytest-backed service suite:
pip install -r requirements-test.txt
./.venv/bin/pytest tests/service -q
Full repository pytest sweep:
./.venv/bin/pytest -q
Canonical pre-launch readiness command
The older representative journey gate entrypoint has been archived and is no longer an active backend validation surface. Historical journey-gate materials now live under archive/validation/.
Current active verification entrypoints should be taken from the live parsing and topic-batch surfaces documented elsewhere in this README rather than from the archived journey gate.
Maintenance Rules
- keep credentials, publisher-side helpers, and operator workflows private
- prefer tested behavior in
service/over one-off scripts - keep agent install docs aligned with the current beta onboarding path
- treat this repository as the release gate for production auth, billing, parsing, translation, and helper-serving behavior