Mdtero local CLI for source-first paper acquisition, parsing, Zotero import, and RAG handoff.

Mdtero Private Backend

This repository is the production backend source of truth for Mdtero.

Public beta messaging lives in the public repos. This repository stays focused on:

  • FastAPI services and task orchestration
  • auth, API keys, billing, and usage accounting
  • parsing, translation, artifact packaging, and retrieval adapters
  • source-first agent skill documents served from api.mdtero.com

Public / Private Split

  • mdtero: website, dashboard, guide, API docs, and public extension source/install guidance
  • mdtero-backend: backend-only services, deployment assets, source-routing CLI surfaces, and private operator docs
  • Current first-party extension scope is limited to web OAuth login/session handoff, parse, translate, PDF/EPUB upload, artifact download, and install guidance; do not add extension UI/source here.

Deployment Truth

  • Production mdtero-api is served from OCI SG AMD-2 behind Caddy at https://api.mdtero.com; pushes to main no longer deploy Cloud Run automatically.
  • Use the manual GitHub Actions workflow in .github/workflows/deploy-backend.yml to run tests, build/push the production image to GHCR as ghcr.io/jonbinc/mdtero-api, execute the deploy job on the AMD-2 self-hosted runner, update /opt/mdtero-api/docker-compose.yml, and smoke-test production. See docs/ops/amd2-backend-deploy.md.
  • cloudbuild.yaml is retained as a rollback-only Cloud Run definition. Do not attach it to an automatic Cloud Build trigger unless intentionally rolling production back to Cloud Run.
  • Local filesystem storage on AMD-2 is the production default artifact backend: STORAGE_BACKEND=local, LOCAL_STORAGE_DIR=/app/storage, host path /opt/mdtero-api/storage, container path /app/storage.
  • GCS artifact storage remains rollback-only / legacy compatibility; do not treat GCS_BUCKET_NAME or GOOGLE_APPLICATION_CREDENTIALS as the default AMD-2 production path.
  • Live discovery search is production-enabled only after the runtime secret store contains the provider key; bind OPENALEX_API_KEY and set MDTERO_ENABLE_DISCOVERY_SEARCH=true in the same rollout. Do not put the key value in source, and do not reference a missing secret from the default deploy path.
  • Standalone GROBID is retired and is not part of the GitHub/Cloud Run production deployment chain; docs/ops/standalone-grobid-cloud-run.md is kept only as an archival note.
  • Direct upload-PDF / cloud PDF handling is the maintained PDF path. The longer-term direction is to move the remaining scholarly PDF credentials and routing into MinerU-managed production configuration rather than bringing back a separate GROBID deploy.
  • This repository is the only backend code SSOT; do not recreate a shadow backend under mdtero/private/backend or any other parallel path.
  • Do not assume a push to mdtero deploys the public backend API.
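The storage default and the discovery-search gate described above can be sketched as a small settings resolver. This is only an illustration under stated assumptions: the environment variable names come from the bullets above, while RuntimeConfig and load_runtime_config are hypothetical names and not the backend's actual configuration code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical container for the settings named in this section."""
    storage_backend: str
    local_storage_dir: str
    discovery_search_enabled: bool


def load_runtime_config(env=None) -> RuntimeConfig:
    """Resolve storage and discovery settings from an environment mapping."""
    env = os.environ if env is None else env
    # Local filesystem storage is the documented AMD-2 production default.
    backend = env.get("STORAGE_BACKEND", "local")
    storage_dir = env.get("LOCAL_STORAGE_DIR", "/app/storage")
    flag = env.get("MDTERO_ENABLE_DISCOVERY_SEARCH", "").strip().lower() == "true"
    has_key = bool(env.get("OPENALEX_API_KEY", "").strip())
    # Discovery search needs BOTH the flag and the provider key in the
    # same rollout; the flag alone must not enable it.
    return RuntimeConfig(backend, storage_dir, flag and has_key)
```

Passing a plain dict instead of os.environ keeps the resolver easy to exercise in tests without touching real secrets.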

When the public install flow changes, update the matching skill or install guide here in the same round.

Install / Bootstrap Truth

  • The formal mdtero CLI release path is PyPI + uv tool install mdtero.
  • curl -Ls https://mdtero.com/install.sh | sh -s -- --agent <target> remains a legacy/bootstrap path until the public install surfaces are fully switched over.
  • npm install -g mdtero-install@0.1.8 is a legacy/bootstrap package, not the formal mdtero CLI release channel.
  • This repository already defines the intended Python package in pyproject.toml: project.name = "mdtero" and project.scripts.mdtero = "mdtero_cli:main".
  • npx mdtero-install install <target> installs the matching agent skill bundle; it is not part of the formal mdtero CLI release gate.
  • uv tool install mdtero is the formal user install path for the published Python package; that package includes the local CLI surfaces for curl_cffi acquisition, Zotero import, RAG, MCP handoff, parse, translate, status, and download.
  • OpenClaw remains separate: use clawhub install mdtero, not the install-script target list.

Maintainer-only Python package smoke before publishing:

python3.12 -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt
./.venv/bin/python -m pip wheel --no-deps -w /tmp/mdtero-wheel .
UV_TOOL_DIR=/tmp/mdtero-tools UV_TOOL_BIN_DIR=/tmp/mdtero-bin \
  uv tool install /tmp/mdtero-wheel/mdtero-*.whl --python 3.12
/tmp/mdtero-bin/mdtero --help

The smoke test must show one installed mdtero executable, and mdtero --help must list parse-files, zotero import, rag serve, and the other local runtime commands. Publishing to PyPI is a separate maintainer action and is the only formal external release step for the CLI.
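A maintainer could automate the help-text check with a few lines of Python. This is a minimal sketch: the required substrings mirror the command names in the paragraph above, and how the real help output formats them is an assumption, so adjust the markers if the CLI prints them differently. The helper names are hypothetical.

```python
import subprocess

# Command names taken from the smoke requirement above; the exact
# spelling in the real `--help` output is an assumption.
REQUIRED_COMMANDS = ("parse-files", "zotero import", "rag serve")


def missing_commands(help_text: str, required=REQUIRED_COMMANDS) -> list[str]:
    """Return the required command names that do not appear in help_text."""
    return [name for name in required if name not in help_text]


def smoke_help(executable: str = "/tmp/mdtero-bin/mdtero") -> list[str]:
    """Run `<executable> --help` and report which commands are missing."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=True
    )
    return missing_commands(result.stdout)
```

An empty list from smoke_help() means every expected command name was found in the help output.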

For maintainers, the repo now separates package verification from external release:

  • .github/workflows/package-mdtero-cli.yml builds and smokes the package without publishing.
  • .github/workflows/publish-mdtero-cli.yml is the manual Trusted Publishing path for PyPI after the mdtero project on PyPI is linked to this GitHub repository.
  • Before the first external publish, create the PyPI account/project binding for mdtero, enable Trusted Publishing for this repository/workflow, and then trigger the publish workflow manually.

Repository Map

Experimental Parsing Progress

The production parse chain is still the release gate.

In parallel, the experimental parser_v2 line under service/parser_v2 is now materially ahead on architecture and source coverage:

  • shared AST -> Markdown kernel
  • JATS, Elsevier XML, and TEI importers
  • server-side curl_cffi HTML/XML/EPUB parsing with publisher adapters and quality gates
  • local/OA EPUB parsing
  • experimental PDF -> GROBID -> TEI -> AST -> Markdown fallback
  • experimental unified upload entrypoint POST /tasks/parse-fulltext-v2 for structured full-text handoff
  • authenticated runtime diagnostics entrypoint GET /diagnostics/parser-v2/shadow for shadow-flag visibility

Mainline adoption status on 2026-03-25:

  • arxiv_native is already executed through the V2 AST/Markdown kernel on the production parse path
  • generic structured XML uploads now route through the V2 structured-XML path for supported families (Elsevier XML, JATS, TEI) even when entering from the legacy /tasks/parse-upload surface
  • remote structured-fulltext routes that land on uploaded XML parsing therefore also benefit from the same V2 normalization path
  • PDF -> GROBID -> TEI -> AST -> Markdown is now considered the only maintained scholarly PDF fallback path through the V2 upload surfaces, and should remain low-profile rather than a promoted primary route
  • non-PDF project attachments are being standardized behind a separate MarkItDown sidecar plan so generic file ingestion does not pollute the scholarly parser runtime
  • Playwright, browser extension capture, browser bridge, helper-bundle upload, and helper self-update are retired from the maintained CLI/runtime path
  • legacy XML upload wrappers are now thin delegators into service/parser_v2/uploaded_parse.py
  • production parse subprocess now enters through python -m service.parse_cli, which keeps the runtime entry stable while the underlying implementation continues migrating
  • service.parse_cli now delegates into service/parser_v2/cli.py instead of importing the root legacy parser directly
  • the arXiv runtime flow is now owned by service/parser_v2/arxiv_runtime.py; legacy arXiv compatibility code lives under service/legacy_parser/
  • AI-markdown sidecar rendering is now owned by service/parser_v2/markdown.py rather than the root parser module
  • structured XML figure/table asset localization should now be source-first:
    • importers should preserve native figure references where available
    • missing assets should be fused from publisher HTML/PDF/MinerU sources where lawful and available
  • raw full-text cache policy is now short-lived by default:
    • L1 ephemeral execution cache stays at 24h
    • L2 user-private raw cache now defaults to 7 days
    • L4 public-open raw cache now defaults to 7 days
  • source-first production rollout should prefer native/API/XML/EPUB/HTML routes before PDF fallback:
    • /tasks/parse should route DOI / URL acquisition through server/native/legal machine-friendly formats when available
    • local-only content should enter through direct upload or Zotero attachment import, not browser automation
    • CLI / signed-in frontend can inspect current connector shadow posture through GET /diagnostics/parser-v2/shadow
    • local project authority for the helper CLI now starts with mdtero init and can be inspected with mdtero status --json, which reports .mdtero/ identity, config-source precedence, and stored diagnostics without implying parse readiness
  • legacy parser compatibility code now lives under service/legacy_parser/; repository root should not host parser entrypoints
  • the current Python ownership map and archive boundary are tracked in docs/PYTHON_SURFACE_MAP.md
  • repository boundary rules for runs/, labs/, tmp/, and scripts/ are tracked in docs/RUNTIME_BOUNDARY_RULES.md

Current experimentally validated connector status:

  • promotion-ready now:
    • arxiv_native
    • europe_pmc
    • plos
    • elife_jats_xml
    • biorxiv
    • medrxiv
    • mdpi_epub_asset
    • springer_openaccess_api
    • elsevier_article_retrieval_api
    • springer_subscription_connector
    • wiley_tdm
    • taylor_francis_tdm

Important rollout interpretation:

  • promotion-ready means a connector is strong enough for a shadow / feature-flag rollout; it does not mean the connector should immediately become the new global production default
  • the current single-source summary for shadow vs default cutover is:
    • labs/vendor_promotion_validation/MATURITY_MATRIX.md
    • runs/vendor_promotion_validation/shadow-rollout.json
  • as of 2026-03-27, the practical posture is:
    • already-live V2 behavior: arxiv_native, uploaded structured XML, uploaded PDF fallback, source-first HTML/XML/EPUB routing
    • next shadow-first connectors: springer_subscription_connector, wiley_tdm, taylor_francis_tdm, springer_openaccess_api, elsevier_article_retrieval_api
    • still not valid to describe as default production automatic acquisition: Wiley browser-bridge HTML, MDPI direct server fetch, Taylor & Francis OA EPUB
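The distinction between promotion-ready, shadow-first, and already-live can be illustrated with a lookup that never maps a connector to a production default. The connector identifiers are copied from the lists above; the posture labels and the rollout_posture function are hypothetical and only sketch the interpretation, they are not read from MATURITY_MATRIX.md or shadow-rollout.json.

```python
# Connector names copied from the lists above.
PROMOTION_READY = frozenset({
    "arxiv_native", "europe_pmc", "plos", "elife_jats_xml", "biorxiv",
    "medrxiv", "mdpi_epub_asset", "springer_openaccess_api",
    "elsevier_article_retrieval_api", "springer_subscription_connector",
    "wiley_tdm", "taylor_francis_tdm",
})
NEXT_SHADOW_FIRST = frozenset({
    "springer_subscription_connector", "wiley_tdm", "taylor_francis_tdm",
    "springer_openaccess_api", "elsevier_article_retrieval_api",
})
ALREADY_LIVE = frozenset({"arxiv_native"})


def rollout_posture(connector: str) -> str:
    """Return a hypothetical posture label for a connector."""
    if connector in ALREADY_LIVE:
        return "live"
    if connector in NEXT_SHADOW_FIRST:
        return "shadow_first"
    if connector in PROMOTION_READY:
        # Promotion-ready alone only qualifies for shadow / feature-flag use.
        return "promotion_ready"
    return "not_validated"
```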

Practical acquisition interpretation:

  • Europe PMC, PLOS, eLife, Springer OA, bioRxiv, medRxiv, and MDPI EPUB are already executable open structured routes
  • Elsevier is modeled as api_first under user entitlement or authorized API environment; Mdtero does not imply public access
  • Elsevier XML uploaded or fetched through authorized acquisition lands on the V2 structured importer path on the main backend surface
  • Wiley, Springer subscription, and Taylor & Francis should prefer source-first HTML/XML/PDF routes where curl_cffi or official APIs can fetch without browser automation
  • Wiley has experimental official TDM evidence:
    • Wiley TDM PDF -> GROBID -> Markdown
    • live validation on this machine succeeded for 10.1002/er.7490, 10.1002/er.6487, and 10.1002/sam.11700
    • observed fetch time was about 2.5s - 4.1s, with end-to-end PDF -> Markdown around 8s on a warm local GROBID container
  • Browser acquisition is no longer a product route. Historical browser-extension and Playwright validation remains useful as archive evidence only; do not revive it as fallback.

Current architectural direction:

  • server should increasingly act as discovery, routing, parsing, normalization, rendering, and structured persistence
  • production acquisition should default to source-first server/native routes or explicit user-provided files/attachments
  • helper/browser-first is retired; direct server-side DOI/URL fetching with lawful machine-friendly formats is the release target when quality gates pass
  • local project ingest now has a project-owned pre-parse ledger: mdtero ingest records DOI/URL/local-file provenance into .mdtero/state/ingest-ledger.sqlite3, and mdtero papers reports honest readiness states such as metadata_only, oa_location_found, fulltext_staged, and manual_action_required without implying parse success
  • local project parse now extends that same ledger with append-only parse_attempts, and mdtero parse --json operates on existing ingested project records instead of bypassing project authority
  • local project Zotero import now supports both fixture replay and real read-only library access: mdtero zotero import --fixture tests/service/fixtures/zotero/sample-library.json --json replays fixture data, while mdtero zotero import --library-id <id> --library-type user|group --api-key <key> --json (or --local for Zotero local API) records durable Zotero mappings plus attachment discovery into the same ledger without promising sync or write-back
  • local project RAG/chat now extends the same ledger with rag_builds and rag_chunks; mdtero rag build --json indexes parsed Markdown into project-owned retrieval state and mdtero chat --json answers from local lexical evidence while preserving the shared grounded-chat response shape
  • local project dashboard now projects the same ledger through mdtero dashboard --json and plain-text mdtero dashboard, keeping planned_lane, actual_lane, parser_label, artifact_outcome, reason_code, Zotero attachment evidence, RAG readiness, and derived operator actions visible without introducing dashboard-owned state
  • the first local parse path is intentionally narrow: staged local PDFs only, routed through the uploaded-PDF adapter seam so backend-owned lane/parser/failure vocabulary remains authoritative
  • mdtero papers --json now exposes latest parse status plus durable parse history so downstream Zotero/RAG/TUI slices can consume the same local truth without reparsing CLI text
  • server-side fetch should remain an optional convenience or coverage fallback, not the primary production ingestion posture
  • PDF remains fallback, not the primary route
  • when PDF fallback is used, GROBID is the only maintained engine on the scholarly parse path
  • generic project-file fallback belongs in the isolated MarkItDown sidecar track, not in the default backend runtime dependency set
  • canonical route semantics in the experimental line are now source_first, jats_or_structured_xml_first, api_first, html_helper_first, epub_first, pdf_fallback_only, and legacy_parse
  • the planned Python import API should be a cloud parse SDK: from mdtero import Mdtero wraps hosted parse tasks and artifact download, while local parser modules remain backend internals

Primary internal references:

Grounded Chat Contract

The shared workspace Notebook rail depends on POST /threads/{thread_id}/messages as the backend SSOT for grounded project chat.

Current host posture:

  • apps/site-next consumes this same thread-message contract through host-local transport layers
  • frontend adapters normalize the payload before it reaches shared workspace components, so the backend response shape here remains the SSOT contract rather than a UI-only fork
  • grounded chat is a workspace enhancement layer over parsed Mdtero documents, not a replacement for parsing and not a cross-project global assistant

Current V1 contract:

  • request body always accepts content
  • request body may also carry:
    • scope_type: document, selection, or project
    • document_ids: selected project document ids
    • mode: grounded or synthesis
    • citation_limit: optional citation cap for the returned answer
  • backward compatibility is preserved:
    • plain { "content": "..." } still works
    • omitted scope defaults to project
    • omitted mode defaults to grounded

Current V1 response shape includes:

  • answer
  • citations
  • used_embeddings
  • retrieval_strategy
  • scope_summary

Retrieval posture on 2026-04-23:

  • default retrieval strategy is lexical_v1
  • used_embeddings stays false in the default path
  • local project chat now reuses this same contract shape through mdtero chat --json, with evidence restricted to project-owned rag_chunks
  • scope-aware retrieval is supported for:
    • single document
    • selected documents inside a project
    • whole-project search
  • message_citations remains the persistence SSOT for assistant citation rows on the backend thread route; the local helper path projects compatible citation payloads without inventing a second response contract
  • answer generation now flows through a backend adapter so provider-specific behavior can change later without changing the route contract or dropping citation traceability
  • future user-supplied LLM keys or embedding providers may sit behind this adapter, but Mdtero-owned parsing, retrieval scope, and citation payloads remain the fixed contract
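To make the lexical default and the response shape concrete, here is a toy scope-aware retriever. It is a sketch under loud assumptions: the top-level response keys, the lexical_v1 label, and used_embeddings being false come from this section, while the token-overlap scoring, the chunk dictionary keys (document_id, chunk_id, text), and the contents of citations and scope_summary are hypothetical and do not describe the real lexical_v1 implementation.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a stand-in for real lexical analysis."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_from_chunks(question, chunks, document_ids=None, citation_limit=3) -> dict:
    """Score in-scope chunks by token overlap and build a V1-shaped response."""
    query = _tokens(question)
    # Scope filter: None means whole-project search.
    in_scope = [
        c for c in chunks
        if document_ids is None or c["document_id"] in document_ids
    ]
    scored = [(len(query & _tokens(c["text"])), c) for c in in_scope]
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    top = hits[:citation_limit]
    return {
        "answer": " ".join(c["text"] for _, c in top),
        "citations": [
            {"document_id": c["document_id"], "chunk_id": c["chunk_id"], "score": score}
            for score, c in top
        ],
        "used_embeddings": False,             # default path stays lexical
        "retrieval_strategy": "lexical_v1",
        "scope_summary": {
            "document_ids": sorted(document_ids) if document_ids else None,
            "chunks_considered": len(in_scope),
        },
    }
```

Because every citation carries its document_id and chunk_id, the answer text stays traceable to the evidence even if the answer-generation step is swapped out later.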

Local Development

python3.12 -m venv .venv
source .venv/bin/activate
python3 -m unittest discover -s tests -v
python3 -m uvicorn service.main:app --reload

Python 3.12 is the intended local baseline. Parts of the service test surface use PEP 604 union syntax such as str | None, which requires Python 3.10 or newer, so virtualenvs on Python 3.9 or older fail during test collection, before the relevant backend code runs.

Install surface guidance:

  • pip install -r requirements.txt
    • full developer install; includes both runtime deps and local/shadow tooling
  • pip install -r requirements-prod.txt
    • lean deployed/runtime surface only; this is what the production image installs
    • do not add local replay / browser / scraping extras here just because they are useful on a laptop
  • pip install -r requirements-local.txt
    • local source-first and shadow add-ons such as curl_cffi, pyzotero, paperscraper, and pubmed-parser
    • these are local Python backend/tooling dependencies, not packages bundled into the npm CLI or browser extension

Typical local developer bootstrap:

pip install -r requirements.txt

Pytest-backed service suite:

pip install -r requirements-test.txt
./.venv/bin/pytest tests/service -q

Full repository pytest sweep:

./.venv/bin/pytest -q

Canonical pre-launch readiness command

The older representative journey gate entrypoint has been archived and is no longer an active backend validation surface. Historical journey-gate materials now live under archive/validation/.

Current active verification entrypoints should be taken from the live parsing and topic-batch surfaces documented elsewhere in this README rather than from the archived journey gate.

Maintenance Rules

  • keep credentials, publisher-side helpers, and operator workflows private
  • prefer tested behavior in service/ over one-off scripts
  • keep agent install docs aligned with the current beta onboarding path
  • treat this repository as the release gate for production auth, billing, parsing, translation, and helper-serving behavior
