Mdtero local CLI for source-first paper acquisition, parsing, Zotero import, and RAG handoff.

Maintainers

mdtero

Mdtero Private Backend

This repository is the production backend source of truth for Mdtero.

Public beta messaging lives in the public repos. This repository stays focused on:

FastAPI services and task orchestration
auth, API keys, billing, and usage accounting
parsing, translation, artifact packaging, and retrieval adapters
source-first agent skill documents served from api.mdtero.com

Public / Private Split

mdtero: website, dashboard, guide, API docs, and public extension source/install guidance
mdtero-backend: backend-only services, deployment assets, source-routing CLI surfaces, and private operator docs
Current first-party extension scope is limited to web OAuth login/session handoff, parse, translate, PDF/EPUB upload, artifact download, and install guidance; do not add extension UI/source here.

Deployment Truth

Production mdtero-api is served from OCI SG AMD-2 behind Caddy at https://api.mdtero.com; pushes to main no longer deploy to Cloud Run automatically.
Deploy with the manual GitHub Actions workflow in .github/workflows/deploy-backend.yml. It runs tests, builds and pushes the production image to GHCR as ghcr.io/jonbinc/mdtero-api, executes the deploy job on the AMD-2 self-hosted runner, updates /opt/mdtero-api/docker-compose.yml, and smoke-tests production. See docs/ops/amd2-backend-deploy.md.
cloudbuild.yaml is retained as a rollback-only Cloud Run definition. Do not attach it to an automatic Cloud Build trigger unless intentionally rolling production back to Cloud Run.
Local filesystem storage on AMD-2 is the production default artifact backend: STORAGE_BACKEND=local, LOCAL_STORAGE_DIR=/app/storage, host path /opt/mdtero-api/storage, container path /app/storage.
GCS artifact storage remains rollback-only / legacy compatibility; do not treat GCS_BUCKET_NAME or GOOGLE_APPLICATION_CREDENTIALS as the default AMD-2 production path.
Live discovery search is production-enabled only after the runtime secret store contains the provider key; bind OPENALEX_API_KEY and set MDTERO_ENABLE_DISCOVERY_SEARCH=true in the same rollout. Do not put the key value in source, and do not reference a missing secret from the default deploy path.
Standalone GROBID is retired and is not part of the GitHub/Cloud Run production deployment chain; docs/ops/standalone-grobid-cloud-run.md is kept only as an archival note.
Direct upload-PDF / cloud PDF handling is the maintained PDF path. The longer-term direction is to move the remaining scholarly PDF credentials and routing into MinerU-managed production configuration rather than bringing back a separate GROBID deploy.
This repository is the only backend code SSOT; do not recreate a shadow backend under mdtero/private/backend or any other parallel path.
Do not assume a push to mdtero deploys the public backend API.
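
The storage and discovery rules above can be read as a small settings check. The following is a minimal sketch, not the backend's actual configuration loader: the function name resolve_runtime_settings and the returned dictionary keys are hypothetical, and falling back to the documented values when a variable is unset is an assumption. Only the environment variable names and the AMD-2 defaults come from the list above.

```python
from typing import Mapping


def resolve_runtime_settings(env: Mapping[str, str]) -> dict:
    """Hypothetical helper: derive storage and discovery posture from env vars."""
    # Assumption: an unset variable falls back to the documented AMD-2 default.
    backend = env.get("STORAGE_BACKEND", "local")
    storage_dir = env.get("LOCAL_STORAGE_DIR", "/app/storage")

    # Discovery search needs BOTH the flag and the provider key in the same rollout.
    flag_on = env.get("MDTERO_ENABLE_DISCOVERY_SEARCH", "").lower() == "true"
    has_key = bool(env.get("OPENALEX_API_KEY"))

    return {
        "storage_backend": backend,
        "local_storage_dir": storage_dir if backend == "local" else None,
        "discovery_search": flag_on and has_key,
    }
```

Passing os.environ as env would apply the same check to a running process. Setting the flag without the key leaves discovery search off, which mirrors the rule that the default deploy path must not reference a missing secret.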

When the public install flow changes, update the matching skill or install guide here in the same round.

Install / Bootstrap Truth

The formal mdtero CLI release path is PyPI + uv tool install mdtero.
curl -Ls https://mdtero.com/install.sh | sh -s -- --agent <target> remains a legacy/bootstrap path until the public install surfaces are fully switched over.
npm install -g mdtero-install@0.1.8 is a legacy/bootstrap package, not the formal mdtero CLI release channel.
This repository already defines the intended Python package in pyproject.toml: project.name = "mdtero" and project.scripts.mdtero = "mdtero_cli:main".
npx mdtero-install install <target> installs the matching agent skill bundle; it is not part of the formal mdtero CLI release gate.
uv tool install mdtero is the formal user install path for the published Python package; that package includes the local CLI surfaces for curl_cffi acquisition, Zotero import, RAG, MCP handoff, parse, translate, status, and download.
OpenClaw remains separate: use clawhub install mdtero, not the install-script target list.

Maintainer-only Python package smoke before publishing:

python3.12 -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt
./.venv/bin/python -m pip wheel --no-deps -w /tmp/mdtero-wheel .
UV_TOOL_DIR=/tmp/mdtero-tools UV_TOOL_BIN_DIR=/tmp/mdtero-bin \
  uv tool install /tmp/mdtero-wheel/mdtero-*.whl --python 3.12
/tmp/mdtero-bin/mdtero --help

The smoke must show exactly one installed mdtero executable, and mdtero --help must list parse-files, zotero import, rag serve, and the other local runtime commands. Publishing to PyPI is a separate maintainer action and is the only formal external release step for the CLI.
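
The help-output check in the smoke could be automated along these lines. This is a sketch under assumptions: the real layout of mdtero --help is not shown in this document, so the code only assumes that top-level command names (parse-files, zotero, rag) appear somewhere in the help text as separate tokens. The function names are hypothetical.

```python
import re
import subprocess

# Top-level names taken from the commands the smoke is expected to list.
REQUIRED_COMMANDS = ("parse-files", "zotero", "rag")


def missing_commands(help_text: str, required=REQUIRED_COMMANDS) -> list[str]:
    """Return required command names that never appear as a token in the help text."""
    tokens = set(re.findall(r"[A-Za-z0-9][A-Za-z0-9_-]*", help_text))
    return [name for name in required if name not in tokens]


def read_help(executable: str = "/tmp/mdtero-bin/mdtero") -> str:
    """Run `<executable> --help` and return its standard output."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

After the uv tool install step, missing_commands(read_help()) returning an empty list would indicate that the expected command names are present.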

For maintainers, the repo now separates package verification from external release:

.github/workflows/package-mdtero-cli.yml builds and smokes the package without publishing.
.github/workflows/publish-mdtero-cli.yml is the manual Trusted Publishing path for PyPI after the mdtero project on PyPI is linked to this GitHub repository.
Before the first external publish, create the PyPI account/project binding for mdtero, enable Trusted Publishing for this repository/workflow, and then trigger the publish workflow manually.

Repository Map

service: app routes, orchestration, auth, billing, and provider adapters
tests/service: backend test coverage
service/legacy_parser: quarantined compatibility parser code that no longer owns runtime design
tests/legacy_parser: regression coverage for compatibility parser code
skills: install and workflow documents served to agents
scripts: operator and migration utilities
docs: internal product and engineering notes
- docs/partner: partner-facing API, capability, and maturity package
- docs/feedback_audit_ssot.md: quality feedback review SSOT for human and AI-assisted auditing
- docs/architecture/backend-ssot-refactor-blueprint.md: backend-first SSOT and readability refactor blueprint
- docs/superpowers/specs/2026-05-06-cloud-parse-sdk-design.md: planned Python import API boundary for cloud parse tasks
- docs/PYTHON_SURFACE_MAP.md: runtime-vs-compatibility ownership map for Python files
- docs/RUNTIME_BOUNDARY_RULES.md: repository boundary rules for runtime owners vs. labs, runs, tmp, and scripts
- docs/SCRIPT_SURFACE_INDEX.md: maintained script-bucket ownership index

Experimental Parsing Progress

The production parse chain is still the release gate.

In parallel, the experimental parser_v2 line under service/parser_v2 is now materially ahead on architecture and source coverage:

shared AST -> Markdown kernel
JATS, Elsevier XML, and TEI importers
server-side curl_cffi HTML/XML/EPUB parsing with publisher adapters and quality gates
local/OA EPUB parsing
experimental PDF -> GROBID -> TEI -> AST -> Markdown fallback
experimental unified upload entrypoint POST /tasks/parse-fulltext-v2 for structured full-text handoff
authenticated runtime diagnostics entrypoint GET /diagnostics/parser-v2/shadow for shadow-flag visibility

Mainline adoption status on 2026-03-25:

arxiv_native is already executed through the V2 AST/Markdown kernel on the production parse path
generic structured XML uploads now route through the V2 structured-XML path for supported families (Elsevier XML, JATS, TEI) even when entering from the legacy /tasks/parse-upload surface
remote structured-fulltext routes that land on uploaded XML parsing therefore also benefit from the same V2 normalization path
PDF -> GROBID -> TEI -> AST -> Markdown is now considered the only maintained scholarly PDF fallback path through the V2 upload surfaces, and should remain low-profile rather than a promoted primary route
non-PDF project attachments are being standardized behind a separate MarkItDown sidecar plan so generic file ingestion does not pollute the scholarly parser runtime
Playwright, browser extension capture, browser bridge, helper-bundle upload, and helper self-update are retired from the maintained CLI/runtime path
legacy XML upload wrappers are now thin delegators into service/parser_v2/uploaded_parse.py
production parse subprocess now enters through python -m service.parse_cli, which keeps the runtime entry stable while the underlying implementation continues migrating
service.parse_cli now delegates into service/parser_v2/cli.py instead of importing the root legacy parser directly
the arXiv runtime flow is now owned by service/parser_v2/arxiv_runtime.py; legacy arXiv compatibility code lives under service/legacy_parser/
AI-markdown sidecar rendering is now owned by service/parser_v2/markdown.py rather than the root parser module
structured XML figure/table asset localization should now be source-first:
- importers should preserve native figure references where available
- missing assets should be fused from publisher HTML/PDF/MinerU sources where lawful and available
raw full-text cache policy is now short-lived by default:
- L1 ephemeral execution cache stays at 24h
- L2 user-private raw cache now defaults to 7 days
- L4 public-open raw cache now defaults to 7 days
source-first production rollout should prefer native/API/XML/EPUB/HTML routes before PDF fallback:
- /tasks/parse should route DOI / URL acquisition through server/native/legal machine-friendly formats when available
- local-only content should enter through direct upload or Zotero attachment import, not browser automation
- CLI / signed-in frontend can inspect current connector shadow posture through GET /diagnostics/parser-v2/shadow
- local project authority for the helper CLI now starts with mdtero init and can be inspected with mdtero status --json, which reports .mdtero/ identity, config-source precedence, and stored diagnostics without implying parse readiness
legacy parser compatibility code now lives under service/legacy_parser/; repository root should not host parser entrypoints
the current Python ownership map and archive boundary are tracked in docs/PYTHON_SURFACE_MAP.md
repository boundary rules for runs/, labs/, tmp/, and scripts/ are tracked in docs/RUNTIME_BOUNDARY_RULES.md

Current experimentally validated connector status:

promotion-ready now:
- arxiv_native
- europe_pmc
- plos
- elife_jats_xml
- biorxiv
- medrxiv
- mdpi_epub_asset
- springer_openaccess_api
- elsevier_article_retrieval_api
- springer_subscription_connector
- wiley_tdm
- taylor_francis_tdm

Important rollout interpretation:

promotion-ready means a connector is strong enough for shadow or feature-flag rollout, not that it should immediately become the new global production default
the current single-source summary for shadow vs default cutover is:
- labs/vendor_promotion_validation/MATURITY_MATRIX.md
- runs/vendor_promotion_validation/shadow-rollout.json
as of 2026-03-27, the practical posture is:
- already-live V2 behavior: arxiv_native, uploaded structured XML, uploaded PDF fallback, source-first HTML/XML/EPUB routing
- next shadow-first connectors: springer_subscription_connector, wiley_tdm, taylor_francis_tdm, springer_openaccess_api, elsevier_article_retrieval_api
- still not valid to describe as default production automatic acquisition: Wiley browser-bridge HTML, MDPI direct server fetch, Taylor & Francis OA EPUB
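
The difference between promotion-ready, shadow-first, and already-live can be sketched as a lookup. The connector names below are copied from the lists above; the Posture enum and both function names are hypothetical, and the authoritative data remains the maturity matrix and shadow-rollout files named above.

```python
from enum import Enum


class Posture(Enum):
    LIVE = "already-live"
    SHADOW_FIRST = "shadow-first"
    PROMOTION_READY = "promotion-ready"
    UNKNOWN = "unknown"


# Connector names as listed in this README (postures as of 2026-03-27).
LIVE = {"arxiv_native"}
SHADOW_FIRST = {
    "springer_subscription_connector", "wiley_tdm", "taylor_francis_tdm",
    "springer_openaccess_api", "elsevier_article_retrieval_api",
}
PROMOTION_READY = {
    "arxiv_native", "europe_pmc", "plos", "elife_jats_xml", "biorxiv", "medrxiv",
    "mdpi_epub_asset", "springer_openaccess_api", "elsevier_article_retrieval_api",
    "springer_subscription_connector", "wiley_tdm", "taylor_francis_tdm",
}


def posture(connector: str) -> Posture:
    """Return the most advanced rollout posture recorded for a connector."""
    if connector in LIVE:
        return Posture.LIVE
    if connector in SHADOW_FIRST:
        return Posture.SHADOW_FIRST
    if connector in PROMOTION_READY:
        return Posture.PROMOTION_READY
    return Posture.UNKNOWN


def is_live_on_production_path(connector: str) -> bool:
    """Promotion-ready alone does not mean live; only LIVE connectors qualify."""
    return posture(connector) is Posture.LIVE
```

For example, posture("plos") is Posture.PROMOTION_READY while is_live_on_production_path("plos") is False, which is the distinction the rollout interpretation draws.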

Practical acquisition interpretation:

Europe PMC, PLOS, eLife, Springer OA, bioRxiv, medRxiv, and MDPI EPUB are already executable open structured routes
Elsevier is modeled as api_first under user entitlement or authorized API environment; Mdtero does not imply public access
Elsevier XML uploaded or fetched through authorized acquisition lands on the V2 structured importer path on the main backend surface
Wiley, Springer subscription, and Taylor & Francis should prefer source-first HTML/XML/PDF routes where curl_cffi or official APIs can fetch without browser automation
Wiley has experimental official TDM evidence:
- Wiley TDM PDF -> GROBID -> Markdown
- live validation on this machine succeeded for 10.1002/er.7490, 10.1002/er.6487, and 10.1002/sam.11700
- observed fetch time was about 2.5 s to 4.1 s, with end-to-end PDF -> Markdown around 8 s on a warm local GROBID container
Browser acquisition is no longer a product route. Historical browser-extension and Playwright validation remains useful as archive evidence only; do not revive it as fallback.

Current architectural direction:

server should increasingly act as discovery, routing, parsing, normalization, rendering, and structured persistence
production acquisition should default to source-first server/native routes or explicit user-provided files/attachments
helper/browser-first is retired; direct server-side DOI/URL fetching with lawful machine-friendly formats is the release target when quality gates pass
local project ingest now has a project-owned pre-parse ledger: mdtero ingest records DOI/URL/local-file provenance into .mdtero/state/ingest-ledger.sqlite3, and the mdtero papers command reports honest readiness states such as metadata_only, oa_location_found, fulltext_staged, and manual_action_required without implying parse success
local project parse now extends that same ledger with append-only parse_attempts, and mdtero parse --json operates on existing ingested project records instead of bypassing project authority
local project Zotero import now supports both fixture replay and real read-only library access: mdtero zotero import --fixture tests/service/fixtures/zotero/sample-library.json --json replays fixture data, while mdtero zotero import --library-id <id> --library-type user|group --api-key <key> --json (or --local for Zotero local API) records durable Zotero mappings plus attachment discovery into the same ledger without promising sync or write-back
local project RAG/chat now extends the same ledger with rag_builds and rag_chunks; mdtero rag build --json indexes parsed Markdown into project-owned retrieval state and mdtero chat --json answers from local lexical evidence while preserving the shared grounded-chat response shape
local project dashboard now projects the same ledger through mdtero dashboard --json and plain-text mdtero dashboard, keeping planned_lane, actual_lane, parser_label, artifact_outcome, reason_code, Zotero attachment evidence, RAG readiness, and derived operator actions visible without introducing dashboard-owned state
the first local parse path is intentionally narrow: staged local PDFs only, routed through the uploaded-PDF adapter seam so backend-owned lane/parser/failure vocabulary remains authoritative
mdtero papers --json now exposes latest parse status plus durable parse history so downstream Zotero/RAG/TUI slices can consume the same local truth without reparsing CLI text
server-side fetch should remain an optional convenience or coverage fallback, not the primary production ingestion posture
PDF remains fallback, not the primary route
when PDF fallback is used, GROBID is the only maintained engine on the scholarly parse path
generic project-file fallback belongs in the isolated MarkItDown sidecar track, not in the default backend runtime dependency set
canonical route semantics in the experimental line are now source_first, jats_or_structured_xml_first, api_first, html_helper_first, epub_first, pdf_fallback_only, and legacy_parse
the planned Python import API should be a cloud parse SDK: from mdtero import Mdtero wraps hosted parse tasks and artifact download, while local parser modules remain backend internals
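
The canonical route semantics and the "PDF remains fallback" rule can be illustrated with an enum and a format picker. This is a sketch, not the router in service/parser_v2: the class name RouteSemantics, the tuple FORMAT_PREFERENCE, and the function pick_format are hypothetical, and the relative order of the non-PDF formats is an assumption taken from the order in which this README lists them. Only the seven route names and the PDF-last rule come from the text.

```python
from enum import Enum
from typing import Iterable, Optional


class RouteSemantics(str, Enum):
    """The seven route labels named in the experimental line."""
    SOURCE_FIRST = "source_first"
    JATS_OR_STRUCTURED_XML_FIRST = "jats_or_structured_xml_first"
    API_FIRST = "api_first"
    HTML_HELPER_FIRST = "html_helper_first"
    EPUB_FIRST = "epub_first"
    PDF_FALLBACK_ONLY = "pdf_fallback_only"
    LEGACY_PARSE = "legacy_parse"


# Assumed preference order; the only firm rule from the text is that PDF comes last.
FORMAT_PREFERENCE = ("native", "api", "xml", "epub", "html", "pdf")


def pick_format(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred format on offer, or None if nothing is usable."""
    offered = set(available)
    for fmt in FORMAT_PREFERENCE:
        if fmt in offered:
            return fmt
    return None
```

With this ordering, pick_format(["pdf", "html"]) selects "html", and "pdf" is chosen only when no other listed format is available.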

Grounded Chat Contract

The shared workspace Notebook rail depends on POST /threads/{thread_id}/messages as the backend SSOT for grounded project chat.

Current host posture:

apps/site-next consumes this same thread-message contract through host-local transport layers
frontend adapters normalize the payload before it reaches shared workspace components, so the backend response shape here remains the SSOT contract rather than a UI-only fork
grounded chat is a workspace enhancement layer over parsed Mdtero documents, not a replacement for parsing and not a cross-project global assistant

Current V1 contract:

request body always accepts content
request body may also carry:
- scope_type: document, selection, or project
- document_ids: selected project document ids
- mode: grounded or synthesis
- citation_limit: optional citation cap for the returned answer
backward compatibility is preserved:
- plain { "content": "..." } still works
- omitted scope defaults to project
- omitted mode defaults to grounded
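
A client-side view of the V1 request contract might look like the following sketch. The field names, allowed values, and defaults come from the list above; the function names build_message_body and effective_request are hypothetical, and the validation shown is an illustration rather than the backend's own validation.

```python
from typing import Optional, Sequence

ALLOWED_SCOPES = {"document", "selection", "project"}
ALLOWED_MODES = {"grounded", "synthesis"}


def build_message_body(
    content: str,
    scope_type: Optional[str] = None,
    document_ids: Optional[Sequence[str]] = None,
    mode: Optional[str] = None,
    citation_limit: Optional[int] = None,
) -> dict:
    """Build a request body; optional fields are sent only when provided."""
    if scope_type is not None and scope_type not in ALLOWED_SCOPES:
        raise ValueError(f"unsupported scope_type: {scope_type!r}")
    if mode is not None and mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")

    body: dict = {"content": content}
    if scope_type is not None:
        body["scope_type"] = scope_type
    if document_ids is not None:
        body["document_ids"] = list(document_ids)
    if mode is not None:
        body["mode"] = mode
    if citation_limit is not None:
        body["citation_limit"] = citation_limit
    return body


def effective_request(body: dict) -> dict:
    """Apply the documented defaults: scope 'project', mode 'grounded'."""
    return {
        **body,
        "scope_type": body.get("scope_type", "project"),
        "mode": body.get("mode", "grounded"),
    }
```

A plain build_message_body("What does the paper conclude?") yields only the content key, which is the backward-compatible form the contract preserves.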

Current V1 response shape includes:

answer
citations
used_embeddings
retrieval_strategy
scope_summary
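
A consumer can guard against contract drift by checking that these five fields are present. The sketch below checks names only, because this README does not specify the field types; the function name missing_response_fields is hypothetical.

```python
# Field names as listed for the V1 response shape.
RESPONSE_FIELDS = (
    "answer",
    "citations",
    "used_embeddings",
    "retrieval_strategy",
    "scope_summary",
)


def missing_response_fields(payload: dict) -> list[str]:
    """Return the documented V1 fields that are absent from a response payload."""
    return [name for name in RESPONSE_FIELDS if name not in payload]
```

Extra keys in the payload are ignored, since the response shape is described as including these fields rather than being limited to them.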

Retrieval posture on 2026-04-23:

default retrieval strategy is lexical_v1
used_embeddings stays false in the default path
local project chat now reuses this same contract shape through mdtero chat --json, with evidence restricted to project-owned rag_chunks
scope-aware retrieval is supported for:
- single document
- selected documents inside a project
- whole-project search
message_citations remains the persistence SSOT for assistant citation rows on the backend thread route; the local helper path projects compatible citation payloads without inventing a second response contract
answer generation now flows through a backend adapter so provider-specific behavior can change later without changing the route contract or dropping citation traceability
future user-supplied LLM keys or embedding providers may sit behind this adapter, but Mdtero-owned parsing, retrieval scope, and citation payloads remain the fixed contract
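
To make scope-aware lexical retrieval concrete, here is a toy example. It is not Mdtero's lexical_v1 implementation, which this README does not describe: the Chunk type, the token-overlap scoring, and the retrieve function are all hypothetical stand-ins that only show how a document scope can restrict which chunks are searched.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Chunk:
    document_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(
    query: str,
    chunks: Iterable[Chunk],
    document_ids: Optional[set[str]] = None,
    limit: int = 3,
) -> list[Chunk]:
    """Rank in-scope chunks by shared-token count; None scope means whole project."""
    wanted = tokenize(query)
    in_scope = [
        c for c in chunks if document_ids is None or c.document_id in document_ids
    ]
    scored = [(len(wanted & tokenize(c.text)), c) for c in in_scope]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [chunk for _, chunk in hits[:limit]]
```

Passing a single id in document_ids corresponds to the single-document scope, several ids to a selection, and None to whole-project search.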

Local Development

python3.12 -m venv .venv
source .venv/bin/activate
python3 -m unittest discover -s tests -v
python3 -m uvicorn service.main:app --reload

Python 3.12 is the intended local baseline. Parts of the service test surface now use modern union syntax such as str | None, which requires Python 3.10 or newer, so virtualenvs pinned to Python 3.9 will fail during collection, before the relevant backend code even runs.
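
A quick interpreter check can surface this problem before test collection starts. This is a minimal sketch; the function name supports_union_syntax is hypothetical and the repository may already enforce the version elsewhere.

```python
import sys

# PEP 604 union syntax (for example `str | None`) works at runtime from Python 3.10.
MIN_VERSION = (3, 10)


def supports_union_syntax(version_info=sys.version_info) -> bool:
    """Return True when the interpreter can evaluate `X | Y` type unions."""
    return tuple(version_info[:2]) >= MIN_VERSION
```

Calling supports_union_syntax() inside the activated virtualenv reports whether the active interpreter meets the minimum; the intended baseline remains Python 3.12.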

Install surface guidance:

pip install -r requirements.txt
- full developer install; includes both runtime deps and local/shadow tooling
pip install -r requirements-prod.txt
- lean deployed/runtime surface only; this is what the production image installs
- do not add local replay / browser / scraping extras here just because they are useful on a laptop
pip install -r requirements-local.txt
- local source-first and shadow add-ons such as curl_cffi, pyzotero, paperscraper, and pubmed-parser
- these are local Python backend/tooling dependencies, not packages bundled into the npm CLI or browser extension

Typical local developer bootstrap:

pip install -r requirements.txt

Pytest-backed service suite:

pip install -r requirements-test.txt
./.venv/bin/pytest tests/service -q

Full repository pytest sweep:

./.venv/bin/pytest -q

Canonical pre-launch readiness command

The older representative journey gate entrypoint has been archived and is no longer an active backend validation surface. Historical journey-gate materials now live under archive/validation/.

Current active verification entrypoints should be taken from the live parsing and topic-batch surfaces documented elsewhere in this README rather than from the archived journey gate.

Maintenance Rules

keep credentials, publisher-side helpers, and operator workflows private
prefer tested behavior in service/ over one-off scripts
keep agent install docs aligned with the current beta onboarding path
treat this repository as the release gate for production auth, billing, parsing, translation, and helper-serving behavior

Release history

2026.4.26.3: May 14, 2026
2026.4.26.2: May 14, 2026
2026.4.26.1: May 14, 2026 (this version)

Download files

Source Distribution

mdtero-2026.4.26.1.tar.gz (468.7 kB)

Uploaded May 14, 2026 Source

Built Distribution

mdtero-2026.4.26.1-py3-none-any.whl (585.0 kB)

Uploaded May 14, 2026 Python 3

File details

Details for the file mdtero-2026.4.26.1.tar.gz.

File metadata

Download URL: mdtero-2026.4.26.1.tar.gz
Upload date: May 14, 2026
Size: 468.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mdtero-2026.4.26.1.tar.gz
Algorithm	Hash digest
SHA256	`7bd37b40e5faf01809e1d7a40d8c0f220e72e06d427b539c77be6de87c2d65c3`
MD5	`a46f32d8d43a6348f61e984c70c5587b`
BLAKE2b-256	`347d6981857d83f268dff79fad478dba9639c273644e642378c494ce125c168f`
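
A downloaded sdist can be checked against the SHA256 digest above with the standard library. This is a sketch: the function names sha256_of and matches_expected are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for mdtero-2026.4.26.1.tar.gz.
EXPECTED_SHA256 = "7bd37b40e5faf01809e1d7a40d8c0f220e72e06d427b539c77be6de87c2d65c3"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's SHA256 digest with the published value."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected(Path("mdtero-2026.4.26.1.tar.gz")) should return True for an intact download of the source distribution.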

Provenance

The following attestation bundles were made for mdtero-2026.4.26.1.tar.gz:

Publisher: publish-mdtero-cli.yml on JonbinC/mdtero-backend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mdtero-2026.4.26.1.tar.gz
- Subject digest: 7bd37b40e5faf01809e1d7a40d8c0f220e72e06d427b539c77be6de87c2d65c3
- Sigstore transparency entry: 1537562197
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: JonbinC/mdtero-backend@929ba3ede21062cbb009c3da765d457afa4864cf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/JonbinC
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-mdtero-cli.yml@929ba3ede21062cbb009c3da765d457afa4864cf
- Trigger Event: workflow_dispatch

File details

Details for the file mdtero-2026.4.26.1-py3-none-any.whl.

File metadata

Download URL: mdtero-2026.4.26.1-py3-none-any.whl
Upload date: May 14, 2026
Size: 585.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mdtero-2026.4.26.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d806d363f58fcb7b19d7f2d4e585c6b4dcc416460536e540368daed9590b64d4`
MD5	`ccbbf288f85a511f7f763992a6a6a7a8`
BLAKE2b-256	`f191ab520b96bc7d8f58edd28902b1db33cbe7818deeb1b87f9c86e257ad3ef7`

Provenance

The following attestation bundles were made for mdtero-2026.4.26.1-py3-none-any.whl:

Publisher: publish-mdtero-cli.yml on JonbinC/mdtero-backend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mdtero-2026.4.26.1-py3-none-any.whl
- Subject digest: d806d363f58fcb7b19d7f2d4e585c6b4dcc416460536e540368daed9590b64d4
- Sigstore transparency entry: 1537562239
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: JonbinC/mdtero-backend@929ba3ede21062cbb009c3da765d457afa4864cf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/JonbinC
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-mdtero-cli.yml@929ba3ede21062cbb009c3da765d457afa4864cf
- Trigger Event: workflow_dispatch
