
Content-extraction provider runtime for arcus — turn a URL or file into normalized markdown + structured metadata.


arcus-provider-runtime

The content-extraction kernel behind arcus: give it one URL or one file path, get back normalized markdown plus structured metadata. No vault, no database, no project awareness — a pure download + extraction layer you can drop into any pipeline (RAG ingest, knowledge bases, LLM context building).

Install

pip install "arcus-provider-runtime[html,pdf,office]"

Extras pull in only the heavy dependencies you need:

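| Extra  | Adds                               | For                                              |
|--------|------------------------------------|--------------------------------------------------|
| html   | playwright                         | JS-rendered pages, X.com / LinkedIn, SPA articles |
| pdf    | pymupdf4llm                        | PDF → markdown extraction                        |
| office | python-docx, python-pptx, openpyxl | DOCX / PPTX / XLSX / EPUB                        |
| all    | everything above                   |                                                  |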

The base install needs no extras and covers YouTube transcripts via yt-dlp. The HTML provider additionally needs Chromium (python -m playwright install chromium) and node on PATH (for the vendored html2md.mjs converter).
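A quick preflight check can catch a missing HTML prerequisite before the first run. The sketch below is not part of arcus; the function name `missing_html_prerequisites` is hypothetical, and it only checks that the playwright package is importable and that a node executable is on PATH. It does not verify that the Chromium browser binary has been downloaded.

```python
import importlib.util
import shutil


def missing_html_prerequisites() -> list[str]:
    """Return names of HTML-provider prerequisites that appear to be absent.

    Hypothetical helper, not an arcus API. Chromium itself is not checked.
    """
    missing = []
    # The html extra installs the playwright Python package.
    if importlib.util.find_spec("playwright") is None:
        missing.append("playwright package (install the html extra)")
    # The vendored html2md.mjs converter needs a node executable on PATH.
    if shutil.which("node") is None:
        missing.append("node on PATH")
    return missing


if __name__ == "__main__":
    for item in missing_html_prerequisites():
        print(f"missing: {item}")
```

An empty list means both checks passed; Chromium still has to be installed with the playwright command shown above.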

Use

from arcus.provider_runtime import Factory

result = Factory().run("https://example.com/article", out_dir="./out")
# result.markdown_path  → ./out/<slug>.md   (frontmatter + readable body)
# result.metadata_path  → ./out/<slug>.json (segments, timing, provenance)
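To consume the two output files downstream, something like the following may be enough. It is a minimal sketch under two assumptions that the text above does not confirm: that the frontmatter is delimited by `---` lines, and that the metadata file is plain JSON whose schema you inspect yourself. The helper names `split_frontmatter` and `load_artifact` are hypothetical.

```python
import json
from pathlib import Path


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split '---'-delimited frontmatter from the body.

    Returns (frontmatter, body). Assumes the '---' delimiter convention;
    if the text does not start with it, everything is treated as body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return frontmatter, body
    return "", text  # no closing delimiter: treat everything as body


def load_artifact(markdown_path, metadata_path):
    """Read one extracted artifact into (frontmatter, body, metadata)."""
    text = Path(markdown_path).read_text(encoding="utf-8")
    frontmatter, body = split_frontmatter(text)
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    return frontmatter, body, metadata
```

With the result object from the example above, this would be called as `load_artifact(result.markdown_path, result.metadata_path)`.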

One Factory.run() entry point dispatches to the right provider by inspecting the input. Providers live under arcus.provider_runtime.providers.<kind>/ and are individually registerable.
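Dispatch by input inspection can be pictured roughly as below. This is an illustrative sketch, not the arcus implementation: the function `pick_kind`, the suffix table, the host list, and the kind name "youtube" are all assumptions (the other kind names simply mirror the extras listed under Install).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mappings, chosen for illustration only.
SUFFIX_TO_KIND = {
    ".pdf": "pdf",
    ".docx": "office",
    ".pptx": "office",
    ".xlsx": "office",
    ".epub": "office",
    ".html": "html",
    ".htm": "html",
}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def pick_kind(source: str) -> str:
    """Guess a provider kind from a URL or file path (hypothetical rules)."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    # Known video hosts win over any suffix in the URL path.
    if is_url and (parsed.hostname or "") in YOUTUBE_HOSTS:
        return "youtube"
    # For URLs look at the path component only; for files use the whole string.
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in SUFFIX_TO_KIND:
        return SUFFIX_TO_KIND[suffix]
    if is_url:
        return "html"  # fall back to the HTML provider for generic web pages
    raise ValueError(f"no provider kind for input: {source!r}")
```

Checking the host before the suffix keeps a video URL from being misrouted by a stray extension in its path, and the HTML fallback applies only to real http(s) URLs so that unknown local files fail loudly.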

What it deliberately does NOT do

arcus has zero awareness of any consuming app's storage, topics, or wiki. One input in, one extracted artifact out. Vault-aware orchestration (dedup, cross-referencing, synthesis) belongs in the consumer, not here.
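Because orchestration is left to the consumer, deduplication would sit in a thin wrapper around the extraction call. The sketch below shows one way to do that; `ingest_unique` is a hypothetical consumer-side helper, and the extractor is passed in as a plain callable so the wrapper stays independent of arcus.

```python
from typing import Callable, Iterable


def ingest_unique(
    sources: Iterable[str],
    extract: Callable[[str], object],
) -> dict[str, object]:
    """Run extract once per distinct source, keeping first-seen order.

    Hypothetical consumer-side helper: dedup here is exact string match
    after stripping whitespace, nothing smarter.
    """
    results: dict[str, object] = {}
    for source in sources:
        key = source.strip()
        if key in results:
            continue  # already extracted; dedup is the consumer's job
        results[key] = extract(key)
    return results
```

With the entry point shown under Use, the extractor argument could be `lambda s: Factory().run(s, out_dir="./out")`.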

License

MIT © 2026 POLLEO.AI
