Local-first arXiv metadata broker with shared cache, queue, and durable rate limiting.

Huldra

Huldra is a local arXiv metadata broker for one machine. Programs that need arXiv papers can ask Huldra for metadata instead of each calling arXiv directly. Huldra shares a SQLite cache, request queue, durable rate limiter, cooldown state, and upstream lease across those programs.

Huldra is an independent package and CLI. It is not a plugin for another project, and it does not depend on Recoleta.

Install

Install the published package:

pip install huldra-arxiv
huldra --help

For local development:

uv sync --dev
uv run huldra --help

The PyPI package name is huldra-arxiv. The Python package and CLI command are still named huldra.

The default database is:

~/.local/share/huldra/huldra.db

Override it per command with --db PATH or with HULDRA_DB_PATH.
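The lookup can be pictured with a short Python sketch. The helper name resolve_db_path is hypothetical, and the precedence shown (an explicit --db value first, then HULDRA_DB_PATH, then the default) is an assumption rather than documented behavior:

```python
from __future__ import annotations

import os
from pathlib import Path

DEFAULT_DB_PATH = "~/.local/share/huldra/huldra.db"


def resolve_db_path(cli_db: str | None = None, env: dict | None = None) -> Path:
    """Pick a database path (hypothetical helper; precedence is assumed)."""
    env = os.environ if env is None else env
    # Assumption: an explicit --db value wins over HULDRA_DB_PATH.
    chosen = cli_db or env.get("HULDRA_DB_PATH") or DEFAULT_DB_PATH
    return Path(chosen).expanduser()


print(resolve_db_path(None, {"HULDRA_DB_PATH": "/tmp/huldra-test.db"}))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.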

Run Locally

Initialize a store:

uv run huldra store init --db ~/.local/share/huldra/huldra.db

Start the local HTTP API:

uv run huldra daemon --db ~/.local/share/huldra/huldra.db --host 127.0.0.1 --port 8765

Run a foreground worker in a separate terminal:

uv run huldra worker --db ~/.local/share/huldra/huldra.db --poll-interval-seconds 300 --json

Check status:

uv run huldra status --db ~/.local/share/huldra/huldra.db --json

Status includes queue depth, cache totals, durable upstream 429 totals, cooldown state, worker heartbeat, worker next wake, and the last worker error.

The API binds to 127.0.0.1 by default. Do not expose it to a public network without a reverse proxy and authentication.

CLI Query

Submit a query without waiting for the worker:

uv run huldra query \
  --db ~/.local/share/huldra/huldra.db \
  --client-id demo \
  --search-query 'cat:cs.AI AND all:agent' \
  --max-results 50 \
  --json

Read a completed result:

uv run huldra result --db ~/.local/share/huldra/huldra.db --cache-key KEY --json

huldra result is a raw cache inspection command. It reports whether the stored cache entry is readable and returns the cached papers when it can. It does not apply a caller's analysis_ready readiness policy to the cached data.

Look up one cached paper:

uv run huldra paper --db ~/.local/share/huldra/huldra.db --arxiv-id 2401.00001 --json

Sync one UTC day of submitted-date results. Pass --wait to run the worker path inline and block until it finishes. By default this completes one legacy search slice and reports coverage_status="slice" even when arXiv reports that more results exist:

uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --wait \
  --json

Fetch every legacy search page for a bounded window by opting into complete window mode:

uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --mode complete-window \
  --wait \
  --json

Backfill daily submitted-date windows:

uv run huldra backfill \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI' \
  --start-date 2026-05-01 \
  --end-date 2026-05-20 \
  --max-results 60 \
  --json
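To make the shape of a daily backfill concrete, here is a minimal Python sketch that splits a date range into one UTC window per day. The function name daily_windows is hypothetical, and treating --end-date as inclusive is an assumption; Huldra's own window construction may differ:

```python
from datetime import date, datetime, timedelta, timezone


def daily_windows(start: date, end: date):
    """Yield (window_start, window_end) UTC pairs, one per day.

    Assumption: the end date is inclusive.
    """
    day = start
    while day <= end:
        begin = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        yield begin, begin + timedelta(days=1)
        day += timedelta(days=1)


windows = list(daily_windows(date(2026, 5, 1), date(2026, 5, 20)))
print(len(windows))  # 20
```

Each pair starts at UTC midnight, so the bounds carry no seconds or microseconds.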

Run an OAI-PMH harvest for complete or category-scoped metadata sync:

uv run huldra harvest oai \
  --db ~/.local/share/huldra/huldra.db \
  --metadata-prefix arXiv \
  --set cs:cs:AI \
  --mode incremental \
  --json

Python Client

from huldra.client import HuldraClient

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    result = client.ensure_search(
        search_query="cat:cs.AI AND all:agent",
        max_results=50,
        wait=True,
    )
    print(result.status, result.papers_total)

For Recoleta-style pre-syncs, call the maintenance surface instead of shelling out to the CLI:

# datetime.UTC requires Python 3.11 or later.
from datetime import UTC, datetime, timedelta

from huldra.client import HuldraClient
from huldra.models import ArxivRequest, CachePolicy, ReadinessMode

# Window bounds are UTC and minute-aligned (no seconds or microseconds).
day = datetime(2026, 5, 20, tzinfo=UTC)
request = ArxivRequest(
    client_id="recoleta:embodied_ai",
    search_query="cat:cs.AI",
    submitted_start=day,
    submitted_end=day + timedelta(days=1),
    max_results=60,
    cache_policy=CachePolicy.CACHE_ONLY,
    readiness=ReadinessMode.ANALYSIS_READY,
)

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    summary = client.sync_windows([request], wait=True, wait_timeout_seconds=30)
    print(summary.completed_windows_total, summary.upstream_requests_total)

Maintenance completion means the raw cache is readable. The per-request serving_status still tells you whether the same cache is currently accepted by the request's readiness mode. For legacy search, check coverage_status, completed_slices_total, pages_total, and pages_completed_total before treating a window as complete.
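A caller-side guard for that check might look like the sketch below. It works on a plain dict with the field names listed above. The helper name looks_complete is hypothetical, and the rule itself (anything reporting "slice" is partial, and all pages must be done) is an assumption, not Huldra's documented contract:

```python
def looks_complete(window: dict) -> bool:
    """Conservative completeness check for a legacy search window (assumed rule)."""
    # A default sync reports coverage_status="slice" for a single slice.
    if window.get("coverage_status") == "slice":
        return False
    pages_total = window.get("pages_total")
    pages_done = window.get("pages_completed_total")
    # Missing counters are treated as "unknown", which is not complete.
    if pages_total is None or pages_done is None:
        return False
    return pages_done >= pages_total


print(looks_complete({"coverage_status": "slice", "pages_total": 3, "pages_completed_total": 1}))
```

Treating unknown or missing counters as incomplete keeps the guard on the safe side for ingestion.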

Safe Readiness

Use readiness="analysis_ready" for ingestion paths that must not consume immature submitted-date windows. If a completed window is still inside the maturity lag, Huldra returns:

  • status="immature"
  • ready=false
  • analysis_ready=false
  • blocked_reason="immature_window"
  • an empty papers list
  • cached_papers_total with the number of suppressed cached papers

Use readiness="raw_completed" for exploratory reads that may inspect same-day metadata. Raw reads can return papers from an immature window, but they still report analysis_ready=false, mature=false, and blocked_reason="immature_window".
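An ingestion path can enforce this on its own side as well. The sketch below assumes the response has been decoded into a dict that uses the field names above; the helper name papers_for_ingestion is hypothetical:

```python
def papers_for_ingestion(response: dict) -> list:
    """Return papers only when the response is marked analysis-ready."""
    if response.get("analysis_ready") is True:
        return list(response.get("papers", []))
    # Raw reads of an immature window may still carry papers; they are
    # deliberately ignored here because analysis_ready is false.
    return []


raw_read = {
    "analysis_ready": False,
    "mature": False,
    "blocked_reason": "immature_window",
    "papers": [{"arxiv_id": "2401.00001"}],
}
print(papers_for_ingestion(raw_read))  # []
```

Keying on analysis_ready rather than on the presence of papers means a raw exploratory response cannot leak into ingestion by accident.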

Set request-level maturity_lag_days=0 only when the caller explicitly wants to disable maturity blocking. This field changes readiness interpretation; it does not change the cache key.
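One way to picture the maturity lag is as a delay after the end of the window. The rule below is an illustrative assumption, not Huldra's documented algorithm, and is_mature is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone


def is_mature(submitted_end: datetime, now: datetime, maturity_lag_days: int) -> bool:
    """Assumed rule: a window matures once `now` is at least
    maturity_lag_days past the end of the window."""
    return now >= submitted_end + timedelta(days=maturity_lag_days)


window_end = datetime(2026, 5, 21, tzinfo=timezone.utc)
now = datetime(2026, 5, 21, 12, 0, tzinfo=timezone.utc)
print(is_mature(window_end, now, maturity_lag_days=2))  # False
print(is_mature(window_end, now, maturity_lag_days=0))  # True
```

Under this reading, a lag of 0 makes every completed window mature immediately, which matches the idea of disabling maturity blocking.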

Submitted-date bounds must be UTC minute-aligned. Huldra rejects bounds with seconds or microseconds instead of silently widening or narrowing the window.
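Callers can run the same check before submitting a request. This is a minimal sketch of such a validation; the function name require_utc_minute_aligned is hypothetical, and Huldra's own error type and messages may differ:

```python
from datetime import datetime, timedelta, timezone


def require_utc_minute_aligned(value: datetime) -> datetime:
    """Return the bound unchanged, or raise if it is not UTC minute-aligned."""
    if value.tzinfo is None or value.utcoffset() != timedelta(0):
        raise ValueError("bound must be timezone-aware UTC")
    if value.second or value.microsecond:
        # Reject instead of rounding, so the window is never silently changed.
        raise ValueError("bound must not carry seconds or microseconds")
    return value


aligned = require_utc_minute_aligned(datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc))
print(aligned.isoformat())  # 2026-05-20T00:00:00+00:00
```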

HTTP API

curl http://127.0.0.1:8765/v1/status

curl -X POST http://127.0.0.1:8765/v1/requests \
  -H 'content-type: application/json' \
  -d '{"client_id":"demo","search_query":"cat:cs.AI","max_results":10}'
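The same POST can be issued from Python with only the standard library. This sketch builds the request object without sending it, so it runs even when no daemon is listening. The helper name build_submit_request is hypothetical, and the response shape is not shown because it is not documented here:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"


def build_submit_request(client_id: str, search_query: str, max_results: int) -> urllib.request.Request:
    """Build the POST /v1/requests call with the same JSON body as the curl example."""
    body = json.dumps(
        {"client_id": client_id, "search_query": search_query, "max_results": max_results}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/requests",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


request = build_submit_request("demo", "cat:cs.AI", 10)
print(request.get_method(), request.full_url)

# To send it against a running daemon:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```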

Rate Limits And 429 Cooldown

Huldra keeps all arXiv legacy API access behind one durable limiter. The default request interval is 5 seconds, which is more conservative than arXiv's 3 second minimum. Only one upstream fetch lease can be held at a time.

When arXiv returns HTTP 429, Huldra persists cooldown_until in SQLite. New requests can still be queued, but workers will not probe upstream again until the cooldown expires.
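The idea of a durable limiter with a persisted cooldown can be sketched in a few lines of Python and SQLite. This is an illustration only: the table, columns, and class name are hypothetical and do not reflect Huldra's actual schema, and the upstream lease is left out:

```python
import sqlite3


class DurableLimiter:
    """Illustrative single-row limiter whose state lives in SQLite."""

    def __init__(self, conn: sqlite3.Connection, interval_seconds: float = 5.0):
        self.conn = conn
        self.interval = interval_seconds
        conn.execute(
            "CREATE TABLE IF NOT EXISTS rate_state ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "next_allowed_at REAL NOT NULL, cooldown_until REAL NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO rate_state VALUES (1, 0, 0)")
        conn.commit()

    def try_acquire(self, now: float) -> bool:
        """Reserve the next upstream slot, or return False if it is too early."""
        with self.conn:  # commits on success, rolls back on error
            # Take the write lock before reading so two processes sharing
            # the database file cannot both claim the same slot.
            self.conn.execute("BEGIN IMMEDIATE")
            next_allowed, cooldown = self.conn.execute(
                "SELECT next_allowed_at, cooldown_until FROM rate_state WHERE id = 1"
            ).fetchone()
            if now < max(next_allowed, cooldown):
                return False
            self.conn.execute(
                "UPDATE rate_state SET next_allowed_at = ? WHERE id = 1",
                (now + self.interval,),
            )
            return True

    def record_429(self, now: float, cooldown_seconds: float) -> None:
        """Persist a cooldown so later processes also stay away from upstream."""
        with self.conn:
            self.conn.execute(
                "UPDATE rate_state SET cooldown_until = ? WHERE id = 1",
                (now + cooldown_seconds,),
            )


limiter = DurableLimiter(sqlite3.connect(":memory:"), interval_seconds=5.0)
print(limiter.try_acquire(100.0), limiter.try_acquire(102.0))  # True False
```

Because both timestamps are stored in the database rather than in process memory, a restarted worker sees the same interval and cooldown as the one it replaced.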

OAI-PMH Harvesting

The OAI-PMH surface uses https://oaipmh.arxiv.org/oai by default and stores harvest jobs, page state, watermarks, raw OAI records, deleted headers, and normalized paper metadata. Incremental harvests use the last successful server response date or datestamp watermark unless --from is provided explicitly. Watermarks advance only after all pages in the harvest succeed. If a harvest stops after receiving a resumption token, rerun the same harvest and Huldra will continue from the saved token. To continue from a specific token, pass --resumption-token.
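The resume and watermark rules can be sketched as a small paging loop. Everything here is a simplification under stated assumptions: the state is a plain dict rather than SQLite, fetch_page is a hypothetical callable that returns (records, next_token, response_date), and the real harvester's bookkeeping is richer:

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[str], Optional[str], str]  # records, next token, response date


def harvest(fetch_page: Callable[[Optional[str]], Page], state: dict) -> List[str]:
    """Page through a harvest, saving the resumption token after every page."""
    records: List[str] = []
    token = state.get("resumption_token")  # resume from a saved token if present
    while True:
        page_records, next_token, response_date = fetch_page(token)
        records.extend(page_records)
        state["resumption_token"] = next_token  # saved after each page
        if next_token is None:
            # The watermark advances only once every page has succeeded.
            state["watermark"] = response_date
            return records
        token = next_token


pages = {None: (["rec-a"], "t1", "2026-05-20"), "t1": (["rec-b"], None, "2026-05-20")}
state = {"watermark": "2026-05-01"}
print(harvest(lambda token: pages[token], state), state["watermark"])
```

If fetch_page raises partway through, the saved token stays in state and the watermark is left untouched, so calling harvest again continues from the interrupted page.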

Use legacy search for request-sized slices and complete-window maintenance. Use OAI-PMH for full mirrors, category-scoped mirrors, and datestamp-based incremental sync.

Metadata-Only Boundary

This package stores descriptive metadata from arXiv: IDs, titles, abstracts, authors, categories, publication dates, comments, journal references, DOIs, OAI identifiers, OAI datestamps, set specs, license fields, deleted-record state, and raw metadata needed for reprocessing. It does not cache or serve PDFs, source tarballs, generated full text, or paper HTML.

Non-Goals

  • No Recoleta dependency or runtime adapter.
  • No PDF, source, or full-text cache.
  • No multi-machine distributed limiter. To serve more than one machine, run a single shared broker; a shared rate-state backend would be future work.
