Local-first arXiv metadata broker with shared cache, queue, and durable rate limiting.

Huldra

Huldra is a local arXiv metadata broker for one machine. Programs that need arXiv papers can ask Huldra for metadata instead of each calling export.arxiv.org directly. Huldra shares a SQLite cache, request queue, durable rate limiter, cooldown state, and upstream lease across those programs.

Huldra is an independent package and CLI. It is not a plugin for another project, and it does not depend on Recoleta.

Install

Install the published package:

pip install huldra-arxiv
huldra --help

For local development:

uv sync --dev
uv run huldra --help

The PyPI package name is huldra-arxiv. The Python package and CLI command are still named huldra.

The default database is:

~/.local/share/huldra/huldra.db

Override it per command with --db PATH, or set the HULDRA_DB_PATH environment variable.
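A caller script that wants to open the same database could resolve the path as sketched below. This is a minimal sketch, not Huldra's own code: resolve_db_path is a hypothetical helper, and the precedence (explicit --db value first, then HULDRA_DB_PATH, then the default) is an assumption that the text above does not state.

```python
import os
from pathlib import Path

# Default location quoted in the README.
DEFAULT_DB_PATH = Path("~/.local/share/huldra/huldra.db")


def resolve_db_path(cli_db=None, env=None):
    """Hypothetical helper: pick the database path a caller should use.

    Assumed precedence (not confirmed by the README):
    explicit --db value, then HULDRA_DB_PATH, then the default path.
    """
    env = os.environ if env is None else env
    chosen = cli_db or env.get("HULDRA_DB_PATH") or DEFAULT_DB_PATH
    return Path(chosen).expanduser()
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.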

Run Locally

Initialize a store:

uv run huldra store init --db ~/.local/share/huldra/huldra.db

Start the local HTTP API:

uv run huldra daemon --db ~/.local/share/huldra/huldra.db --host 127.0.0.1 --port 8765

Run a foreground worker in a separate terminal:

uv run huldra worker --db ~/.local/share/huldra/huldra.db --poll-interval-seconds 300 --json

Check status:

uv run huldra status --db ~/.local/share/huldra/huldra.db --json

Status includes queue depth, cache totals, durable upstream 429 totals, cooldown state, worker heartbeat, worker next wake, and the last worker error.

The API binds to 127.0.0.1 by default. Do not expose it to a public network without a reverse proxy and authentication.

CLI Query

Submit a query without waiting for the worker:

uv run huldra query \
  --db ~/.local/share/huldra/huldra.db \
  --client-id demo \
  --search-query 'cat:cs.AI AND all:agent' \
  --max-results 50 \
  --json

Read a completed result:

uv run huldra result --db ~/.local/share/huldra/huldra.db --cache-key KEY --json

huldra result is a raw cache inspection command. It reports whether the stored cache entry is readable and returns cached papers when it can. It does not reinterpret the cache for a caller's analysis_ready policy.

Look up one cached paper:

uv run huldra paper --db ~/.local/share/huldra/huldra.db --arxiv-id 2401.00001 --json

Sync one UTC day of submitted-date results, optionally waiting inline for the worker to finish:

uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --wait \
  --json

Backfill daily submitted-date windows:

uv run huldra backfill \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI' \
  --start-date 2026-05-01 \
  --end-date 2026-05-20 \
  --max-results 60 \
  --json
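To picture what a backfill covers, the date range can be thought of as one submitted-date window per UTC day. The sketch below only illustrates that splitting. daily_windows is a hypothetical helper, and it assumes the end date is inclusive and that each window runs from midnight UTC to the next midnight; Huldra's actual windowing may differ.

```python
from datetime import date, datetime, time, timedelta, timezone


def daily_windows(start_date, end_date):
    """Yield (start, end) UTC datetimes, one pair per day.

    Assumption: end_date is inclusive and each window covers
    [00:00 UTC, next day 00:00 UTC).
    """
    day = start_date
    while day <= end_date:
        start = datetime.combine(day, time(0, 0), tzinfo=timezone.utc)
        yield start, start + timedelta(days=1)
        day += timedelta(days=1)


# Same range as the backfill command above.
windows = list(daily_windows(date(2026, 5, 1), date(2026, 5, 20)))
print(len(windows))  # 20 daily windows
```

Midnight bounds have no seconds or microseconds, so windows built this way are minute-aligned.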

Python Client

from huldra.client import HuldraClient

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    result = client.ensure_search(
        search_query="cat:cs.AI AND all:agent",
        max_results=50,
        wait=True,
    )
    print(result.status, result.papers_total)

For Recoleta-style pre-syncs, call the client's maintenance surface (sync_windows) instead of shelling out to the CLI:

from datetime import UTC, datetime, timedelta

from huldra.client import HuldraClient
from huldra.models import ArxivRequest, CachePolicy, ReadinessMode

day = datetime(2026, 5, 20, tzinfo=UTC)
request = ArxivRequest(
    client_id="recoleta:embodied_ai",
    search_query="cat:cs.AI",
    submitted_start=day,
    submitted_end=day + timedelta(days=1),
    max_results=60,
    cache_policy=CachePolicy.CACHE_ONLY,
    readiness=ReadinessMode.ANALYSIS_READY,
)

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    summary = client.sync_windows([request], wait=True, wait_timeout_seconds=30)
    print(summary.completed_windows_total, summary.upstream_requests_total)

Maintenance completion means the raw cache is readable. The per-request serving_status still tells you whether the same cache is currently accepted by the request's readiness mode.

Safe Readiness

Use readiness="analysis_ready" for ingestion paths that must not consume immature submitted-date windows. If a completed window is still inside the maturity lag, Huldra returns:

  • status="immature"
  • ready=false
  • analysis_ready=false
  • blocked_reason="immature_window"
  • an empty papers list
  • cached_papers_total with the number of suppressed cached papers
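An ingestion path might guard on these fields before consuming anything. The sketch below treats the response as a plain dict. The field names come from the list above, while the papers key name, the dict shape, and the usable_papers helper are assumptions made for illustration.

```python
def usable_papers(response):
    """Return papers only when the response is ready for analysis.

    Hypothetical guard: anything that is not both ready and
    analysis_ready yields an empty list, so immature windows
    are never ingested.
    """
    if response.get("ready") and response.get("analysis_ready"):
        return response.get("papers", [])
    return []


# Example payload shaped after the immature-window fields listed above.
immature = {
    "status": "immature",
    "ready": False,
    "analysis_ready": False,
    "blocked_reason": "immature_window",
    "papers": [],
    "cached_papers_total": 12,  # illustrative count of suppressed papers
}

print(usable_papers(immature))  # []
```

Checking the flags rather than the papers list means the same guard also rejects raw reads that return papers from an immature window.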

Use readiness="raw_completed" for exploratory reads that may inspect same-day metadata. Raw reads can return papers from an immature window, but they still report analysis_ready=false, mature=false, and blocked_reason="immature_window".

Set request-level maturity_lag_days=0 only when the caller explicitly wants to disable maturity blocking. This field changes readiness interpretation; it does not change the cache key.

Submitted-date bounds must be UTC minute-aligned. Huldra rejects bounds with seconds or microseconds instead of silently widening or narrowing the window.
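Callers can check bounds before sending them, so that a rejected request does not come as a surprise. This is a sketch of one such client-side check. require_minute_aligned_utc is a hypothetical helper, not part of Huldra's API, and Huldra's own validation may be stricter.

```python
from datetime import datetime, timedelta, timezone


def require_minute_aligned_utc(bound):
    """Raise ValueError unless `bound` is timezone-aware UTC with no seconds or microseconds."""
    if bound.tzinfo is None or bound.utcoffset() != timedelta(0):
        raise ValueError("bound must be timezone-aware UTC")
    if bound.second or bound.microsecond:
        raise ValueError("bound must be minute-aligned (no seconds or microseconds)")
    return bound


# Accepted: midnight UTC.
require_minute_aligned_utc(datetime(2026, 5, 20, tzinfo=timezone.utc))
```

Raising instead of rounding mirrors the rule above: the window is never silently widened or narrowed.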

HTTP API

curl http://127.0.0.1:8765/v1/status

curl -X POST http://127.0.0.1:8765/v1/requests \
  -H 'content-type: application/json' \
  -d '{"client_id":"demo","search_query":"cat:cs.AI","max_results":10}'
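The same POST can be made from Python with only the standard library. The endpoint and payload below are taken from the curl example. The helper names are hypothetical, and parsing the response as JSON is an assumption about the API.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"


def build_submit_request(client_id, search_query, max_results):
    """Build (but do not send) the POST /v1/requests call shown in the curl example."""
    body = json.dumps(
        {
            "client_id": client_id,
            "search_query": search_query,
            "max_results": max_results,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/requests",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def submit(client_id, search_query, max_results, timeout=10):
    """Send the request to a running daemon; assumes the body of the reply is JSON."""
    request = build_submit_request(client_id, search_query, max_results)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the daemon running, submit("demo", "cat:cs.AI", 10) sends the same request as the curl command.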

Rate Limits And 429 Cooldown

Huldra keeps all arXiv legacy API access behind one durable limiter. The default request interval is 5 seconds, which is more conservative than arXiv's 3-second minimum. Only one upstream fetch lease can be held at a time.
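One way to picture a durable limiter is a single SQLite row that records when the next upstream request is allowed. The sketch below illustrates that idea only. The table name, column, and functions are all hypothetical and do not describe Huldra's actual schema or code.

```python
import sqlite3


def open_limiter(path=":memory:"):
    """Open a connection and make sure the single limiter row exists (hypothetical schema)."""
    conn = sqlite3.connect(path, isolation_level=None)  # manual transaction control
    conn.execute(
        "CREATE TABLE IF NOT EXISTS limiter ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "next_allowed_at REAL NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO limiter (id, next_allowed_at) VALUES (1, 0)")
    return conn


def try_acquire(conn, now, interval_seconds=5.0):
    """Return True and reserve the next slot if an upstream request may start at `now`."""
    # BEGIN IMMEDIATE takes the write lock up front, so two processes
    # sharing the file cannot both pass the check.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (next_allowed_at,) = conn.execute(
            "SELECT next_allowed_at FROM limiter WHERE id = 1"
        ).fetchone()
        allowed = now >= next_allowed_at
        if allowed:
            conn.execute(
                "UPDATE limiter SET next_allowed_at = ? WHERE id = 1",
                (now + interval_seconds,),
            )
        conn.execute("COMMIT")
        return allowed
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the state lives in the database file rather than in process memory, the interval survives restarts and is shared by every program that opens the same file.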

When arXiv returns HTTP 429, Huldra persists cooldown_until in SQLite. New requests can still be queued, but workers will not probe upstream again until the cooldown expires.
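The cooldown behaviour can be sketched in the same spirit: persist a timestamp after a 429 and check it before any upstream probe. Only the cooldown_until name comes from the text above. The table, the functions, and the cooldown duration are assumptions for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open a connection with a single-row state table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS upstream_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "cooldown_until REAL NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT OR IGNORE INTO upstream_state (id) VALUES (1)")
    conn.commit()
    return conn


def record_429(conn, now, cooldown_seconds):
    """Persist a cooldown after an HTTP 429; never shortens an existing cooldown."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE upstream_state "
            "SET cooldown_until = MAX(cooldown_until, ?) WHERE id = 1",
            (now + cooldown_seconds,),
        )


def in_cooldown(conn, now):
    """Return True while workers should not contact upstream."""
    (until,) = conn.execute(
        "SELECT cooldown_until FROM upstream_state WHERE id = 1"
    ).fetchone()
    return now < until
```

Queueing stays independent of this check: a worker would consult in_cooldown before fetching, while new requests can still be written to the queue.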

Metadata-Only Boundary

This MVP stores descriptive metadata from the arXiv API: IDs, titles, abstracts, authors, categories, publication dates, comments, journal references, DOIs, and small provenance fields. It does not cache or serve PDFs, source tarballs, generated full text, or paper HTML.

Non-Goals

  • No Recoleta dependency or runtime adapter.
  • No PDF, source, or full-text cache.
  • No multi-machine distributed limiter. To serve more than one machine, run one shared broker; a shared rate-state backend would be future work.
  • No OAI-PMH backend yet. The current backend is the legacy search API.
