Local-first arXiv metadata broker with shared cache, queue, and durable rate limiting.
Huldra
Huldra is a local arXiv metadata broker for one machine. Programs that need arXiv papers can ask Huldra for metadata instead of each calling arXiv directly. Huldra shares a SQLite cache, request queue, durable rate limiter, cooldown state, and upstream lease across those programs.
Huldra is an independent package and CLI. It is not a plugin for another project, and it does not depend on Recoleta.
Install
Install the published package:
pip install huldra-arxiv
huldra --help
For local development:
uv sync --dev
uv run huldra --help
The PyPI package name is huldra-arxiv. The Python package and CLI command are
still named huldra.
The default database is:
~/.local/share/huldra/huldra.db
Override it per command with --db PATH or with HULDRA_DB_PATH.
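The precedence between the two overrides is not spelled out above. As a minimal sketch, assuming the `--db` flag wins over `HULDRA_DB_PATH`, which in turn wins over the default, the lookup could work like this. `resolve_db_path` is a hypothetical helper for illustration, not part of Huldra:

```python
import os
from pathlib import Path

DEFAULT_DB_PATH = "~/.local/share/huldra/huldra.db"


def resolve_db_path(cli_db=None, env=None):
    """Hypothetical helper: pick the database path for one command.

    Assumed precedence: --db flag, then HULDRA_DB_PATH, then the default.
    """
    env = os.environ if env is None else env
    chosen = cli_db or env.get("HULDRA_DB_PATH") or DEFAULT_DB_PATH
    # Expand "~" so the default resolves inside the user's home directory.
    return Path(chosen).expanduser()
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without touching the real process environment.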
Run Locally
Initialize a store:
uv run huldra store init --db ~/.local/share/huldra/huldra.db
Start the local HTTP API:
uv run huldra daemon --db ~/.local/share/huldra/huldra.db --host 127.0.0.1 --port 8765
Run a foreground worker in a separate terminal:
uv run huldra worker --db ~/.local/share/huldra/huldra.db --poll-interval-seconds 300 --json
Check status:
uv run huldra status --db ~/.local/share/huldra/huldra.db --json
Status includes queue depth, cache totals, durable upstream 429 totals, cooldown state, worker heartbeat, worker next wake, and the last worker error.
The API binds to 127.0.0.1 by default. Do not expose it to a public network
without a reverse proxy and authentication.
CLI Query
Submit a query without waiting for the worker:
uv run huldra query \
  --db ~/.local/share/huldra/huldra.db \
  --client-id demo \
  --search-query 'cat:cs.AI AND all:agent' \
  --max-results 50 \
  --json
Read a completed result:
uv run huldra result --db ~/.local/share/huldra/huldra.db --cache-key KEY --json
huldra result is a raw cache inspection command. It reports whether the
stored cache entry is readable and returns cached papers when it can. It does
not reinterpret the cache for a caller's analysis_ready policy.
Look up one cached paper:
uv run huldra paper --db ~/.local/share/huldra/huldra.db --arxiv-id 2401.00001 --json
Sync a submitted-date UTC day and optionally wait for the worker path inline.
By default this completes one legacy search slice and reports
coverage_status="slice" even when arXiv says more results exist:
uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --wait \
  --json
Fetch every legacy search page for a bounded window by opting into complete window mode:
uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --mode complete-window \
  --wait \
  --json
Backfill daily submitted-date windows:
uv run huldra backfill \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI' \
  --start-date 2026-05-01 \
  --end-date 2026-05-20 \
  --max-results 60 \
  --json
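To make the shape of a daily backfill concrete, here is a small sketch that expands a date range into one half-open UTC window per day. It assumes `--end-date` is inclusive, which is not stated above, and `daily_windows` is a hypothetical helper rather than Huldra's own code:

```python
from datetime import date, datetime, timedelta, timezone


def daily_windows(start_date, end_date):
    """Yield (start, end) UTC datetimes, one half-open window per day.

    Assumption: the end date is inclusive, so the last window starts on it.
    """
    day = start_date
    while day <= end_date:
        start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        yield start, start + timedelta(days=1)
        day += timedelta(days=1)


windows = list(daily_windows(date(2026, 5, 1), date(2026, 5, 20)))
print(len(windows), windows[0][0].isoformat(), windows[-1][1].isoformat())
```

Each window starts at midnight UTC, so its bounds carry no seconds or microseconds.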
Run an OAI-PMH harvest for complete or category-scoped metadata sync:
uv run huldra harvest oai \
  --db ~/.local/share/huldra/huldra.db \
  --metadata-prefix arXiv \
  --set cs:cs:AI \
  --mode incremental \
  --json
Python Client
from huldra.client import HuldraClient

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    result = client.ensure_search(
        search_query="cat:cs.AI AND all:agent",
        max_results=50,
        wait=True,
    )
    print(result.status, result.papers_total)
For Recoleta-style pre-syncs, call the maintenance surface instead of shelling out to the CLI:
from datetime import UTC, datetime, timedelta

from huldra.client import HuldraClient
from huldra.models import ArxivRequest, CachePolicy, ReadinessMode

day = datetime(2026, 5, 20, tzinfo=UTC)
request = ArxivRequest(
    client_id="recoleta:embodied_ai",
    search_query="cat:cs.AI",
    submitted_start=day,
    submitted_end=day + timedelta(days=1),
    max_results=60,
    cache_policy=CachePolicy.CACHE_ONLY,
    readiness=ReadinessMode.ANALYSIS_READY,
)

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    summary = client.sync_windows([request], wait=True, wait_timeout_seconds=30)
    print(summary.completed_windows_total, summary.upstream_requests_total)
Maintenance completion means the raw cache is readable. The per-request
serving_status still tells you whether the same cache is currently accepted
by the request's readiness mode. For legacy search, check coverage_status,
completed_slices_total, pages_total, and pages_completed_total before
treating a window as complete.
Safe Readiness
Use readiness="analysis_ready" for ingestion paths that must not consume
immature submitted-date windows. If a completed window is still inside the
maturity lag, Huldra returns:
- status="immature"
- ready=false
- analysis_ready=false
- blocked_reason="immature_window"
- an empty papers list
- cached_papers_total with the number of suppressed cached papers
Use readiness="raw_completed" for exploratory reads that may inspect same-day
metadata. Raw reads can return papers from an immature window, but they still
report analysis_ready=false, mature=false, and
blocked_reason="immature_window".
Set request-level maturity_lag_days=0 only when the caller explicitly wants
to disable maturity blocking. This field changes readiness interpretation; it
does not change the cache key.
Submitted-date bounds must be UTC minute-aligned. Huldra rejects bounds with seconds or microseconds instead of silently widening or narrowing the window.
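A caller can check bounds before submitting them. The validator below is a hypothetical client-side sketch of the rule just described, not the check Huldra itself runs:

```python
from datetime import datetime, timedelta, timezone


def check_minute_aligned_utc(bound):
    """Hypothetical validator: reject bounds that are not UTC minute-aligned."""
    if bound.tzinfo is None or bound.utcoffset() != timedelta(0):
        raise ValueError("bound must be timezone-aware UTC")
    if bound.second or bound.microsecond:
        raise ValueError("bound must not carry seconds or microseconds")
    return bound


start = check_minute_aligned_utc(datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc))
```

Raising instead of rounding mirrors the behavior above: the caller has to decide which minute they meant.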
HTTP API
curl http://127.0.0.1:8765/v1/status
curl -X POST http://127.0.0.1:8765/v1/requests \
  -H 'content-type: application/json' \
  -d '{"client_id":"demo","search_query":"cat:cs.AI","max_results":10}'
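For callers that prefer not to shell out to curl, the same POST can be built with the Python standard library. This sketch only constructs the request; sending it needs a running daemon, and since the response schema is not documented here it is treated as generic JSON. `build_submit_request` is a hypothetical helper:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"


def build_submit_request(client_id, search_query, max_results):
    """Build the same POST as the curl example, without sending it."""
    body = json.dumps({
        "client_id": client_id,
        "search_query": search_query,
        "max_results": max_results,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/requests",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


req = build_submit_request("demo", "cat:cs.AI", 10)
# To send it against a running daemon (response shape is an assumption):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```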
Rate Limits And 429 Cooldown
Huldra keeps all arXiv legacy API access behind one durable limiter. The default request interval is 5 seconds, which is more conservative than arXiv's 3-second minimum. Only one upstream fetch lease can be held at a time.
When arXiv returns HTTP 429, Huldra persists cooldown_until in SQLite. New
requests can still be queued, but workers will not probe upstream again until
the cooldown expires.
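The idea of a durable limiter with a persisted cooldown can be sketched with the standard `sqlite3` module. The single-row `limiter` table, its column `next_allowed_at`, and the helper names are assumptions made for this illustration; Huldra's actual schema is not shown here, and the sketch does not model the upstream lease:

```python
import sqlite3

MIN_INTERVAL_SECONDS = 5.0  # default request interval from the text


def open_state(path=":memory:"):
    """Create the hypothetical single-row limiter table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS limiter ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "next_allowed_at REAL NOT NULL, cooldown_until REAL NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO limiter VALUES (1, 0, 0)")
    conn.commit()
    return conn


def try_acquire(conn, now):
    """Return True and reserve the next slot if an upstream call is allowed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE limiter SET next_allowed_at = ? "
            "WHERE id = 1 AND next_allowed_at <= ? AND cooldown_until <= ?",
            (now + MIN_INTERVAL_SECONDS, now, now),
        )
    return cur.rowcount == 1


def record_429(conn, now, cooldown_seconds):
    """Persist cooldown_until so restarts and other processes honor it."""
    with conn:
        conn.execute(
            "UPDATE limiter SET cooldown_until = MAX(cooldown_until, ?) "
            "WHERE id = 1",
            (now + cooldown_seconds,),
        )
```

Because the check and the reservation happen in one `UPDATE`, two processes sharing the database file cannot both claim the same slot. The cooldown length is supplied by the caller in this sketch; how Huldra chooses it is not specified above.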
OAI-PMH Harvesting
The OAI-PMH surface uses https://oaipmh.arxiv.org/oai by default and stores
harvest jobs, page state, watermarks, raw OAI records, deleted headers, and
normalized paper metadata. Incremental harvests use the last successful server
response date or datestamp watermark unless --from is provided explicitly.
Watermarks advance only after all pages in the harvest succeed. If a harvest
stops after receiving a resumption token, rerun the same harvest and Huldra will
continue from the saved token. To continue from a specific token, pass
--resumption-token.
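The resume and watermark rules can be illustrated with a short loop. This is a sketch under assumptions: `fetch_page` is a hypothetical callable returning `(records, next_token, response_date)`, and `state` is a plain dict standing in for the persisted harvest job:

```python
def run_harvest(fetch_page, state):
    """Hypothetical harvest loop with resumable tokens and a guarded watermark.

    If fetch_page raises, the saved token stays in `state` and the
    watermark is left untouched, so a rerun continues from that page.
    """
    token = state.get("resumption_token")
    while True:
        records, next_token, response_date = fetch_page(token)
        state.setdefault("records", []).extend(records)
        # Save the token after each page so a rerun resumes from here.
        state["resumption_token"] = next_token
        if next_token is None:
            # Every page succeeded: only now may the watermark advance.
            state["watermark"] = response_date
            return state
        token = next_token
```

Calling `run_harvest` again with the same `state` after a failure picks up at the saved token instead of starting from the first page.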
Use legacy search for request-sized slices and complete-window maintenance. Use OAI-PMH for full mirrors, category-scoped mirrors, and datestamp-based incremental sync.
Metadata-Only Boundary
This package stores descriptive metadata from arXiv: IDs, titles, abstracts, authors, categories, publication dates, comments, journal references, DOIs, OAI identifiers, OAI datestamps, set specs, license fields, deleted-record state, and raw metadata needed for reprocessing. It does not cache or serve PDFs, source tarballs, generated full text, or paper HTML.
Non-Goals
- No Recoleta dependency or runtime adapter.
- No PDF, source, or full-text cache.
- No multi-machine distributed limiter. For more than one machine, run one shared broker or add a future shared rate-state backend.