Local-first arXiv metadata broker with shared cache, queue, and durable rate limiting.
Huldra
Huldra is a local arXiv metadata broker for one machine. Programs that need
arXiv papers can ask Huldra for metadata instead of each calling
export.arxiv.org directly. Huldra shares a SQLite cache, request queue,
durable rate limiter, cooldown state, and upstream lease across those programs.
Huldra is an independent package and CLI. It is not a plugin for another project, and it does not depend on Recoleta.
Install
Install the published package:
pip install huldra-arxiv
huldra --help
For local development:
uv sync --dev
uv run huldra --help
The PyPI package name is huldra-arxiv. The Python package and CLI command are
still named huldra.
The default database is:
~/.local/share/huldra/huldra.db
Override it per command with --db PATH or with HULDRA_DB_PATH.
Run Locally
Initialize a store:
uv run huldra store init --db ~/.local/share/huldra/huldra.db
Start the local HTTP API:
uv run huldra daemon --db ~/.local/share/huldra/huldra.db --host 127.0.0.1 --port 8765
Run a foreground worker in a separate terminal:
uv run huldra worker --db ~/.local/share/huldra/huldra.db --poll-interval-seconds 300 --json
Check status:
uv run huldra status --db ~/.local/share/huldra/huldra.db --json
Status includes queue depth, cache totals, durable upstream 429 totals, cooldown state, worker heartbeat, worker next wake, and the last worker error.
The API binds to 127.0.0.1 by default. Do not expose it to a public network
without a reverse proxy and authentication.
CLI Query
Submit a query without waiting for the worker:
uv run huldra query \
  --db ~/.local/share/huldra/huldra.db \
  --client-id demo \
  --search-query 'cat:cs.AI AND all:agent' \
  --max-results 50 \
  --json
Read a completed result:
uv run huldra result --db ~/.local/share/huldra/huldra.db --cache-key KEY --json
huldra result is a raw cache inspection command. It reports whether the
stored cache entry is readable and returns cached papers when it can. It does
not reinterpret the cache for a caller's analysis_ready policy.
Look up one cached paper:
uv run huldra paper --db ~/.local/share/huldra/huldra.db --arxiv-id 2401.00001 --json
Sync a submitted-date UTC day and optionally wait for the worker path inline:
uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --wait \
  --json
Backfill daily submitted-date windows:
uv run huldra backfill \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI' \
  --start-date 2026-05-01 \
  --end-date 2026-05-20 \
  --max-results 60 \
  --json
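To make "daily submitted-date windows" concrete, the following sketch enumerates one UTC window per day for a date range. The helper `daily_windows` is hypothetical, and treating the end date as inclusive and each window as half-open are assumptions, not confirmed behavior of `huldra backfill`.

```python
from datetime import date, datetime, timedelta, timezone


def daily_windows(start_date: date, end_date: date) -> list[tuple[datetime, datetime]]:
    """Return one [start, end) UTC window per day (end_date assumed inclusive)."""
    if end_date < start_date:
        raise ValueError("end_date must not be before start_date")
    windows = []
    day = start_date
    while day <= end_date:
        # Midnight UTC is minute-aligned, so each bound has no seconds.
        start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        windows.append((start, start + timedelta(days=1)))
        day += timedelta(days=1)
    return windows


windows = daily_windows(date(2026, 5, 1), date(2026, 5, 20))
print(len(windows), windows[0][0].isoformat(), windows[-1][1].isoformat())
```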
Python Client
from huldra.client import HuldraClient
with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    result = client.ensure_search(
        search_query="cat:cs.AI AND all:agent",
        max_results=50,
        wait=True,
    )
    print(result.status, result.papers_total)
For Recoleta-style pre-syncs, call the maintenance surface instead of shelling out to the CLI:
from datetime import UTC, datetime, timedelta
from huldra.client import HuldraClient
from huldra.models import ArxivRequest, CachePolicy, ReadinessMode
day = datetime(2026, 5, 20, tzinfo=UTC)
request = ArxivRequest(
    client_id="recoleta:embodied_ai",
    search_query="cat:cs.AI",
    submitted_start=day,
    submitted_end=day + timedelta(days=1),
    max_results=60,
    cache_policy=CachePolicy.CACHE_ONLY,
    readiness=ReadinessMode.ANALYSIS_READY,
)
with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    summary = client.sync_windows([request], wait=True, wait_timeout_seconds=30)
    print(summary.completed_windows_total, summary.upstream_requests_total)
Maintenance completion means the raw cache is readable. The per-request
serving_status still tells you whether the same cache is currently accepted
by the request's readiness mode.
Safe Readiness
Use readiness="analysis_ready" for ingestion paths that must not consume
immature submitted-date windows. If a completed window is still inside the
maturity lag, Huldra returns:
- status="immature"
- ready=false
- analysis_ready=false
- blocked_reason="immature_window"
- an empty papers list
- cached_papers_total with the number of suppressed cached papers
Use readiness="raw_completed" for exploratory reads that may inspect same-day
metadata. Raw reads can return papers from an immature window, but they still
report analysis_ready=false, mature=false, and
blocked_reason="immature_window".
Set request-level maturity_lag_days=0 only when the caller explicitly wants
to disable maturity blocking. This field changes readiness interpretation; it
does not change the cache key.
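The two readiness modes can be sketched as a small decision function. This is an illustration only: `serve_window` is a hypothetical name, the maturity rule (a window matures once the lag has elapsed after its end, and a lag of 0 disables blocking) is an assumption, and the fields returned for a mature window are guesses beyond what is described above.

```python
from datetime import datetime, timedelta, timezone


def serve_window(cached_papers, submitted_end, now, readiness, maturity_lag_days):
    """Interpret a completed window's cache under a readiness mode (illustrative)."""
    if readiness not in ("analysis_ready", "raw_completed"):
        raise ValueError(f"unknown readiness mode: {readiness!r}")
    # Assumption: mature once the lag has passed since the window end;
    # a lag of 0 disables maturity blocking.
    mature = (maturity_lag_days == 0
              or now >= submitted_end + timedelta(days=maturity_lag_days))
    if mature:
        return {"analysis_ready": True, "mature": True,
                "blocked_reason": None, "papers": list(cached_papers)}
    if readiness == "analysis_ready":
        # Suppress papers but report how many were withheld.
        return {"status": "immature", "ready": False, "analysis_ready": False,
                "blocked_reason": "immature_window", "papers": [],
                "cached_papers_total": len(cached_papers)}
    # raw_completed: papers are returned, flags still mark the window immature.
    return {"analysis_ready": False, "mature": False,
            "blocked_reason": "immature_window", "papers": list(cached_papers)}


end = datetime(2026, 5, 21, tzinfo=timezone.utc)
now = end + timedelta(hours=6)
papers = [{"arxiv_id": "2401.00001"}]
# A lag of 2 days is an arbitrary value chosen for the example.
print(serve_window(papers, end, now, "analysis_ready", 2))
print(serve_window(papers, end, now, "raw_completed", 2))
```

Because the lag only affects how the cache is interpreted, the same cached entry can serve both calls.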
Submitted-date bounds must be UTC minute-aligned. Huldra rejects bounds with seconds or microseconds instead of silently widening or narrowing the window.
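A caller can check bounds before submitting them. The sketch below shows one way to do that; `require_utc_minute_aligned` is a hypothetical helper, and Huldra's own validation and error messages may differ.

```python
from datetime import datetime, timedelta, timezone


def require_utc_minute_aligned(value: datetime, name: str) -> datetime:
    """Reject bounds that are not UTC or that carry seconds or microseconds."""
    if value.tzinfo is None or value.utcoffset() != timedelta(0):
        raise ValueError(f"{name} must be a timezone-aware UTC datetime")
    if value.second or value.microsecond:
        raise ValueError(f"{name} must be minute-aligned")
    return value


start = require_utc_minute_aligned(
    datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc), "submitted_start")
try:
    require_utc_minute_aligned(
        datetime(2026, 5, 20, 0, 0, 30, tzinfo=timezone.utc), "submitted_end")
except ValueError as exc:
    print(exc)
```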
HTTP API
curl http://127.0.0.1:8765/v1/status
curl -X POST http://127.0.0.1:8765/v1/requests \
  -H 'content-type: application/json' \
  -d '{"client_id":"demo","search_query":"cat:cs.AI","max_results":10}'
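The same POST can be built from Python with only the standard library. This sketch constructs the request without sending it; `build_submit_request` is a hypothetical helper, and the shape of the response is not described here, so it is left unparsed.

```python
import json
import urllib.request


def build_submit_request(base_url: str, client_id: str,
                         search_query: str, max_results: int) -> urllib.request.Request:
    """Build the POST /v1/requests call shown in the curl example."""
    payload = {"client_id": client_id, "search_query": search_query,
               "max_results": max_results}
    return urllib.request.Request(
        url=f"{base_url}/v1/requests",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


request = build_submit_request("http://127.0.0.1:8765", "demo", "cat:cs.AI", 10)
print(request.get_method(), request.full_url)
# With the daemon running, send it like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```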
Rate Limits And 429 Cooldown
Huldra keeps all arXiv legacy API access behind one durable limiter. The default request interval is 5 seconds, which is more conservative than arXiv's 3 second minimum. Only one upstream fetch lease can be held at a time.
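To show what "durable" means here, the following sketch keeps the next allowed request time in SQLite so that spacing survives process restarts. It illustrates the idea only: the table, column, and function names are invented, and Huldra's actual schema and lease handling are not shown in this README.

```python
import sqlite3

MIN_INTERVAL_SECONDS = 5.0  # default interval stated above


def open_limiter(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None lets us manage transactions explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS limiter ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), next_allowed_at REAL NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO limiter (id, next_allowed_at) VALUES (1, 0)")
    return conn


def reserve_slot(conn: sqlite3.Connection, now: float) -> float:
    """Reserve the next upstream slot; return seconds the caller must wait."""
    # BEGIN IMMEDIATE takes the write lock, so concurrent processes
    # sharing the same database file reserve slots one at a time.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (next_allowed_at,) = conn.execute(
            "SELECT next_allowed_at FROM limiter WHERE id = 1").fetchone()
        start_at = max(now, next_allowed_at)
        conn.execute("UPDATE limiter SET next_allowed_at = ? WHERE id = 1",
                     (start_at + MIN_INTERVAL_SECONDS,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return start_at - now


conn = open_limiter()
print(reserve_slot(conn, now=100.0), reserve_slot(conn, now=101.0))
```

Passing `now` in explicitly keeps the sketch deterministic; a real caller would use a wall-clock timestamp and sleep for the returned duration.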
When arXiv returns HTTP 429, Huldra persists cooldown_until in SQLite. New
requests can still be queued, but workers will not probe upstream again until
the cooldown expires.
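The cooldown check can be pictured as a single persisted timestamp that workers consult before touching upstream. In this sketch only the name `cooldown_until` comes from the text above; the table layout, function names, and the cooldown duration are assumptions.

```python
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS upstream_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "cooldown_until REAL NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT OR IGNORE INTO upstream_state (id) VALUES (1)")
    conn.commit()
    return conn


def record_429(conn: sqlite3.Connection, now: float, cooldown_seconds: float) -> None:
    """Persist the cooldown; never shorten one that is already in effect."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE upstream_state "
            "SET cooldown_until = MAX(cooldown_until, ?) WHERE id = 1",
            (now + cooldown_seconds,),
        )


def may_probe_upstream(conn: sqlite3.Connection, now: float) -> bool:
    (cooldown_until,) = conn.execute(
        "SELECT cooldown_until FROM upstream_state WHERE id = 1").fetchone()
    return now >= cooldown_until


state = open_state()
record_429(state, now=100.0, cooldown_seconds=60.0)  # 60 s is an arbitrary example
print(may_probe_upstream(state, now=120.0), may_probe_upstream(state, now=160.0))
```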
Metadata-Only Boundary
This MVP stores descriptive metadata from the arXiv API: IDs, titles, abstracts, authors, categories, publication dates, comments, journal references, DOIs, and small provenance fields. It does not cache or serve PDFs, source tarballs, generated full text, or paper HTML.
Non-Goals
- No Recoleta dependency or runtime adapter.
- No PDF, source, or full-text cache.
- No multi-machine distributed limiter. For more than one machine, run one shared broker or add a future shared rate-state backend.
- No OAI-PMH backend yet. The current backend is the legacy search API.