Local-first arXiv metadata broker with shared cache, queue, and durable rate limiting.

Development Status
- 3 - Alpha
Environment
- Console
Framework
- FastAPI
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Programming Language
- Python :: 3
- Python :: 3.13
Topic
- Database
- Internet :: WWW/HTTP :: Indexing/Search

Project description

Huldra

Huldra is a local arXiv metadata broker for one machine. Programs that need arXiv papers can ask Huldra for metadata instead of each calling arXiv directly. Huldra shares a SQLite cache, request queue, durable rate limiter, cooldown state, and upstream lease across those programs.

Huldra is an independent package and CLI. It is not a plugin for another project, and it does not depend on Recoleta.

Install

Install the published package:

pip install huldra-arxiv
huldra --help

For local development:

uv sync --dev
uv run huldra --help

The PyPI package name is huldra-arxiv. The Python package and CLI command are still named huldra.

The default database is:

~/.local/share/huldra/huldra.db

Override it per command with --db PATH or with HULDRA_DB_PATH.
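The override rule can be pictured with a short sketch. It assumes an explicit --db value wins over HULDRA_DB_PATH, which in turn wins over the default; the text above does not state the precedence, and resolve_db_path is a hypothetical helper, not a Huldra function.

```python
import os
from pathlib import Path

# Default location taken from the README text above.
DEFAULT_DB = "~/.local/share/huldra/huldra.db"


def resolve_db_path(cli_db=None, env=None):
    """Pick the database path: CLI flag, then environment, then default.

    The precedence order is an assumption made for illustration.
    """
    env = os.environ if env is None else env
    chosen = cli_db or env.get("HULDRA_DB_PATH") or DEFAULT_DB
    return Path(chosen).expanduser()
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.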

Run Locally

Initialize a store:

uv run huldra store init --db ~/.local/share/huldra/huldra.db

Start the local HTTP API:

uv run huldra daemon --db ~/.local/share/huldra/huldra.db --host 127.0.0.1 --port 8765

Run a foreground worker in a separate terminal:

uv run huldra worker --db ~/.local/share/huldra/huldra.db --poll-interval-seconds 300 --json

Check status:

uv run huldra status --db ~/.local/share/huldra/huldra.db --json

Status includes queue depth, cache totals, durable upstream 429 totals, cooldown state, worker heartbeat, worker next wake, and the last worker error.

The API binds to 127.0.0.1 by default. Do not expose it to a public network without a reverse proxy and authentication.

CLI Query

Submit a query without waiting for the worker:

uv run huldra query \
  --db ~/.local/share/huldra/huldra.db \
  --client-id demo \
  --search-query 'cat:cs.AI AND all:agent' \
  --max-results 50 \
  --json

Read a completed result:

uv run huldra result --db ~/.local/share/huldra/huldra.db --cache-key KEY --json

huldra result is a raw cache inspection command. It reports whether the stored cache entry is readable and returns cached papers when it can. It does not reinterpret the cache for a caller's analysis_ready policy.

Look up one cached paper:

uv run huldra paper --db ~/.local/share/huldra/huldra.db --arxiv-id 2401.00001 --json

Sync one submitted-date UTC day, optionally waiting inline for the worker path. By default this completes one legacy search slice and reports coverage_status="slice" even when arXiv reports that more results exist:

uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --wait \
  --json

Fetch every legacy search page for a bounded window by opting into complete window mode:

uv run huldra sync \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI AND all:agent' \
  --date 2026-05-20 \
  --max-results 60 \
  --mode complete-window \
  --wait \
  --json

Backfill daily submitted-date windows:

uv run huldra backfill \
  --db ~/.local/share/huldra/huldra.db \
  --search-query 'cat:cs.AI' \
  --start-date 2026-05-01 \
  --end-date 2026-05-20 \
  --max-results 60 \
  --json
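To see what a daily backfill covers, the date range can be expanded into one UTC window per day. This is a sketch only: it assumes the end date is inclusive and that each window is half-open (start of day to start of the next day), matching the submitted_start and submitted_end pattern in the Python client example. daily_windows is a hypothetical helper, not part of Huldra.

```python
from datetime import date, datetime, timedelta, timezone


def daily_windows(start: date, end: date):
    """Yield (window_start, window_end) UTC datetimes, one pair per day.

    Assumes `end` is inclusive and windows are half-open.
    """
    day = start
    while day <= end:
        lo = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        yield lo, lo + timedelta(days=1)
        day += timedelta(days=1)


windows = list(daily_windows(date(2026, 5, 1), date(2026, 5, 20)))
print(len(windows))  # 20 windows under the inclusive-end assumption
```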

Run an OAI-PMH harvest for complete or category-scoped metadata sync:

uv run huldra harvest oai \
  --db ~/.local/share/huldra/huldra.db \
  --metadata-prefix arXiv \
  --set cs:cs:AI \
  --mode incremental \
  --json

Python Client

from huldra.client import HuldraClient

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    result = client.ensure_search(
        search_query="cat:cs.AI AND all:agent",
        max_results=50,
        wait=True,
    )
    print(result.status, result.papers_total)

For Recoleta-style pre-syncs, call the maintenance surface instead of shelling out to the CLI:

from datetime import UTC, datetime, timedelta

from huldra.client import HuldraClient
from huldra.models import ArxivRequest, CachePolicy, ReadinessMode

day = datetime(2026, 5, 20, tzinfo=UTC)
request = ArxivRequest(
    client_id="recoleta:embodied_ai",
    search_query="cat:cs.AI",
    submitted_start=day,
    submitted_end=day + timedelta(days=1),
    max_results=60,
    cache_policy=CachePolicy.CACHE_ONLY,
    readiness=ReadinessMode.ANALYSIS_READY,
)

with HuldraClient(base_url="http://127.0.0.1:8765") as client:
    summary = client.sync_windows([request], wait=True, wait_timeout_seconds=30)
    print(summary.completed_windows_total, summary.upstream_requests_total)

Maintenance completion means the raw cache is readable. The per-request serving_status still tells you whether the same cache is currently accepted by the request's readiness mode. For legacy search, check coverage_status, completed_slices_total, pages_total, and pages_completed_total before treating a window as complete.
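A caller-side completeness check for legacy search might look like the sketch below. It works on a plain dict, and both the dict shape and the "complete" value for coverage_status are assumptions; the text above only documents the "slice" value and the field names.

```python
def window_is_complete(summary: dict) -> bool:
    """Return True only when every page of the window has been fetched.

    `summary` is a hypothetical dict holding the per-window fields named
    in the README; the "complete" status value is assumed.
    """
    pages_total = summary.get("pages_total")
    return (
        summary.get("coverage_status") == "complete"
        and pages_total is not None
        and summary.get("pages_completed_total") == pages_total
    )
```

A window that reports coverage_status="slice" fails this check even when its single slice completed, which is the conservative reading of the paragraph above.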

Safe Readiness

Use readiness="analysis_ready" for ingestion paths that must not consume immature submitted-date windows. If a completed window is still inside the maturity lag, Huldra returns:

status="immature"
ready=false
analysis_ready=false
blocked_reason="immature_window"
an empty papers list
cached_papers_total with the number of suppressed cached papers

Use readiness="raw_completed" for exploratory reads that may inspect same-day metadata. Raw reads can return papers from an immature window, but they still report analysis_ready=false, mature=false, and blocked_reason="immature_window".

Set request-level maturity_lag_days=0 only when the caller explicitly wants to disable maturity blocking. This field changes readiness interpretation; it does not change the cache key.
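The two readiness modes can be modeled in a few lines. This is an illustrative sketch, not Huldra's implementation: it assumes a window becomes mature once the current time reaches the window end plus maturity_lag_days, and the shape of the mature response is a guess. The immature fields follow the values listed above.

```python
from datetime import datetime, timedelta, timezone


def interpret(window_end, now, maturity_lag_days, readiness, cached_papers):
    """Apply a readiness mode to an already-completed cached window."""
    # Assumed maturity rule: mature once the lag has fully elapsed.
    mature = now >= window_end + timedelta(days=maturity_lag_days)
    blocked = None if mature else "immature_window"
    if not mature and readiness == "analysis_ready":
        # Suppress papers but report how many were held back.
        return {
            "status": "immature",
            "ready": False,
            "analysis_ready": False,
            "blocked_reason": blocked,
            "papers": [],
            "cached_papers_total": len(cached_papers),
        }
    # raw_completed reads, or any read of a mature window, return papers.
    return {
        "analysis_ready": mature,
        "mature": mature,
        "blocked_reason": blocked,
        "papers": list(cached_papers),
    }
```

Note that maturity_lag_days only enters the interpretation step; nothing in the sketch touches how the cached window is keyed.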

Submitted-date bounds must be UTC minute-aligned. Huldra rejects bounds with seconds or microseconds instead of silently widening or narrowing the window.
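Callers can pre-validate bounds before submitting them. The check below is a minimal sketch of the rule as stated; check_bound is a hypothetical helper, and the error messages are not Huldra's.

```python
from datetime import datetime, timedelta, timezone


def check_bound(value: datetime) -> datetime:
    """Reject bounds that are not timezone-aware UTC or not minute-aligned."""
    if value.tzinfo is None or value.utcoffset() != timedelta(0):
        raise ValueError("bound must be timezone-aware UTC")
    if value.second or value.microsecond:
        raise ValueError("bound must be minute-aligned")
    return value


check_bound(datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc))  # passes
```

Raising instead of rounding mirrors the behavior described above: the caller decides how to adjust the window.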

HTTP API

curl http://127.0.0.1:8765/v1/status

curl -X POST http://127.0.0.1:8765/v1/requests \
  -H 'content-type: application/json' \
  -d '{"client_id":"demo","search_query":"cat:cs.AI","max_results":10}'
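The same POST can be built from Python with only the standard library. The sketch constructs the request without sending it; the endpoint and payload are copied from the curl example, while build_request is a hypothetical helper and the response format is not documented here.

```python
import json
import urllib.request


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST to /v1/requests, mirroring the curl example."""
    return urllib.request.Request(
        url=f"{base_url}/v1/requests",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


req = build_request(
    "http://127.0.0.1:8765",
    {"client_id": "demo", "search_query": "cat:cs.AI", "max_results": 10},
)
# To send it against a running daemon:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       body = json.load(resp)
```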

Rate Limits And 429 Cooldown

Huldra keeps all arXiv legacy API access behind one durable limiter. The default request interval is 5 seconds, which is more conservative than arXiv's 3-second minimum. Only one upstream fetch lease can be held at a time.

When arXiv returns HTTP 429, Huldra persists cooldown_until in SQLite. New requests can still be queued, but workers will not probe upstream again until the cooldown expires.
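The idea of a durable limiter with a persisted cooldown can be sketched with sqlite3. This is not Huldra's schema: the rate_state table and next_allowed column are invented for illustration, only cooldown_until comes from the text above, and the cooldown length is a caller-supplied assumption. The sketch is single-process; a multi-process version would need an explicit BEGIN IMMEDIATE around the read and update.

```python
import sqlite3


def open_state(path=":memory:"):
    """Create the single-row rate-state table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rate_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "next_allowed REAL NOT NULL, cooldown_until REAL NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO rate_state VALUES (1, 0, 0)")
    conn.commit()
    return conn


def try_acquire(conn, now, interval=5.0):
    """Return True and reserve the next slot if an upstream call is allowed."""
    with conn:  # commits on success, rolls back on error
        next_allowed, cooldown_until = conn.execute(
            "SELECT next_allowed, cooldown_until FROM rate_state WHERE id = 1"
        ).fetchone()
        if now < max(next_allowed, cooldown_until):
            return False  # too soon, or still cooling down after a 429
        conn.execute(
            "UPDATE rate_state SET next_allowed = ? WHERE id = 1",
            (now + interval,),
        )
        return True


def record_429(conn, now, cooldown_seconds):
    """Persist a cooldown so later runs also stay away from upstream."""
    with conn:
        conn.execute(
            "UPDATE rate_state SET cooldown_until = ? WHERE id = 1",
            (now + cooldown_seconds,),
        )
```

Because the state lives in the database file rather than in process memory, a restarted worker sees the same cooldown, which is the property the section describes.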

OAI-PMH Harvesting

The OAI-PMH surface uses https://oaipmh.arxiv.org/oai by default and stores harvest jobs, page state, watermarks, raw OAI records, deleted headers, and normalized paper metadata. Incremental harvests use the last successful server response date or datestamp watermark unless --from is provided explicitly. Watermarks advance only after all pages in the harvest succeed. If a harvest stops after receiving a resumption token, rerun the same harvest and Huldra will continue from the saved token. To continue from a specific token, pass --resumption-token.
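The resume and watermark rules can be sketched as a loop over pages. Everything here is illustrative: run_harvest, the state dict, and the fetch_page callable are hypothetical, and using the first page's response date as the watermark is an assumption, since the text above does not say which response date is stored.

```python
def run_harvest(fetch_page, state):
    """Harvest pages until no resumption token remains.

    fetch_page(token) returns (records, next_token, response_date) and
    raises on failure. `state` is a dict that survives between runs.
    """
    token = state.get("resumption_token")  # None starts a fresh harvest
    while True:
        records, next_token, response_date = fetch_page(token)
        state.setdefault("records", []).extend(records)
        # Remember the first response date, but do not publish it yet.
        state.setdefault("pending_watermark", response_date)
        # Save the token so a rerun continues from this point.
        state["resumption_token"] = next_token
        if next_token is None:
            break
        token = next_token
    # Only now, after every page succeeded, does the watermark advance.
    state["watermark"] = state.pop("pending_watermark")
    return state
```

If fetch_page raises midway, the saved token stays in state and the watermark is untouched, so calling run_harvest again with the same state picks up where the failed run stopped.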

Use legacy search for request-sized slices and complete-window maintenance. Use OAI-PMH for full mirrors, category-scoped mirrors, and datestamp-based incremental sync.

Metadata-Only Boundary

This package stores descriptive metadata from arXiv: IDs, titles, abstracts, authors, categories, publication dates, comments, journal references, DOIs, OAI identifiers, OAI datestamps, set specs, license fields, deleted-record state, and raw metadata needed for reprocessing. It does not cache or serve PDFs, source tarballs, generated full text, or paper HTML.

Non-Goals

No Recoleta dependency or runtime adapter.
No PDF, source, or full-text cache.
No multi-machine distributed limiter. For more than one machine, run one shared broker or add a future shared rate-state backend.

Release history

0.2.0: May 29, 2026

0.1.0: May 22, 2026

Download files

Source Distribution

huldra_arxiv-0.2.0.tar.gz (137.0 kB)

Uploaded May 29, 2026 Source

Built Distribution

huldra_arxiv-0.2.0-py3-none-any.whl (56.0 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file huldra_arxiv-0.2.0.tar.gz.

File metadata

Download URL: huldra_arxiv-0.2.0.tar.gz
Upload date: May 29, 2026
Size: 137.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.16

File hashes

Hashes for huldra_arxiv-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a5cf77b1f7320c91a081604165de9dbc074bee4a1121388cad72252a2779ec69`
MD5	`f49e8666cb1c0261edf31ff03b48af14`
BLAKE2b-256	`5eb9ac8fd32ee6c0c77e1d66ab4da883667d11ac44f6eabfda9524cfcf618b2c`

File details

Details for the file huldra_arxiv-0.2.0-py3-none-any.whl.

File metadata

Download URL: huldra_arxiv-0.2.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 56.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.16

File hashes

Hashes for huldra_arxiv-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`54e8363c0937f691f04f71620bfb64116ab630eea544a0b77482b3c27276d4da`
MD5	`811aa2efc2acba0a96f66bd775b87e16`
BLAKE2b-256	`409cefff54dc677812c5acecfdebd97ee1f40994b607ba9c9d286c383dc6007d`
