OpenAlex S3 processor

pyAlexS3

OpenAlex S3 → DuckDB loader with nice progress bars (powered by rich). It lists, filters, downloads (in parallel), and loads OpenAlex NDJSON dumps into DuckDB, either all at once, in batches, or lazily as an iterator.

Features

🚀 Parallel S3 downloads with a live progress bar
🦆 Zero-setup DuckDB loading via read_ndjson_auto(...)
🧩 Three loading modes:
- load_table: one-shot into a DuckDB table
- batch_load_table: append in batches
- lazy_load: yield a DuckDB relation per batch (no table needed)
🎯 Filter by date range (YYYY-MM-DD) and by part numbers
🔎 Optional SQL-style WHERE predicate
💾 Persistent or in-memory DuckDB

Installation

pip install pyalexs3

or with uv

uv add pyalexs3

Python 3.10+ is required.

Quick start

from pyalexs3.core import OpenAlexS3Processor

p = OpenAlexS3Processor(n_workers=4)
p.load_table(
    obj_type="works",
    start_date="2025-07-05",
    end_date="2025-07-20",
    download_dir="./.cache/oa",
    cols=["id", "title"]
)

table = p.get_table("works")
table.limit(5).show()

Filter with WHERE clause

p.load_table(
    obj_type="works",
    start_date="2025-07-05",
    end_date="2025-07-20",
    download_dir="./.cache/oa",
    cols=["id", "title", "type"],
    where_clause="WHERE title IS NOT NULL AND type='article'"
)
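
The example above passes the predicate together with its leading WHERE keyword. As a rough illustration of why that matters, the sketch below assembles a SELECT statement by appending the clause after the FROM source. The helper name build_select and the idea that the query is built by string concatenation are assumptions made for illustration; the README does not show how pyalexs3 composes its SQL.

```python
from __future__ import annotations


def build_select(
    source: str,
    cols: list[str] | None = None,
    where_clause: str | None = None,
    limit: int | None = None,
) -> str:
    """Hypothetical helper: compose a SELECT over `source`.

    `where_clause` is expected to include the WHERE keyword itself,
    matching the usage shown in this README.
    """
    projection = ", ".join(cols) if cols else "*"
    sql = f"SELECT {projection} FROM {source}"
    if where_clause:
        sql += f" {where_clause.strip()}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


query = build_select(
    "read_ndjson_auto('part_000.gz')",
    cols=["id", "title", "type"],
    where_clause="WHERE title IS NOT NULL AND type='article'",
)
print(query)
```

Because the clause is spliced in as text in this sketch, it should only ever come from trusted code, not from user input.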

Batching and lazy load

Append in batches

p.batch_load_table(
    obj_type="works",
    batch_sz=5,  # ~number of S3 objects per batch
    start_date="2025-07-01",
    end_date="2025-07-02",
    cols=["id", "title"],
    download_dir="./.cache/oa",
)

# Everything lands in the same DuckDB table:
p.get_table("works").count("*").show()
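
To make the batching idea concrete, here is a minimal standard-library sketch of splitting a list of S3 object keys into groups of batch_sz, where the first group would create the table and later groups would append to it. The chunked helper and the example keys are illustrative assumptions, not pyalexs3 internals; the keys only imitate the updated_date=.../part_NNN.gz layout of the OpenAlex snapshot.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator


def chunked(keys: Iterable[str], batch_sz: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `batch_sz` keys."""
    if batch_sz < 1:
        raise ValueError("batch_sz must be >= 1")
    it = iter(keys)
    while batch := list(islice(it, batch_sz)):
        yield batch


# Hypothetical object keys for one day of the works snapshot.
keys = [f"data/works/updated_date=2025-07-01/part_{i:03d}.gz" for i in range(12)]

batches = list(chunked(keys, 5))
# First batch creates the table, every later batch appends to it.
actions = ["CREATE" if i == 0 else "INSERT" for i in range(len(batches))]

for action, batch in zip(actions, batches):
    print(action, len(batch), "objects")
```

The last batch is simply shorter when the number of objects is not a multiple of batch_sz, which is one reason the batch size is described as approximate.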

Iterate lazily (no table required)

titles = []
for rel in p.lazy_load(
    obj_type="works",
    batch_sz=5,
    start_date="2025-07-01",
    end_date="2025-07-02",
    cols=["id", "title"],
    download_dir="./.cache/oa",
):
    df = rel.df()  # materialize this batch
    titles.extend(df["title"].tolist())
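
The loop above materializes each batch with rel.df() before moving on. The sketch below shows, with the standard library only, why that ordering matters when temporary files are cleaned up after each yield: a generator writes a batch to a temp directory, yields it, and deletes it as soon as the consumer asks for the next batch. The function lazy_batches and its file names are hypothetical stand-ins, not the library's implementation.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path
from typing import Iterator


def lazy_batches(payloads: list[list[str]]) -> Iterator[Path]:
    """Yield one temp directory per batch; remove it after the yield."""
    for batch in payloads:
        workdir = Path(tempfile.mkdtemp(prefix="oa_batch_"))
        try:
            for i, text in enumerate(batch):
                (workdir / f"part_{i:03d}.txt").write_text(text)
            yield workdir  # the consumer works on the batch here
        finally:
            # Runs when the consumer resumes the generator (or closes it).
            shutil.rmtree(workdir, ignore_errors=True)


seen: list[Path] = []
sizes: list[int] = []
for workdir in lazy_batches([["a", "b"], ["c"]]):
    seen.append(workdir)
    # Anything needed from the batch must be read now, before the next step.
    sizes.append(len(list(workdir.iterdir())))

print(sizes, [d.exists() for d in seen])
```

Holding on to a batch handle past the next iteration would point at files that no longer exist, so copy out what you need inside the loop body.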

API

OpenAlexS3Processor(n_workers: int = 4, persist_path: str | None = None)

n_workers: number of threads for downloads.
persist_path: if set, uses a persistent DuckDB database file at this path; otherwise an in-memory DB.

load_table(...) -> None

Downloads all matching files and creates/appends a DuckDB table named after obj_type.

  Args:
  - `obj_type`: one of `{"works","authors","sources","institutions","topics","keywords","publishers","funders","geo"}`
  - `cols`: `list[str]` of columns to select (default `*`)
  - `limit`: `int | None` (applied after read)
  - `start_date`, `end_date`: ISO "YYYY-MM-DD" strings (inclusive). If `start_date` is None, it’s inferred from S3; if `end_date` is None, defaults to today.
  - `parts`: `list[int] | None` — specific part numbers (e.g., [0,2]). None = all.
  - `download_dir`: temporary folder for gz files (deleted after load)
  - `where_clause`: SQL predicate like "WHERE title IS NOT NULL"
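
As an illustration of the start_date, end_date, and parts arguments, the following sketch filters a list of object keys by an inclusive date range and by part number. It assumes keys shaped like updated_date=YYYY-MM-DD/part_NNN.gz, as in the OpenAlex snapshot layout; filter_keys is a hypothetical helper, and using date.min as the lower bound is a stand-in for "inferred from S3".

```python
from __future__ import annotations

import re
from datetime import date

KEY_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/part_(\d+)\.gz$")


def filter_keys(
    keys: list[str],
    start_date: str | None = None,
    end_date: str | None = None,
    parts: list[int] | None = None,
) -> list[str]:
    """Keep keys whose date is in [start_date, end_date] and whose part is wanted."""
    lo = date.fromisoformat(start_date) if start_date else date.min
    hi = date.fromisoformat(end_date) if end_date else date.today()
    wanted = set(parts) if parts is not None else None

    selected = []
    for key in keys:
        match = KEY_RE.search(key)
        if match is None:
            continue  # not a data part (for example a manifest file)
        day = date.fromisoformat(match.group(1))
        part = int(match.group(2))
        if lo <= day <= hi and (wanted is None or part in wanted):
            selected.append(key)
    return selected


keys = [
    "data/works/updated_date=2025-07-04/part_000.gz",
    "data/works/updated_date=2025-07-05/part_000.gz",
    "data/works/updated_date=2025-07-05/part_001.gz",
    "data/works/updated_date=2025-07-20/part_002.gz",
    "data/works/updated_date=2025-07-21/part_000.gz",
    "data/works/manifest",
]

in_range = filter_keys(keys, "2025-07-05", "2025-07-20")
only_parts = filter_keys(keys, "2025-07-05", "2025-07-20", parts=[0, 2])
print(len(in_range), len(only_parts))
```

Both bounds are inclusive here, matching the description of start_date and end_date above.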

batch_load_table(...) -> None

Same args as load_table, plus:
- batch_sz: approx. number of S3 objects per batch.

Each batch is read and inserted (or CREATE on the first), then temp files are deleted.

lazy_load(...) -> Iterator[duckdb.DuckDBPyRelation]

Yields one Relation per batch. You can .show(), .df(), or run more SQL. Temp files are removed after each yield.

get_table(obj_type: str, cols: list[str] | None = None) -> duckdb.DuckDBPyRelation

Convenience accessor to query the created table.

s3_obj_types -> list[str]

Returns supported OpenAlex object types.

Behavior & notes

Progress bars: Per-file totals (from head_object) with per-chunk callbacks.
Threading: Downloads via ThreadPoolExecutor; exceptions bubble up when futures complete.
DuckDB: Installs/loads httpfs automatically; sets PRAGMA threads to n_workers.
Cleanup: download_dir is removed at the end of load_table / each batch in batch_load_table / after each yield in lazy_load.
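
The threading and progress notes can be illustrated without touching S3. In this sketch a fake download function reports bytes to a thread-safe callback in chunks, and failures surface in the caller when the future's result is requested. ByteCounter, fake_download, and download_all are hypothetical names; a real run would obtain the total from head_object and pass the callback to the S3 transfer call instead.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class ByteCounter:
    """Thread-safe per-chunk callback: adds up bytes reported by workers."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.seen = 0
        self._lock = threading.Lock()

    def __call__(self, n_bytes: int) -> None:
        with self._lock:
            self.seen += n_bytes


def fake_download(key: str, size: int, progress: ByteCounter, chunk: int = 1024) -> str:
    """Stand-in for an S3 download that reports progress chunk by chunk."""
    if key.endswith("missing.gz"):
        raise FileNotFoundError(key)
    sent = 0
    while sent < size:
        step = min(chunk, size - sent)
        progress(step)
        sent += step
    return key


def download_all(objects: dict[str, int], n_workers: int = 4) -> tuple[list[str], ByteCounter]:
    progress = ByteCounter(total=sum(objects.values()))
    done: list[str] = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(fake_download, k, s, progress) for k, s in objects.items()]
        for fut in as_completed(futures):
            done.append(fut.result())  # a worker's exception is re-raised here
    return sorted(done), progress


done, progress = download_all({"a.gz": 2500, "b.gz": 1024})
print(done, progress.seen, progress.total)
```

The lock matters because several worker threads call the same callback concurrently; without it, the running byte count could be updated inconsistently.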

Testing

Dev dependencies include pytest and moto[s3] to mock S3.

# with uv
uv sync --extra dev
uv run pytest -q

Example end-to-end tests:

- Mock S3 with moto and upload gzipped NDJSON to openalex bucket keys.
- Patch WORKS_SCHEMA to a minimal schema for fast runs.
- Run load_table, batch_load_table, and lazy_load, then assert results.
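
For the fixture step, a small standard-library sketch of producing gzipped NDJSON bytes is shown below, together with a reader that checks the round trip. The helper names and the sample records are made up for illustration; uploading the bytes to a mocked bucket with moto is left out so the snippet runs without extra dependencies.

```python
from __future__ import annotations

import gzip
import io
import json


def make_gz_ndjson(records: list[dict]) -> bytes:
    """Serialize records as newline-delimited JSON and gzip the result."""
    body = "\n".join(json.dumps(r) for r in records) + "\n"
    return gzip.compress(body.encode("utf-8"))


def read_gz_ndjson(blob: bytes) -> list[dict]:
    """Inverse of make_gz_ndjson, used to verify the fixture."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical minimal records with only the columns a test might select.
records = [{"id": "W1", "title": "First"}, {"id": "W2", "title": None}]
blob = make_gz_ndjson(records)
print(len(read_gz_ndjson(blob)), "records round-tripped")
```

Including a record with a null title gives a test something to exercise a predicate such as "WHERE title IS NOT NULL" against.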

Development

Source layout: src/pyalexs3/
Typed package marker: src/pyalexs3/py.typed

License

MIT © EurekAI

Citation

If you use this package for research purposes, please cite it with this BibTeX entry:

@misc{pyalexs32025,
	author = {Adityam Ghosh},
	title = {pyalexs3},
	howpublished = {\url{https://github.com/EurekAI-Org/pyalexs3}},
	year = {2025},
	note = {[Accessed 09-10-2025]},
}

Release history

- 0.1.8: May 23, 2026
- 0.1.7: Mar 21, 2026
- 0.1.6: Mar 15, 2026
- 0.1.5: Feb 22, 2026
- 0.1.4: Nov 15, 2025
- 0.1.3: Nov 12, 2025
- 0.1.2: Oct 9, 2025 (this version)
- 0.1.1: Oct 9, 2025
- 0.1.0: Oct 9, 2025
