OpenAlex S3 processor

pyAlexS3

OpenAlex S3 → DuckDB loader with optional rich progress bars.

Reads OpenAlex NDJSON dumps directly from S3 via DuckDB's httpfs extension, so no local download is required.

Features

🚀 Direct S3 reads via DuckDB httpfs — no local downloads
🦆 Zero-setup DuckDB loading via read_json_auto(...)
🎯 Filter by date range (YYYY-MM-DD) and by part numbers
🔁 Resume from a specific date and part after a failure
🔎 Optional SQL-style WHERE predicate
📊 Optional rich progress bar showing batch progress

Installation

pip install pyalexs3

or with uv:

uv add pyalexs3

Python 3.10+ is required.

Quick Start

from pyalexs3.core import OpenAlexS3Processor

p = OpenAlexS3Processor(n_workers=4)

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
):
    df = rel.df()
    print(df.head())

Filter with WHERE clause

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
    where_clause="title IS NOT NULL AND language='en'",
):
    df = rel.df()
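
The options above (column selection, an optional predicate, an optional limit) can be pictured as a single DuckDB query over one batch of files. The sketch below is only an illustration: `build_query` is a hypothetical helper, not part of pyalexs3, and the S3 path assumes the OpenAlex snapshot layout `data/<type>/updated_date=YYYY-MM-DD/part_NNN.gz`. How the package composes its SQL internally is not documented here.

```python
from __future__ import annotations


def build_query(
    files: list[str],
    columns: list[str] | None = None,
    where_clause: str | None = None,
    limit: int | None = None,
) -> str:
    """Compose a DuckDB query over a batch of NDJSON files (hypothetical helper)."""
    select_list = ", ".join(columns) if columns else "*"
    # DuckDB's read_json_auto accepts a list literal of paths; escape single quotes.
    file_list = ", ".join("'" + f.replace("'", "''") + "'" for f in files)
    sql = f"SELECT {select_list} FROM read_json_auto([{file_list}])"
    if where_clause:
        # The predicate is supplied without the WHERE keyword.
        sql += f" WHERE {where_clause}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(
    build_query(
        ["s3://openalex/data/works/updated_date=2025-01-01/part_000.gz"],
        columns=["id", "title", "publication_year"],
        where_clause="title IS NOT NULL AND language='en'",
    )
)
```

In a design like this the predicate is interpolated verbatim into the SQL text, so it should only ever come from trusted code, never from end-user input.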

Resume After Failure

If your pipeline fails midway, resume from a specific date and part number:

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    resume_from="2025-01-15/5",  # skip everything before 2025-01-15 part 5
):
    df = rel.df()
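
To make the skip semantics concrete, here is a minimal sketch of how a `YYYY-MM-DD/<part>` token could be compared against S3 keys. The helper names and the key layout (`.../updated_date=YYYY-MM-DD/part_NNN.gz`) are assumptions made for illustration; the package's own implementation may differ.

```python
import re

# Assumed key layout: .../updated_date=YYYY-MM-DD/part_NNN.gz
KEY_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/part_(\d+)\.gz$")


def parse_key(key: str) -> tuple[str, int]:
    """Extract (date, part number) from an S3 key."""
    match = KEY_RE.search(key)
    if match is None:
        raise ValueError(f"unrecognised key: {key}")
    return match.group(1), int(match.group(2))


def parse_resume_from(token: str) -> tuple[str, int]:
    """Split a 'YYYY-MM-DD/<part>' token into (date, part number)."""
    date, _, part = token.partition("/")
    return date, int(part)


def skip_until(keys: list[str], resume_from: str) -> list[str]:
    """Keep only keys at or after the resume point."""
    start = parse_resume_from(resume_from)
    # ISO dates sort lexicographically, so tuple comparison is chronological.
    return [key for key in keys if parse_key(key) >= start]
```

With this comparison, `2025-01-15/5` keeps part 5 of 2025-01-15 and everything after it, and drops every earlier date and every earlier part of that date.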

Load Specific Parts Only

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-01-01",
    parts=[0, 1, 2],  # only load part_000.gz, part_001.gz, part_002.gz
):
    df = rel.df()
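
The date and part filters can be thought of as a simple selection over the list of S3 keys. The following sketch assumes the key layout `.../updated_date=YYYY-MM-DD/part_NNN.gz` with zero-padded part numbers, as in the comment above; `part_name` and `select_keys` are hypothetical helpers, not functions exported by pyalexs3.

```python
from __future__ import annotations

import posixpath
from datetime import date


def part_name(number: int) -> str:
    """Map a part number to its file name, e.g. 1 -> 'part_001.gz'."""
    return f"part_{number:03d}.gz"


def select_keys(
    keys: list[str],
    start_date: str,
    end_date: str,
    parts: list[int] | None = None,
) -> list[str]:
    """Keep keys inside the inclusive date range and, optionally, the given parts."""
    lo, hi = date.fromisoformat(start_date), date.fromisoformat(end_date)
    wanted = {part_name(n) for n in parts} if parts is not None else None
    selected = []
    for key in keys:
        folder, name = posixpath.split(key)
        if "updated_date=" not in folder:
            continue  # e.g. a manifest file that sits next to the date folders
        day = date.fromisoformat(folder.rsplit("updated_date=", 1)[1])
        if lo <= day <= hi and (wanted is None or name in wanted):
            selected.append(key)
    return selected
```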

Show Progress

p = OpenAlexS3Processor(n_workers=4, show_progress=True)

for file_batch, rel in p.lazy_load(obj_type="works"):
    df = rel.df()

Track Which Files Were Processed

Each lazy_load iteration yields both the file batch and the relation:

for file_batch, rel in p.lazy_load(obj_type="works"):
    print(f"Processing: {file_batch}")  # list of S3 keys in this batch
    df = rel.df()
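
One use for the yielded file batch is to record where a failed run should restart. The sketch below derives a `resume_from` token from the first key of a batch. It assumes keys end in `updated_date=YYYY-MM-DD/part_NNN.gz` and that the first key in a batch is the earliest one; neither is guaranteed by the description above, and `resume_token` is a hypothetical helper.

```python
import re

# Assumed key layout: .../updated_date=YYYY-MM-DD/part_NNN.gz
KEY_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/part_(\d+)\.gz$")


def resume_token(file_batch: list[str]) -> str:
    """Build a 'YYYY-MM-DD/<part>' token from the first key of a batch."""
    match = KEY_RE.search(file_batch[0])
    if match is None:
        raise ValueError(f"unrecognised key: {file_batch[0]}")
    return f"{match.group(1)}/{int(match.group(2))}"


# Possible use inside the loop (requires pyalexs3 and network access):
#
# for file_batch, rel in p.lazy_load(obj_type="works"):
#     try:
#         df = rel.df()
#     except Exception:
#         print("restart with resume_from =", resume_token(file_batch))
#         raise
```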

API

`OpenAlexS3Processor(n_workers=4, **kwargs)`

Parameter	Type	Default	Description
`n_workers`	`int`	`4`	DuckDB thread count
`show_progress`	`bool`	`False`	Show rich progress bar
`pragma_show_progress`	`bool`	`False`	Enable DuckDB internal progress bar

`lazy_load(...) -> Generator[tuple[list[str], DuckDBPyRelation], None, None]`

Parameter	Type	Default	Description
`obj_type`	`str`	required	OpenAlex object type, e.g. `works`, `authors`
`columns`	`list[str] | None`	`None`	Columns to select. `None` = all
`limit`	`int | None`	`None`	Max records per batch
`start_date`	`str | None`	`2016-06-24`	Start of date range, `YYYY-MM-DD` (inclusive)
`end_date`	`str | None`	today	End of date range, `YYYY-MM-DD` (inclusive)
`parts`	`list[int] | None`	`None`	Specific part numbers to load. `None` = all
`where_clause`	`str | None`	`None`	SQL filter. Do not include the `WHERE` keyword
`resume_from`	`str | None`	`None`	Resume from `YYYY-MM-DD/<part>`, e.g. `2025-01-15/5`
`batch_size`	`int`	`10`	Number of S3 files per batch

Yields tuple[list[str], duckdb.DuckDBPyRelation]:

list[str] — S3 keys in this batch (useful for progress tracking)
DuckDBPyRelation — query the batch with .df(), .arrow(), .fetchall()
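
The effect of `batch_size` can be illustrated with a small chunking generator: a list of S3 keys is split into consecutive groups, and each group would back one yielded relation. This is a sketch of the idea only, and `batched` is a hypothetical helper rather than the package's code.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batched(keys: Iterable[str], batch_size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(keys)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


keys = [f"part_{n:03d}.gz" for n in range(25)]
print([len(batch) for batch in batched(keys, batch_size=10)])
```

A smaller batch means more, shorter queries and finer-grained resume points; a larger batch means fewer queries that each scan more files.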

Supported Object Types

works, authors, sources, institutions, topics, keywords, publishers, funders, concepts

Behavior & Notes

No downloads — data is read directly from S3 via DuckDB httpfs. No temp files, no cleanup needed.
DuckDB — installs and loads httpfs automatically on init. Sets PRAGMA threads to n_workers.
Object cache — PRAGMA enable_object_cache=true is set by default for repeated queries on the same files.
S3 auth — OpenAlex S3 is public. No credentials needed.
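
The initialisation described above corresponds to a handful of DuckDB statements. The sketch below only builds the statement list so that it runs without DuckDB installed; `setup_statements` is a hypothetical helper, and the exact statements and order used by the package are an assumption.

```python
def setup_statements(n_workers: int = 4, object_cache: bool = True) -> list[str]:
    """Return the DuckDB statements matching the notes above (illustrative only)."""
    statements = [
        "INSTALL httpfs",  # fetch the extension if it is not already installed
        "LOAD httpfs",  # enable s3:// and https:// paths
        f"PRAGMA threads={int(n_workers)}",
    ]
    if object_cache:
        statements.append("PRAGMA enable_object_cache=true")
    return statements


for statement in setup_statements(n_workers=4):
    print(statement)

# With the duckdb package installed, the statements could be applied like this:
#
# import duckdb
# con = duckdb.connect()
# for statement in setup_statements(n_workers=4):
#     con.execute(statement)
```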

Testing

Dev dependencies include pytest.

uv sync --extra dev
uv run pytest -q

Tests use unittest.mock to stub the S3 client, so the file listing and filtering logic is exercised without hitting real S3.
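
As an illustration of this style of test, the sketch below stubs a boto3-style client with `MagicMock` and checks a listing function against canned pages. The source does not say which S3 client the package uses or what its listing function is called, so `list_keys`, the paginator interface, and the bucket and prefix values are all assumptions.

```python
from unittest.mock import MagicMock


def list_keys(client, bucket: str, prefix: str) -> list[str]:
    """Collect every object key under a prefix (hypothetical function under test)."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def make_fake_client(pages: list[dict]) -> MagicMock:
    """Build a stand-in client whose paginator returns the given pages."""
    client = MagicMock()
    client.get_paginator.return_value.paginate.return_value = pages
    return client


def test_list_keys_reads_every_page():
    client = make_fake_client(
        [
            {"Contents": [{"Key": "data/works/updated_date=2025-01-01/part_000.gz"}]},
            {"Contents": [{"Key": "data/works/updated_date=2025-01-02/part_000.gz"}]},
            {},  # a page without "Contents" should contribute nothing
        ]
    )
    keys = list_keys(client, "openalex", "data/works/")
    assert keys == [
        "data/works/updated_date=2025-01-01/part_000.gz",
        "data/works/updated_date=2025-01-02/part_000.gz",
    ]
    client.get_paginator.assert_called_once_with("list_objects_v2")
```

Because the client is passed in as an argument, the test needs no patching of module globals and never opens a network connection.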

Development

Source layout: src/pyalexs3/
Typed package marker: src/pyalexs3/py.typed

License

MIT © EurekAI

Citation

If you use this package in research, please cite it with the following BibTeX entry:

@misc{pyalexs32025,
    author = {Adityam Ghosh},
    title = {pyalexs3},
    howpublished = {\url{https://github.com/EurekAI-Org/pyalexs3}},
    year = {2025},
    note = {[Accessed 09-10-2025]},
}

Release history

Version	Released
0.1.8 (this version)	May 23, 2026
0.1.7	Mar 21, 2026
0.1.6	Mar 15, 2026
0.1.5	Feb 22, 2026
0.1.4	Nov 15, 2025
0.1.3	Nov 12, 2025
0.1.2	Oct 9, 2025
0.1.1	Oct 9, 2025
0.1.0	Oct 9, 2025
