OpenAlex S3 processor
pyAlexS3
OpenAlex S3 → DuckDB loader with optional rich progress bars.
Reads OpenAlex NDJSON dumps directly from S3 via DuckDB's httpfs extension — no downloading required.
Features
- 🚀 Direct S3 reads via DuckDB `httpfs`, no local downloads
- 🦆 Zero-setup DuckDB loading via `read_json_auto(...)`
- 🎯 Filter by date range (`YYYY-MM-DD`) and by part numbers
- 🔁 Resume from a specific date and part after a failure
- 🔎 Optional SQL-style `WHERE` predicate
- 📊 Optional `rich` progress bar showing batch progress
Installation
pip install pyalexs3
or with uv:
uv add pyalexs3
Python 3.10+ is required.
Quick Start
from pyalexs3.core import OpenAlexS3Processor

p = OpenAlexS3Processor(n_workers=4)

for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
):
    df = rel.df()
    print(df.head())
Filter with WHERE clause
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    columns=["id", "title", "publication_year"],
    where_clause="title IS NOT NULL AND language='en'",
):
    df = rel.df()
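
To make the roles of `columns` and `where_clause` concrete, here is a minimal sketch of how such options could be assembled into a single DuckDB query over a batch of files. It is not pyalexs3's actual implementation: `build_query` and the example paths are hypothetical, and only `read_json_auto` comes from the description above.

```python
def build_query(files, columns=None, where_clause=None, limit=None):
    """Assemble a SELECT over a batch of NDJSON files (hypothetical helper)."""
    # None means "all columns", mirroring the documented default.
    select = ", ".join(columns) if columns else "*"
    # read_json_auto accepts a list of paths; single quotes are doubled to escape them.
    file_list = ", ".join("'" + f.replace("'", "''") + "'" for f in files)
    sql = f"SELECT {select} FROM read_json_auto([{file_list}])"
    if where_clause:
        # The predicate is passed without the WHERE keyword, which is added here.
        sql += f" WHERE {where_clause}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(build_query(["s3://bucket/a.gz"], ["id", "title"], "title IS NOT NULL"))
```

In a design like this the predicate is interpolated as raw SQL, so it should only ever come from trusted input.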
Resume After Failure
If your pipeline fails midway, resume from a specific date and part number:
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-03-01",
    resume_from="2025-01-15/5",  # skip everything before 2025-01-15 part 5
):
    df = rel.df()
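
The resume semantics can be pictured as an ordered comparison on `(date, part)` pairs. The sketch below is an illustration under that assumption, not the library's code; `parse_resume_from` and `filter_resumed` are hypothetical names.

```python
from datetime import date


def parse_resume_from(value):
    """Split 'YYYY-MM-DD/<part>' into a (date, int) pair."""
    day, part = value.split("/")
    return date.fromisoformat(day), int(part)


def filter_resumed(entries, resume_from):
    """Keep (date, part) entries at or after the resume point."""
    point = parse_resume_from(resume_from)
    # Tuples compare element by element: first by date, then by part number.
    return [entry for entry in entries if entry >= point]


entries = [
    (date(2025, 1, 14), 9),
    (date(2025, 1, 15), 4),
    (date(2025, 1, 15), 5),
    (date(2025, 1, 16), 0),
]
print(filter_resumed(entries, "2025-01-15/5"))
```

The resume point itself is kept, which matches the comment above: everything before 2025-01-15 part 5 is skipped, and part 5 is processed again.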
Load Specific Parts Only
for file_batch, rel in p.lazy_load(
    obj_type="works",
    start_date="2025-01-01",
    end_date="2025-01-01",
    parts=[0, 1, 2],  # only load part_000.gz, part_001.gz, part_002.gz
):
    df = rel.df()
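
As a rough illustration of how part numbers relate to file names, the sketch below extracts the number from a `part_NNN.gz` name and keeps only the requested parts. The key layout (`updated_date=.../part_NNN.gz`) and the helper names are assumptions made for the example, not something this package documents.

```python
import re

# Matches the trailing "part_000.gz" style file name.
PART_RE = re.compile(r"part_(\d+)\.gz$")


def part_number(key):
    """Return the integer part number of a key, or None if it has none."""
    match = PART_RE.search(key)
    return int(match.group(1)) if match else None


def select_parts(keys, parts=None):
    """Keep keys whose part number is in `parts`; None keeps everything."""
    if parts is None:
        return list(keys)
    wanted = set(parts)
    return [key for key in keys if part_number(key) in wanted]


keys = [f"data/works/updated_date=2025-01-01/part_{n:03d}.gz" for n in range(4)]
print(select_parts(keys, parts=[0, 1, 2]))
```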
Show Progress
p = OpenAlexS3Processor(n_workers=4, show_progress=True)
for file_batch, rel in p.lazy_load(obj_type="works"):
    df = rel.df()
Track Which Files Were Processed
Each lazy_load iteration yields both the file batch and the relation:
for file_batch, rel in p.lazy_load(obj_type="works"):
    print(f"Processing: {file_batch}")  # list of S3 keys in this batch
    df = rel.df()
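
One possible use of the yielded file batch is to derive a `resume_from` value when a batch fails. The sketch below assumes keys end in `updated_date=YYYY-MM-DD/part_NNN.gz`; that layout and the helper `resume_point_for` are assumptions for illustration and are not part of the package.

```python
import re

# Assumed key layout: ".../updated_date=2025-01-15/part_005.gz"
KEY_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})/part_(\d+)\.gz$")


def resume_point_for(file_batch):
    """Return 'YYYY-MM-DD/<part>' for the earliest file in a batch."""
    points = []
    for key in file_batch:
        match = KEY_RE.search(key)
        if match is None:
            raise ValueError(f"unrecognized key layout: {key}")
        points.append((match.group(1), int(match.group(2))))
    # ISO dates sort correctly as strings, so min() finds the earliest file.
    day, part = min(points)
    return f"{day}/{part}"


batch = [
    "data/works/updated_date=2025-01-15/part_006.gz",
    "data/works/updated_date=2025-01-15/part_005.gz",
]
print(resume_point_for(batch))
```

Wrapping the loop body in `try`/`except` and logging `resume_point_for(file_batch)` before re-raising would give a value to pass as `resume_from` on the next run.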
API
OpenAlexS3Processor(n_workers=4, **kwargs)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_workers` | `int` | `4` | DuckDB thread count |
| `show_progress` | `bool` | `False` | Show rich progress bar |
| `pragma_show_progress` | `bool` | `False` | Enable DuckDB internal progress bar |
lazy_load(...) -> Generator[tuple[list[str], DuckDBPyRelation], None, None]
| Parameter | Type | Default | Description |
|---|---|---|---|
| `obj_type` | `str` | required | OpenAlex object type e.g. works, authors |
| `columns` | `list[str] \| None` | `None` | Columns to select. None = all |
| `limit` | `int \| None` | `None` | Max records per batch |
| `start_date` | `str \| None` | `2016-06-24` | Start of date range YYYY-mm-dd (inclusive) |
| `end_date` | `str \| None` | today | End of date range YYYY-mm-dd (inclusive) |
| `parts` | `list[int] \| None` | `None` | Specific part numbers to load. None = all |
| `where_clause` | `str \| None` | `None` | SQL filter. Do not include WHERE keyword |
| `resume_from` | `str \| None` | `None` | Resume from YYYY-mm-dd/<part> e.g. 2025-01-15/5 |
| `batch_size` | `int` | `10` | Number of S3 files per batch |
Yields tuple[list[str], duckdb.DuckDBPyRelation]:
- `list[str]`: S3 keys in this batch (useful for progress tracking)
- `DuckDBPyRelation`: query the batch with `.df()`, `.arrow()`, `.fetchall()`
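
The effect of `batch_size` can be sketched as plain chunking of the file list: each chunk would back one yielded relation. This is an illustration of the idea only; `batched` is a hypothetical helper, and the library's internal grouping may differ.

```python
from itertools import islice


def batched(items, batch_size=10):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


files = [f"part_{n:03d}.gz" for n in range(25)]
print([len(chunk) for chunk in batched(files, batch_size=10)])
```

With 25 files and a batch size of 10, this produces two full batches and a final batch of 5.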
Supported Object Types
works, authors, sources, institutions, topics, keywords, publishers, funders, concepts
Behavior & Notes
- No downloads: data is read directly from S3 via DuckDB `httpfs`. No temp files, no cleanup needed.
- DuckDB: installs and loads `httpfs` automatically on init. Sets `PRAGMA threads` to `n_workers`.
- Object cache: `PRAGMA enable_object_cache=true` is set by default for repeated queries on the same files.
- S3 auth: OpenAlex S3 is public. No credentials needed.
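
The start-up behavior described in these notes roughly corresponds to a handful of SQL statements run once per connection. The sketch below only builds those statements as strings; `setup_statements` is a hypothetical helper, and the exact statements and order the library uses are an assumption.

```python
def setup_statements(n_workers=4, object_cache=True):
    """Return the SQL a connection could run once at start-up (hypothetical)."""
    statements = [
        "INSTALL httpfs",  # fetch the extension if it is not installed yet
        "LOAD httpfs",     # enable s3:// and https:// paths in this session
        f"PRAGMA threads={int(n_workers)}",
    ]
    if object_cache:
        statements.append("PRAGMA enable_object_cache=true")
    return statements


for statement in setup_statements(n_workers=4):
    print(statement)
```

With the `duckdb` package installed, each statement could be passed to `execute` on a connection returned by `duckdb.connect()`.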
Testing
Dev dependencies include pytest.
uv sync --extra dev
uv run pytest -q
Tests mock the S3 client directly using unittest.mock to test the file listing and filtering logic without hitting real S3.
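
A test in this style might look like the sketch below. It assumes a boto3-style client exposing `list_objects_v2`; the function under test, `list_part_keys`, is a hypothetical stand-in for the package's real listing code, which may be structured differently.

```python
from unittest.mock import MagicMock


def list_part_keys(client, bucket, prefix):
    """Hypothetical listing helper: return the .gz keys under a prefix, sorted."""
    # Single call only; a real implementation would also handle pagination.
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    contents = response.get("Contents", [])
    return sorted(obj["Key"] for obj in contents if obj["Key"].endswith(".gz"))


def make_fake_client(keys):
    """Build a mock client whose listing call returns the given keys."""
    client = MagicMock()
    client.list_objects_v2.return_value = {"Contents": [{"Key": k} for k in keys]}
    return client


def test_listing_skips_non_gz_files():
    prefix = "data/works/updated_date=2025-01-01/"
    client = make_fake_client(
        [prefix + "part_001.gz", prefix + "manifest", prefix + "part_000.gz"]
    )
    keys = list_part_keys(client, "openalex", prefix)
    assert keys == [prefix + "part_000.gz", prefix + "part_001.gz"]
    client.list_objects_v2.assert_called_once_with(Bucket="openalex", Prefix=prefix)
```

Because the client is a `MagicMock`, the test never opens a network connection and can also assert on the exact arguments passed to the listing call.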
Development
- Source layout: `src/pyalexs3/`
- Typed package marker: `src/pyalexs3/py.typed`
License
MIT © EurekAI
Citation
If you use this package for research, please cite it with this BibTeX entry:
@misc{pyalexs32025,
author = {Adityam Ghosh},
title = {pyalexs3},
howpublished = {\url{https://github.com/EurekAI-Org/pyalexs3}},
year = {2025},
note = {[Accessed 09-10-2025]},
}