Official command-line interface for OpenAlex
OpenAlex Official CLI
Official command-line interface for OpenAlex. Download work metadata and full-text content (PDFs, TEI XML) in bulk.
Note: This package was formerly known as openalex-content-downloader. If you have that installed, please switch to openalex-official.
Installation
pip install openalex-official
Quick Start
# Download metadata for works matching a filter
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325"
# Download metadata + PDFs
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325" \
--content pdf
# Download metadata + PDFs + TEI XML
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325" \
--content pdf,xml
# Download specific works by ID or DOI
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--ids "W2741809807,10.1038/nature12373"
# Download from a list of IDs via stdin
cat work_ids.txt | openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--stdin
# Download to S3
openalex download \
--api-key YOUR_API_KEY \
--storage s3 \
--s3-bucket my-bucket \
--s3-prefix openalex/ \
--filter "topics.id:T12345"
# Check API key status
openalex status --api-key YOUR_API_KEY
Features
- Metadata-first approach - JSON metadata is always saved; content files are optional
- High-throughput async downloads - Configurable concurrency for millions of works
- Automatic checkpointing - Resume interrupted downloads without re-downloading
- Adaptive rate limiting - Automatically adjusts to API conditions
- Multiple storage backends - Local filesystem or S3
- Progress tracking - Rich terminal UI with live stats, or headless logging
- Flexible filtering - Use any OpenAlex filter syntax
- Multiple input modes - Filter, explicit IDs, or piped stdin
- DOI support - Auto-detects and resolves DOIs to OpenAlex work IDs
CLI Reference
openalex download
Download work metadata and optionally content (PDFs, TEI XML).
| Option | Description | Default |
|---|---|---|
| --api-key | OpenAlex API key (required) | $OPENALEX_API_KEY |
| --output, -o | Output directory | ./openalex-downloads |
| --storage | Storage backend: local or s3 | local |
| --s3-bucket | S3 bucket name | - |
| --s3-prefix | S3 key prefix | "" |
| --filter | OpenAlex filter string | None (all works) |
| --ids | Comma-separated work IDs or DOIs | - |
| --stdin | Read work IDs/DOIs from stdin | false |
| --content | Content to download: pdf, xml, or pdf,xml | None (metadata only) |
| --nested | Use nested folder structure (W##/##/) | false |
| --workers | Concurrent download workers (1-200) | 50 |
| --resume/--no-resume | Resume from checkpoint | true |
| --fresh | Ignore checkpoint, start fresh | false |
| --quiet, -q | Minimal output (log file only) | false |
| --verbose, -v | Extra debug output | false |
openalex status
Check API key status and credit information.
| Option | Description |
|---|---|
| --api-key | OpenAlex API key (required) |
Filter Examples
# Recent articles
--filter "publication_year:>2020,type:article"
# Specific topic
--filter "topics.id:T12345"
# From a specific institution
--filter "authorships.institutions.id:I123456789"
# Open access only
--filter "open_access.is_oa:true"
# Combined filters
--filter "publication_year:2023,type:article,open_access.is_oa:true"
See OpenAlex filter documentation for all available filters.
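When filter strings are assembled in a script rather than typed by hand, it can help to build them from key/value pairs. The sketch below assumes, based only on the examples above, that conditions are joined with commas and booleans are written in lowercase. `build_filter` is a hypothetical helper and is not part of the openalex-official package.

```python
def build_filter(conditions: dict) -> str:
    """Join field:value pairs with commas, mirroring the examples above.

    Hypothetical helper: booleans are lowercased to match `is_oa:true`.
    """
    parts = []
    for field, value in conditions.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{field}:{value}")
    return ",".join(parts)


filter_string = build_filter({
    "publication_year": 2023,
    "type": "article",
    "open_access.is_oa": True,
})
print(filter_string)
```

The resulting string can be passed to --filter as a single quoted argument.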
File Organization
By default, files are saved flat in the output directory. Metadata is always saved as JSON:
output/
├── W2741809807.json # metadata (always saved)
├── W2741809807.pdf # content (if --content pdf)
├── W2741809807.tei.xml # content (if --content xml)
├── W1234567890.json
└── .openalex-checkpoint.json
For large downloads (>10,000 files), use --nested to organize files in a nested structure that avoids filesystem issues:
output/
├── W27/
│ └── 41/
│ ├── W2741809807.json
│ └── W2741809807.pdf
├── W12/
│ └── 34/
│ └── W1234567890.json
└── .openalex-checkpoint.json
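For scripts that need to locate a file in the nested layout, the mapping can be sketched as follows. This is inferred from the example tree above (first level is "W" plus the first two digits, second level is the next two digits) and may not match the tool's behavior for every ID; `nested_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def nested_path(work_id: str, suffix: str = ".json") -> PurePosixPath:
    """Map a work ID such as W2741809807 to W27/41/W2741809807.json.

    Assumption: the layout follows the W##/##/ pattern shown in the tree.
    """
    digits = work_id[1:]
    if not (work_id.startswith("W") and digits.isdigit() and len(digits) >= 4):
        raise ValueError(f"unexpected work ID: {work_id!r}")
    # work_id[:3] is "W" plus two digits; digits[2:4] is the next two digits.
    return PurePosixPath(work_id[:3], digits[2:4], work_id + suffix)


print(nested_path("W2741809807"))
print(nested_path("W1234567890", ".pdf"))
```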
When downloading by DOI, files are named using the DOI (with / replaced by _):
output/
├── 10.1038_nature12373.json
└── 10.1038_nature12373.pdf
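The naming rule above is a plain character substitution, which a script can reproduce to predict where a DOI's files will land. `doi_filename` is a hypothetical helper that applies only the rule stated here (replace / with _); any further normalization the tool may perform is not covered.

```python
def doi_filename(doi: str, suffix: str = ".json") -> str:
    """Return the file name for a DOI, with "/" replaced by "_"."""
    return doi.replace("/", "_") + suffix


print(doi_filename("10.1038/nature12373"))
print(doi_filename("10.1038/nature12373", ".pdf"))
```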
Checkpointing
The downloader automatically saves progress to .openalex-checkpoint.json in the output directory. If interrupted, run the same command again to resume.
To start fresh and ignore the checkpoint:
openalex download --api-key KEY --output ./data --fresh
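The internal format of the checkpoint file is not described here. As a generic illustration of the checkpoint-and-resume pattern, and not the tool's actual implementation, the sketch below records completed IDs in a JSON file and skips them on the next run. The file name `demo-checkpoint.json`, the `completed` key, and all three functions are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the set of IDs recorded as completed, or an empty set."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text())["completed"])
    return set()


def save_done(checkpoint: Path, done: set) -> None:
    """Write to a temp file, then rename, so an interruption cannot
    leave a half-written checkpoint behind."""
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps({"completed": sorted(done)}))
    os.replace(tmp, checkpoint)


def run(ids, checkpoint: Path, process) -> list:
    """Process each ID not already in the checkpoint; return those processed."""
    done = load_done(checkpoint)
    processed = []
    for work_id in ids:
        if work_id in done:
            continue  # finished in an earlier run
        process(work_id)
        done.add(work_id)
        save_done(checkpoint, done)
        processed.append(work_id)
    return processed


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = Path(tmpdir) / "demo-checkpoint.json"
    first = run(["W1", "W2"], ckpt, process=lambda _id: None)
    second = run(["W1", "W2", "W3"], ckpt, process=lambda _id: None)
    print(first, second)
```

In the demo, the second call only processes the ID that the first call did not see, which is the behavior described above for re-running the same command.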
Logging
All activity is logged to openalex-download.log in the output directory, regardless of terminal mode.
High-Throughput Deployment
The download speed is typically limited by network bandwidth, not the tool or API. On a typical home connection (~400 Mbps), expect ~10-15 files/sec (~1M files/day). To achieve higher throughput, deploy from a cloud environment.
Performance scaling:
| Environment | Bandwidth | Workers | Expected Rate |
|---|---|---|---|
| Home connection | 400 Mbps | 50 | ~10-15 files/sec |
| Cloud VM (standard) | 1-5 Gbps | 100-150 | ~30-50 files/sec |
| Cloud VM (high-perf) | 10+ Gbps | 200-300 | ~60+ files/sec |
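The per-day figures follow from multiplying the rate by the 86,400 seconds in a day. A small sketch of that arithmetic, useful for estimating how long a given corpus would take at a sustained rate (the function names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def files_per_day(files_per_second: float) -> int:
    """Files downloaded in 24 hours at a sustained rate."""
    return int(files_per_second * SECONDS_PER_DAY)


def days_to_download(total_files: int, files_per_second: float) -> float:
    """Days needed to download total_files at a sustained rate."""
    return total_files / (files_per_second * SECONDS_PER_DAY)


for rate in (10, 15, 30, 50, 60):
    print(f"{rate} files/sec -> {files_per_day(rate):,} files/day")
```

At 10-15 files/sec this gives 864,000 to 1,296,000 files per day, consistent with the ~1M files/day figure quoted for a home connection.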
Recommendations for large-scale downloads:
- Run from cloud - Deploy on AWS EC2, GCP, or Azure VMs with high network bandwidth. Instances close to Cloudflare edge locations will have lower latency.
- Increase workers - Use --workers 150 or higher to saturate available bandwidth. Monitor with verbose mode to find the optimal setting.
- Use S3 storage - For very large downloads, stream directly to S3 instead of local disk:
openalex download \
--api-key KEY \
--storage s3 \
--s3-bucket my-corpus \
--workers 200
- Parallelize across machines - For the full corpus, run multiple instances with different filters (e.g., by publication year) on separate machines.
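One way to split a job by publication year is to generate one command per year and run each on its own machine. The sketch below only builds the command strings, using flags from the CLI reference. It assumes the API key is supplied through the $OPENALEX_API_KEY environment variable listed as the default for --api-key. The per-year S3 prefix is an illustrative choice, and `yearly_commands` is a hypothetical helper.

```python
import shlex


def yearly_commands(years, base_filter="", bucket="my-corpus", workers=200):
    """Build one `openalex download` command string per publication year."""
    commands = []
    for year in years:
        conditions = [f"publication_year:{year}"]
        if base_filter:
            conditions.append(base_filter)
        argv = [
            "openalex", "download",
            "--storage", "s3",
            "--s3-bucket", bucket,
            # Assumption: a separate prefix per year keeps outputs apart.
            "--s3-prefix", f"openalex/{year}/",
            "--filter", ",".join(conditions),
            "--workers", str(workers),
        ]
        commands.append(shlex.join(argv))  # shell-safe quoting
    return commands


commands = yearly_commands([2022, 2023], base_filter="type:article")
for command in commands:
    print(command)
```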
Roadmap
We plan to add more commands to the CLI, including:
- CSV/JSON export of search results
- More entity types beyond works
Have a feature request? Open an issue.
Requirements
- Python 3.9+
- OpenAlex API key with sufficient credits
Documentation
Full documentation: docs.openalex.org
License
MIT