OpenAlex Official CLI

Official command-line interface for OpenAlex. Download work metadata and full-text content (PDFs, TEI XML) in bulk.

Note: This package was formerly known as openalex-content-downloader. If you have that installed, please switch to openalex-official.

Installation

pip install openalex-official

Quick Start

# Download metadata for works matching a filter
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325"

# Download metadata + PDFs
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325" \
  --content pdf

# Download metadata + PDFs + TEI XML
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325" \
  --content pdf,xml

# Download specific works by ID or DOI
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --ids "W2741809807,10.1038/nature12373"

# Download from a list of IDs via stdin
cat work_ids.txt | openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --stdin

# Download to S3
openalex download \
  --api-key YOUR_API_KEY \
  --storage s3 \
  --s3-bucket my-bucket \
  --s3-prefix openalex/ \
  --filter "topics.id:T12345"

# Check API key status
openalex status --api-key YOUR_API_KEY

Features

  • Metadata-first approach - JSON metadata is always saved; content files are optional
  • High-throughput async downloads - Configurable concurrency for millions of works
  • Automatic checkpointing - Resume interrupted downloads without re-downloading
  • Adaptive rate limiting - Automatically adjusts to API conditions
  • Multiple storage backends - Local filesystem or S3
  • Progress tracking - Rich terminal UI with live stats, or headless logging
  • Flexible filtering - Use any OpenAlex filter syntax
  • Multiple input modes - Filter, explicit IDs, or piped stdin
  • DOI support - Auto-detects and resolves DOIs to OpenAlex work IDs

CLI Reference

openalex download

Download work metadata and optionally content (PDFs, TEI XML).

Option                Description                                Default
--api-key             OpenAlex API key (required)                $OPENALEX_API_KEY
--output, -o          Output directory                           ./openalex-downloads
--storage             Storage backend: local or s3               local
--s3-bucket           S3 bucket name                             -
--s3-prefix           S3 key prefix                              ""
--filter              OpenAlex filter string                     None (all works)
--ids                 Comma-separated work IDs or DOIs           -
--stdin               Read work IDs/DOIs from stdin              false
--content             Content to download: pdf, xml, or pdf,xml  None (metadata only)
--nested              Use nested folder structure (W##/##/)      false
--workers             Concurrent download workers (1-200)        50
--resume/--no-resume  Resume from checkpoint                     true
--fresh               Ignore checkpoint, start fresh             false
--quiet, -q           Minimal output (log file only)             false
--verbose, -v         Extra debug output                         false
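
When downloads are driven from a Python script, the options in the table above can be assembled into an argument list and handed to subprocess. The sketch below is illustrative only: build_download_command is a hypothetical helper, not part of openalex-official, and it uses only flags listed in the table.

```python
import subprocess  # only needed if you actually run the command


def build_download_command(api_key, output="./openalex-downloads",
                           filter_str=None, content=None,
                           workers=50, nested=False):
    """Build an argv list for `openalex download` (hypothetical helper)."""
    if not 1 <= workers <= 200:
        raise ValueError("--workers must be between 1 and 200")
    cmd = ["openalex", "download", "--api-key", api_key, "--output", output]
    if filter_str:
        cmd += ["--filter", filter_str]
    if content:
        # content is an iterable such as ["pdf", "xml"]
        cmd += ["--content", ",".join(content)]
    if nested:
        cmd.append("--nested")
    cmd += ["--workers", str(workers)]
    return cmd


cmd = build_download_command("YOUR_API_KEY", output="./results",
                             filter_str="topics.id:T10325",
                             content=["pdf", "xml"])
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when a filter contains characters such as > or spaces.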

openalex status

Check API key status and credit information.

Option     Description
--api-key  OpenAlex API key (required)

Filter Examples

# Recent articles
--filter "publication_year:>2020,type:article"

# Specific topic
--filter "topics.id:T12345"

# From a specific institution
--filter "authorships.institutions.id:I123456789"

# Open access only
--filter "open_access.is_oa:true"

# Combined filters
--filter "publication_year:2023,type:article,open_access.is_oa:true"

See the OpenAlex filter documentation for all available filters.
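
The examples above all follow the same pattern: field:value pairs joined by commas. A minimal sketch of building such a string programmatically follows. build_filter is a hypothetical helper, not part of the CLI, and it does no validation of field names.

```python
def build_filter(conditions):
    """Join field/value pairs into a comma-separated filter string."""
    parts = []
    for field, value in conditions.items():
        if isinstance(value, bool):
            # the examples use lowercase booleans (true/false)
            value = "true" if value else "false"
        parts.append(f"{field}:{value}")
    return ",".join(parts)


print(build_filter({
    "publication_year": 2023,
    "type": "article",
    "open_access.is_oa": True,
}))
# publication_year:2023,type:article,open_access.is_oa:true
```

Comparison operators can be passed as part of the value, for example {"publication_year": ">2020"}.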

File Organization

By default, files are saved flat in the output directory. Metadata is always saved as JSON:

output/
├── W2741809807.json     # metadata (always saved)
├── W2741809807.pdf      # content (if --content pdf)
├── W2741809807.tei.xml  # content (if --content xml)
├── W1234567890.json
└── .openalex-checkpoint.json

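For large downloads (>10,000 files), use --nested to spread files across subdirectories keyed on the leading digits of the work ID. This avoids the slowdowns many filesystems show when a single directory holds a very large number of files: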

output/
├── W27/
│   └── 41/
│       ├── W2741809807.json
│       └── W2741809807.pdf
├── W12/
│   └── 34/
│       └── W1234567890.json
└── .openalex-checkpoint.json
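
To locate a file in the nested layout from a script, the directory names can be derived from the work ID. The sketch below infers the rule from the example tree above (first two digits, then the next two). nested_path is a hypothetical helper, and how the tool treats IDs with fewer than four digits is not documented here, so the sketch rejects them.

```python
from pathlib import PurePosixPath


def nested_path(work_id, suffix=".json"):
    """Map a work ID to its nested location, e.g. W27/41/W2741809807.json."""
    digits = work_id[1:]  # drop the leading "W"
    if not (work_id.startswith("W") and digits.isdigit() and len(digits) >= 4):
        raise ValueError(f"unexpected work ID: {work_id!r}")
    return PurePosixPath("W" + digits[:2], digits[2:4], work_id + suffix)


print(nested_path("W2741809807"))          # W27/41/W2741809807.json
print(nested_path("W1234567890", ".pdf"))  # W12/34/W1234567890.pdf
```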

When downloading by DOI, files are named using the DOI (with / replaced by _):

output/
├── 10.1038_nature12373.json
└── 10.1038_nature12373.pdf
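
The DOI naming rule above is a plain character substitution. A minimal sketch, where doi_to_filename is a hypothetical helper name:

```python
def doi_to_filename(doi, suffix=".json"):
    """Replace '/' with '_' so the DOI is a valid single path component."""
    return doi.replace("/", "_") + suffix


print(doi_to_filename("10.1038/nature12373"))          # 10.1038_nature12373.json
print(doi_to_filename("10.1038/nature12373", ".pdf"))  # 10.1038_nature12373.pdf
```

The substitution is not reversible in general, because a DOI may itself contain underscores. Keep the original DOI list if you need to map files back to DOIs.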

Checkpointing

The downloader automatically saves progress to .openalex-checkpoint.json in the output directory. If interrupted, run the same command again to resume.

To start fresh and ignore the checkpoint:

openalex download --api-key KEY --output ./data --fresh
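
The general pattern behind checkpoint-based resuming can be sketched as follows. This is a generic illustration, not the tool's implementation: the format of .openalex-checkpoint.json is not documented here, so the sketch uses its own file name (example-checkpoint.json) and a made-up {"completed": [...]} schema.

```python
import json
import tempfile
from pathlib import Path


def load_completed(path):
    """Return the set of IDs recorded as done, or an empty set if no file exists."""
    if path.exists():
        return set(json.loads(path.read_text())["completed"])
    return set()


def run(work_ids, path, fail_on=None):
    """Process IDs in order, skipping any already recorded in the checkpoint."""
    done = load_completed(path)
    processed = []
    for work_id in work_ids:
        if work_id in done:
            continue  # finished in an earlier run
        if work_id == fail_on:
            raise RuntimeError("simulated interruption")
        processed.append(work_id)  # a real tool would download here
        done.add(work_id)
        path.write_text(json.dumps({"completed": sorted(done)}))
    return processed


checkpoint = Path(tempfile.mkdtemp()) / "example-checkpoint.json"
ids = ["W1", "W2", "W3"]
try:
    run(ids, checkpoint, fail_on="W3")  # first run stops before W3
except RuntimeError:
    pass
resumed = run(ids, checkpoint)  # second run picks up where the first stopped
print(resumed)  # ['W3']
```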

Logging

All activity is logged to openalex-download.log in the output directory, regardless of terminal mode.

High-Throughput Deployment

Download speed is typically limited by network bandwidth, not by the tool or the API. On a home connection of about 400 Mbps, expect roughly 10-15 files/sec (about 1M files/day). For higher throughput, run from a cloud environment.

Performance scaling:

Environment           Bandwidth  Workers    Expected Rate
Home connection       400 Mbps   50         ~10-15 files/sec
Cloud VM (standard)   1-5 Gbps   100-150    ~30-50 files/sec
Cloud VM (high-perf)  10+ Gbps   200 (max)  ~60+ files/sec
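
As a quick check on these figures, the per-second rates can be converted to daily totals. This is plain arithmetic; the function names are illustrative, and real throughput depends on file sizes and network conditions.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def files_per_day(files_per_sec):
    """Convert a sustained per-second rate into files per day."""
    return files_per_sec * SECONDS_PER_DAY


def days_to_download(total_files, files_per_sec):
    """Estimate how many days a job of total_files takes at a given rate."""
    return total_files / files_per_day(files_per_sec)


for rate in (10, 15, 30, 60):
    print(f"{rate} files/sec -> {files_per_day(rate):,} files/day")
# 10 files/sec -> 864,000 files/day
# 15 files/sec -> 1,296,000 files/day
# 30 files/sec -> 2,592,000 files/day
# 60 files/sec -> 5,184,000 files/day
```

The 10-15 files/sec range brackets the roughly 1M files/day quoted for a home connection. At 30 files/sec, days_to_download(10_000_000, 30) gives about 3.9 days for ten million files.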

Recommendations for large-scale downloads:

  1. Run from cloud - Deploy on AWS EC2, GCP, or Azure VMs with high network bandwidth. Instances close to Cloudflare edge locations will have lower latency.

  2. Increase workers - Use --workers 150 or higher (the documented maximum is 200) to saturate available bandwidth. Run with --verbose to find the optimal setting.

  3. Use S3 storage - For very large downloads, stream directly to S3 instead of local disk:

    openalex download \
      --api-key KEY \
      --storage s3 \
      --s3-bucket my-corpus \
      --workers 200
    
  4. Parallelize across machines - For the full corpus, run multiple instances with different filters (e.g., by publication year) on separate machines.
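
Splitting by publication year, as suggested in the last recommendation, amounts to generating one filter string per year and giving each machine its own. A minimal sketch, where year_shard_filters is a hypothetical helper and the year range is only an example:

```python
def year_shard_filters(first_year, last_year, base_filter=""):
    """Return one filter string per publication year, inclusive."""
    filters = []
    for year in range(first_year, last_year + 1):
        shard = f"publication_year:{year}"
        filters.append(f"{shard},{base_filter}" if base_filter else shard)
    return filters


for flt in year_shard_filters(2021, 2023, "open_access.is_oa:true"):
    print(f'--filter "{flt}"')
# --filter "publication_year:2021,open_access.is_oa:true"
# --filter "publication_year:2022,open_access.is_oa:true"
# --filter "publication_year:2023,open_access.is_oa:true"
```

Because the shards do not overlap, each instance can checkpoint and resume independently.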

Roadmap

We plan to add more commands to the CLI, including:

  • CSV/JSON export of search results
  • More entity types beyond works

Have a feature request? Open an issue.

Requirements

  • Python 3.9+
  • OpenAlex API key with sufficient credits

Documentation

Full documentation: docs.openalex.org

License

MIT
