WordPress API Crawler and Backup Tool
wparc: WordPress API Crawler and Backup Tool
wparc (WordPress Archive) is a powerful command-line tool for backing up and archiving public data from WordPress websites using the WordPress REST API. It provides a simple, efficient way to extract posts, pages, media metadata, comments, and other content from any WordPress site that has the REST API enabled.
What is wparc?
wparc connects to WordPress sites via their /wp-json/ REST API endpoint (available by default in WordPress 4.7+) and extracts all publicly accessible data. Unlike traditional backup tools that require database access or FTP credentials, wparc only needs the site URL and works entirely through the public API.
Key capabilities:
- Extract all public WordPress content (posts, pages, media, comments, etc.)
- Download media files (images, videos, documents)
- Analyze and discover WordPress API routes
- Generate structured, machine-readable backups (JSONL format)
- Work with any WordPress site without special permissions
Main features
- Data extraction: Dump all WordPress REST API routes and data
- Media download: Download all media files referenced in the API
- Route analysis: Analyze and categorize WordPress API routes, automatically test unknown routes, and generate YAML updates
- Smart pagination: Automatically detects and uses WordPress pagination headers (X-WP-TotalPages, X-WP-Total) for accurate progress tracking
- Progress tracking: Shows "page X of Y" progress when pagination headers are available
- SSL verification: Secure by default with configurable SSL verification
- Configurable: Customize timeout, page size, retry count, and more
- Type-safe: Full type hints for better IDE support and code quality
- Error handling: Comprehensive error handling with custom exceptions and actionable error messages
Installation
Production Installation
pip install --upgrade pip setuptools
pip install wparc
Development Installation
git clone https://github.com/ruarxive/wparc.git
cd wparc
pip install -e ".[dev]"
Python version
Python version 3.6 or greater is required. Python 3.9+ is recommended for best performance and compatibility.
System Requirements
- Python 3.6+
- Internet connection for downloading data
- Sufficient disk space (depends on site size - typically 100MB to several GB)
- Write permissions in the current directory (for creating output folders)
Usage
Quick Start
Here's a typical workflow for backing up a WordPress site:
# Step 1: Verify the site's REST API is accessible
wparc ping example.com
# Step 2: Analyze available routes (optional, but recommended)
wparc analyze example.com --verbose
# Step 3: Dump all data from the WordPress site
wparc dump example.com --verbose
# Step 4: Download all media files
wparc getfiles example.com --verbose
Basic Commands
Get help:
wparc --help
# Or get help for a specific command
wparc ping --help
wparc dump --help
Ping a WordPress site (verify API accessibility):
wparc ping example.com
Dump all data from a WordPress site:
wparc dump example.com
Download media files (requires dump to be run first):
wparc getfiles example.com
Analyze WordPress API routes and test unknown routes:
wparc analyze example.com
Command Options
Ping Command
The ping command verifies that a WordPress site's REST API is accessible and returns basic information about available endpoints. This is useful as a first step to check if a site supports the WordPress REST API before attempting to dump data.
Syntax:
wparc ping <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output with detailed logging information
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360)
What it does:
- Connects to the WordPress REST API endpoint (/wp-json/)
- Verifies the API is accessible and responding
- Counts total available routes
- Returns endpoint URL and route count
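The check described above can be approximated in a few lines of Python. This is a minimal sketch, not wparc's actual implementation: the helper names endpoint_url, count_routes and ping_site are hypothetical, and the only assumption about the server is that the index document at /wp-json/ contains a "routes" object keyed by route path, as the WordPress REST API index does.

```python
import json
from urllib.request import urlopen


def endpoint_url(domain: str, https: bool = True) -> str:
    """Build the REST API index URL for a bare domain (hypothetical helper)."""
    scheme = "https" if https else "http"
    return f"{scheme}://{domain}/wp-json/"


def count_routes(index: dict) -> int:
    """Count the routes advertised in a parsed /wp-json/ index document."""
    return len(index.get("routes", {}))


def ping_site(domain: str, https: bool = True, timeout: int = 360) -> int:
    """Fetch the index over the network and return the number of routes."""
    with urlopen(endpoint_url(domain, https), timeout=timeout) as response:
        return count_routes(json.load(response))
```

Only ping_site touches the network; the other two helpers are pure functions and can be exercised offline.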
Examples:
Basic ping to check if API is accessible:
wparc ping example.com
Ping with HTTPS and verbose output to see detailed connection information:
wparc ping example.com --https --verbose
Ping a site with self-signed SSL certificate (development/testing only):
wparc ping localhost --no-verify-ssl --no-https
Ping with custom timeout for slow connections:
wparc ping slow-site.com --timeout 600
Expected Output:
✓ Endpoint https://example.com/wp-json/ is OK
✓ Total routes: 45
Use Cases:
- Quick health check before running a full dump
- Verifying REST API is enabled on a WordPress site
- Testing connectivity and SSL configuration
- Discovering how many routes are available
Dump Command
The dump command extracts all data from a WordPress site's REST API and saves it to local JSONL files. This is the primary command for backing up WordPress content including posts, pages, media metadata, comments, users, and other API endpoints.
Syntax:
wparc dump <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing detailed progress and operations
- -a, --all: Include unknown API routes in the dump (default: True). Set to --no-all to only dump known routes
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360). Increase for slow sites or large datasets
- --page-size INTEGER: Number of items per page (default: 100). Lower values use less memory but more requests
- --retry-count INTEGER: Number of retry attempts for failed requests (default: 5)
What it does:
- Discovers all available WordPress REST API routes
- Iterates through paginated endpoints (posts, pages, media, etc.)
- Downloads all data and saves to JSONL files (one JSON object per line)
- Creates organized directory structure: <domain>/data/
- Shows progress with pagination information when available
- Handles errors gracefully with automatic retries
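To illustrate what "automatic retries" can look like, here is a minimal sketch of a retry wrapper. It is not taken from wparc: the function name with_retries, the base_delay parameter and the exponential backoff policy are all assumptions made for illustration. Only the default of 5 retries mirrors the documented --retry-count default.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], retry_count: int = 5,
                 base_delay: float = 1.0) -> T:
    """Call fetch() until it succeeds, making at most retry_count + 1 attempts."""
    for attempt in range(retry_count + 1):
        try:
            return fetch()
        except OSError:
            # urllib.error.URLError and socket timeouts both derive from OSError.
            if attempt == retry_count:
                raise  # out of attempts: surface the last error
            # Exponential backoff is an assumption, not wparc's documented policy.
            time.sleep(base_delay * 2 ** attempt)
    # Only reached when retry_count is negative and the loop never runs.
    raise ValueError("retry_count must be non-negative")
```

A caller would wrap a single page request, for example with_retries(lambda: fetch_page(url)), where fetch_page is whatever function performs the HTTP request.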
Examples:
Basic dump of all data from a WordPress site:
wparc dump example.com
Dump with verbose output to see detailed progress:
wparc dump example.com --verbose
Dump only known routes (skip unknown/custom routes):
wparc dump example.com --no-all
Dump from a large site with custom settings for better performance:
wparc dump large-site.com --timeout 600 --page-size 50 --retry-count 3
Dump from a development site with self-signed certificate:
wparc dump dev.local --no-verify-ssl --no-https
Dump with HTTP instead of HTTPS (for local development):
wparc dump localhost --no-https
Expected Output:
Processing route: /wp/v2/posts
Processing page 1 of 5 (100 items per page)
Processing page 2 of 5 (100 items per page)
...
✓ Data collection complete: 45 routes processed, 2 skipped
Output Files:
After completion, you'll find files in <domain>/data/:
- wp-json.json - Main API index with all routes
- wp_v2_posts.jsonl - All posts (one JSON object per line)
- wp_v2_pages.jsonl - All pages
- wp_v2_media.jsonl - Media metadata (use getfiles to download actual files)
- wp_v2_comments.jsonl - Comments
- wp_v2_users.jsonl - Users (public data only)
- Additional route files as discovered
Note: The dump command automatically uses WordPress pagination headers (X-WP-TotalPages and X-WP-Total) when available to show accurate progress like "Processing page 1 of 5". This provides better visibility into the extraction progress for large sites.
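The header handling described in the note can be sketched as follows. The X-WP-Total and X-WP-TotalPages header names come from the WordPress REST API. The function names pagination_info and progress_line, and the fallback message used when the headers are missing, are assumptions for illustration rather than wparc's own code.

```python
from typing import Mapping, Optional, Tuple


def pagination_info(headers: Mapping[str, str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (total_items, total_pages) from response headers, or None where absent."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name)
        if value is not None and value.strip().isdigit():
            return int(value)
        return None

    return as_int("x-wp-total"), as_int("x-wp-totalpages")


def progress_line(page: int, headers: Mapping[str, str], page_size: int = 100) -> str:
    """Format a progress message, falling back when the page count is unknown."""
    _, total_pages = pagination_info(headers)
    if total_pages is None:
        # Fallback wording is hypothetical; the source only documents the "X of Y" form.
        return f"Processing page {page} ({page_size} items per page)"
    return f"Processing page {page} of {total_pages} ({page_size} items per page)"
```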
Use Cases:
- Full site backup before migration or updates
- Content archival and preservation
- Data analysis and research
- Creating local copies for development
- Extracting content for static site generation
Getfiles Command
The getfiles command downloads all media files (images, videos, documents, etc.) that were referenced in the media metadata collected by the dump command. It reads from wp_v2_media.jsonl and downloads each file to the local filesystem, preserving the original directory structure.
Syntax:
wparc getfiles <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing download progress and file details
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
What it does:
- Reads media metadata from <domain>/data/wp_v2_media.jsonl (created by dump command)
- Downloads each media file referenced in the metadata
- Preserves original WordPress directory structure (wp-content/uploads/...)
- Supports resumable downloads (can be interrupted and resumed)
- Uses concurrent workers for faster downloads (default: 5 workers)
- Creates checkpoint files to track progress
Prerequisites:
- Must run wparc dump <domain> first to generate the media metadata file
- Requires the wp_v2_media.jsonl file to exist in <domain>/data/
Examples:
Download all media files after running dump:
# First, dump the data
wparc dump example.com
# Then download the media files
wparc getfiles example.com
Download with verbose output to see progress:
wparc getfiles example.com --verbose
Download from a site with SSL issues (development only):
wparc getfiles dev.local --no-verify-ssl
Expected Output:
Reading media files from example.com/data/wp_v2_media.jsonl
Found 1,234 media files to download
Downloading: image1.jpg [████████████] 100%
Downloading: video1.mp4 [████████████] 100%
...
✓ File download complete: 1234 downloaded, 0 failed, 0 skipped
Output Structure:
Files are downloaded to <domain>/files/wp-content/uploads/ preserving the original WordPress structure:
example.com/
└── files/
    └── wp-content/
        └── uploads/
            ├── 2024/
            │   └── 12/
            │       └── image.jpg
            └── 2025/
                └── 01/
                    └── video.mp4
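The layout above follows from keeping the path component of each media URL. A minimal sketch of that mapping, assuming each record in wp_v2_media.jsonl carries the standard WordPress source_url field, is shown below. The function name local_path_for and the path traversal check are illustrative additions, not wparc's documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path_for(source_url: str, domain: str) -> str:
    """Map a media URL to a path under <domain>/files/, keeping the URL's directories."""
    url_path = urlparse(source_url).path.lstrip("/")
    parts = PurePosixPath(url_path).parts
    if ".." in parts:
        # Defensive check (an assumption): never write outside the backup directory.
        raise ValueError(f"refusing path traversal in {source_url!r}")
    return str(PurePosixPath(domain, "files", *parts))
```

For a record whose source_url is https://example.com/wp-content/uploads/2024/12/image.jpg, the function returns example.com/files/wp-content/uploads/2024/12/image.jpg.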
Features:
- Resumable: If interrupted, can resume from checkpoint
- Concurrent: Downloads multiple files simultaneously (5 workers by default)
- Progress Tracking: Shows download progress for each file
- Error Handling: Continues downloading even if some files fail
- Checkpoint System: Saves progress to resume later
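The combination of concurrent workers, a checkpoint and continue-on-error can be sketched with the standard library. This is an illustration under stated assumptions, not wparc's code: the download_all name, the JSON checkpoint format and the injected download callable are all hypothetical. Only the default of 5 workers comes from the text above.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def download_all(urls: Iterable[str], download: Callable[[str], None],
                 checkpoint: Path, workers: int = 5) -> Dict[str, int]:
    """Download URLs concurrently, skipping any already recorded in the checkpoint."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep order
    pending = [url for url in unique if url not in done]
    stats = {"downloaded": 0, "failed": 0, "skipped": len(unique) - len(pending)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download, url): url for url in pending}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
            except Exception:
                stats["failed"] += 1  # one bad file does not stop the run
                continue
            stats["downloaded"] += 1
            done.add(url)
            # Rewrite the checkpoint after each success so an interrupted run can resume.
            checkpoint.write_text(json.dumps(sorted(done)))
    return stats
```

Because the checkpoint is written only from the main thread, inside the as_completed loop, the worker threads never compete for the file.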
Use Cases:
- Complete site backup including all media files
- Migrating media files to a new server
- Creating offline archives of WordPress sites
- Downloading media for local development environments
- Preserving media assets for archival purposes
Analyze Command
The analyze command performs a comprehensive analysis of a WordPress site's REST API routes. It compares discovered routes against a database of known routes, identifies unknown routes, automatically tests them to determine their characteristics, and generates YAML updates that can be added to the known routes database.
Syntax:
wparc analyze <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing detailed analysis and route testing progress
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360)
What it does:
- Route Discovery: Fetches all available routes from /wp-json/
- Route Comparison: Compares against known routes database (known_routes.yml)
- Route Categorization: Categorizes routes into:
  - protected: Routes requiring authentication (401/403 responses)
  - public-list: Public routes returning arrays/lists (e.g., posts, pages)
  - public-dict: Public routes returning objects/dictionaries
  - useless: Routes that don't provide useful data (individual items, regex patterns)
  - unknown: Routes not in the known routes database
- Automatic Testing: Tests unknown routes to determine their category
- YAML Generation: Creates ready-to-use YAML for updating known_routes.yml
Route Categories Explained:
- Protected: Requires authentication, returns 401/403 errors. Not useful for public backups.
- Public-list: Returns arrays of items (posts, pages, comments). Useful for bulk data extraction.
- Public-dict: Returns single objects/dictionaries. May contain useful site information.
- Useless: Individual item endpoints (e.g., /wp/v2/posts/123) or regex patterns. Not useful for bulk extraction.
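The categories above map naturally onto the status code and the shape of the JSON body returned by a test request. The sketch below shows one way to express that rule. The function names categorize and looks_useless are hypothetical, and the exact heuristics wparc applies may differ.

```python
from typing import Any


def categorize(status: int, body: Any) -> str:
    """Classify a route from the HTTP status and decoded JSON body of a test request."""
    if status in (401, 403):
        return "protected"
    if status == 200 and isinstance(body, list):
        return "public-list"
    if status == 200 and isinstance(body, dict):
        return "public-dict"
    # Other statuses or non-JSON bodies are left uncategorized in this sketch.
    return "unknown"


def looks_useless(route: str) -> bool:
    """Heuristic: a route containing a named regex group addresses a single item."""
    return "(?P<" in route
```

In the /wp-json/ index, single-item routes appear as patterns with a named group for the item id, which is what looks_useless checks for.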
Examples:
Basic analysis of a WordPress site:
wparc analyze example.com
Analysis with verbose output to see route testing details:
wparc analyze example.com --verbose
Analyze a site with custom plugins that may have unknown routes:
wparc analyze custom-site.com --verbose
Analyze a development site:
wparc analyze dev.local --no-verify-ssl --no-https
Expected Output:
✓ Analysis complete for https://example.com/wp-json/
✓ Total routes: 45
Route Statistics:
Protected: 12
Public-list: 20
Public-dict: 5
Useless: 3
Unknown: 5
⚠ Found 5 unknown routes
Testing routes: 100%|████████████| 5/5 [00:02<00:00, 2.1route/s]
✓ Testing complete for unknown routes
Categorized routes:
public-list: 3
protected: 2
======================================================================
YAML Update for known_routes.yml:
======================================================================
protected:
- /wp/v2/users/me
- /wp/v2/settings
public-list:
- /wp/v2/custom-post-type
- /wp/v2/another-route
- /wp/v2/third-route
======================================================================
You can add the above YAML to known_routes.yml
With Verbose Output:
When using --verbose, you'll see additional details:
Testing route: /wp/v2/custom-post-type
Status: 200
Response type: list
Category: public-list
Testing route: /wp/v2/users/me
Status: 401
Category: protected
...
Using the Generated YAML:
The command outputs YAML that can be directly added to wparc/data/known_routes.yml:
- Copy the YAML output from the command
- Open wparc/data/known_routes.yml (or your local copy)
- Add the routes under the appropriate category
- This helps improve route recognition for future dumps
Use Cases:
- Discovering custom WordPress plugins and their API endpoints
- Understanding what data is available from a WordPress site
- Contributing to the known routes database
- Planning data extraction strategies
- Identifying protected vs. public endpoints
- Researching WordPress API capabilities
Output Structure
After running wparc dump <domain>, the following directory structure is created in your current working directory:
<domain>/
├── data/
│   ├── wp-json.json            # Main API index with all routes and endpoints
│   ├── wp_v2_posts.jsonl       # All posts (one JSON object per line)
│   ├── wp_v2_pages.jsonl       # All pages
│   ├── wp_v2_media.jsonl       # Media metadata (URLs, titles, descriptions)
│   ├── wp_v2_comments.jsonl    # Comments
│   ├── wp_v2_users.jsonl       # Users (public data only)
│   ├── wp_v2_categories.jsonl  # Categories
│   ├── wp_v2_tags.jsonl        # Tags
│   └── ...                     # Other routes discovered from the API
└── files/                      # Media files (created after running getfiles)
    └── wp-content/
        └── uploads/
            ├── 2024/
            │   └── 12/
            │       └── image.jpg
            └── 2025/
                └── 01/
                    └── video.mp4
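A quick way to check what a dump produced is to count the records in each JSONL file under <domain>/data/. The helper below is a small sketch for that purpose. The name count_records is hypothetical, and it relies only on the one-object-per-line layout described in this document.

```python
from pathlib import Path
from typing import Dict


def count_records(data_dir: str) -> Dict[str, int]:
    """Return the number of non-empty lines (records) in each .jsonl file of a dump."""
    counts = {}
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts
```

Calling count_records("example.com/data") returns a dictionary keyed by file name, which is convenient for comparing two backups of the same site.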
File Formats
JSONL Format: Most data files use JSONL (JSON Lines) format where each line is a valid JSON object. This format is:
- Memory efficient (can process line by line)
- Easy to parse programmatically
- Suitable for large datasets
Example JSONL file content (wp_v2_posts.jsonl):
{"id":1,"date":"2024-01-01T00:00:00","title":{"rendered":"Hello World"},"content":{"rendered":"<p>Welcome to WordPress!</p>"},"excerpt":{"rendered":"<p>Welcome...</p>"},"author":1,"featured_media":0}
{"id":2,"date":"2024-01-02T00:00:00","title":{"rendered":"Sample Post"},"content":{"rendered":"<p>This is a sample post.</p>"},"excerpt":{"rendered":"<p>This is...</p>"},"author":1,"featured_media":123}
Reading JSONL files:
import json

# Each line of the file is one JSON object, so decode it line by line.
with open('example.com/data/wp_v2_posts.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        post = json.loads(line)
        print(post['title']['rendered'])
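Because posts and media are stored in separate files, a common next step is to join them. The sketch below resolves each post's featured image by matching the post's featured_media value against the id of a media record and reading that record's source_url. These are standard WordPress REST API fields, but the helper names read_jsonl and featured_images are hypothetical.

```python
import json
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one decoded object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def featured_images(posts_path: str, media_path: str) -> Dict[int, str]:
    """Map post id to the URL of its featured image, for posts that have one."""
    media_urls = {item["id"]: item.get("source_url") for item in read_jsonl(media_path)}
    return {
        post["id"]: media_urls[post["featured_media"]]
        for post in read_jsonl(posts_path)
        # featured_media is 0 when no image is set, so the lookup yields None.
        if media_urls.get(post.get("featured_media"))
    }
```

Only the media index is held in memory; posts are streamed one line at a time.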
Complete Backup Example
Here's a complete example of backing up a WordPress site:
# 1. Check if the site is accessible
$ wparc ping mysite.com
✓ Endpoint https://mysite.com/wp-json/ is OK
✓ Total routes: 52
# 2. Analyze routes to understand what's available
$ wparc analyze mysite.com --verbose
✓ Analysis complete for https://mysite.com/wp-json/
✓ Total routes: 52
Route Statistics:
Protected: 15
Public-list: 28
Public-dict: 4
Useless: 3
Unknown: 2
# 3. Dump all data
$ wparc dump mysite.com --verbose
Processing route: /wp/v2/posts
Processing page 1 of 12 (100 items per page)
...
✓ Data collection complete: 50 routes processed, 2 skipped
# 4. Download media files
$ wparc getfiles mysite.com --verbose
Found 1,234 media files to download
Downloading: image1.jpg [████████████] 100%
...
✓ File download complete: 1234 downloaded, 0 failed, 0 skipped
# Result: Complete backup in mysite.com/ directory
$ ls -lh mysite.com/
data/ files/
Development
Running Tests
pytest
Code Quality
# Format code
black wparc/
# Type checking
mypy wparc/
# Linting
flake8 wparc/
Common Workflows
Complete Site Backup
The most common use case - creating a complete backup of a WordPress site:
# Step 1: Verify connectivity
wparc ping example.com
# Step 2: Extract all data
wparc dump example.com --verbose
# Step 3: Download all media files
wparc getfiles example.com --verbose
Quick Content Analysis
Analyze what content is available without downloading everything:
# Get route statistics
wparc analyze example.com --verbose
# Check the total route count
wparc ping example.com
Large Site Backup
For sites with thousands of posts or slow connections:
# Use smaller page size and longer timeout
wparc dump large-site.com \
--timeout 900 \
--page-size 25 \
--retry-count 10 \
--verbose
Development Site Backup
For local development sites or sites with self-signed certificates:
# Disable SSL verification (development only!)
wparc dump dev.local --no-verify-ssl --no-https
# Download media files
wparc getfiles dev.local --no-verify-ssl
Incremental Backup Strategy
For regular backups, you can run dump repeatedly. Each run overwrites the existing files, so copy or archive the previous output first if you need to keep it:
#!/bin/bash
# Daily backup script
DATE=$(date +%Y-%m-%d)
wparc dump example.com --verbose > backup-$DATE.log 2>&1
wparc getfiles example.com --verbose >> backup-$DATE.log 2>&1
Discovering Custom Endpoints
Find and document custom WordPress plugin endpoints:
# Analyze and get YAML for unknown routes
wparc analyze custom-site.com --verbose > analysis.txt
# The output will include YAML that can be added to known_routes.yml
Troubleshooting
SSL Certificate Errors
If you encounter SSL certificate errors, you can temporarily disable verification:
wparc dump example.com --no-verify-ssl
Warning: This is not recommended for production use as it makes you vulnerable to man-in-the-middle attacks.
Timeout Errors
If requests are timing out, increase the timeout:
wparc dump example.com --timeout 600
Large Sites
For large WordPress sites, you may want to adjust the page size:
wparc dump example.com --page-size 50
The dump command automatically detects pagination information from WordPress API headers, so you'll see progress like "Processing page 1 of 10" when available. This helps you estimate completion time for large extractions.
Domain Validation Errors
If you see domain validation errors, ensure you're using a valid domain format:
- Valid: example.com, www.example.com, subdomain.example.com
- Invalid: http://example.com (protocol will be stripped automatically)
- Invalid: example.com/ (trailing slash will be removed automatically)
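The normalization and validation rules listed above can be sketched as follows. This is an approximation for illustration only: the normalize_domain name and the character rule are assumptions, and the sketch raises ValueError rather than wparc's own DomainValidationError.

```python
import re

# Labels of letters, digits and hyphens separated by single dots (a deliberately simple rule).
_DOMAIN_RE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$")


def normalize_domain(raw: str) -> str:
    """Strip the protocol and trailing slash, then validate what remains."""
    domain = re.sub(r"^https?://", "", raw.strip(), flags=re.IGNORECASE).rstrip("/")
    if ".." in domain:
        raise ValueError(f"Invalid domain {domain!r}: Domain cannot contain consecutive dots")
    if not _DOMAIN_RE.match(domain):
        raise ValueError(f"Invalid domain {domain!r}: unexpected characters")
    return domain
```

With this rule, http://example.com and example.com/ both normalize to example.com, while example..com is rejected.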
Error Messages
wparc provides detailed error messages with actionable suggestions:
DomainValidationError: Invalid domain format
Error: Invalid domain 'example..com': Domain cannot contain consecutive dots
Solution: Check the domain name format. Valid formats: example.com, www.example.com, subdomain.example.com
APIError: WordPress API request failed
WordPress API error for https://example.com/wp-json/ (HTTP 404)
Suggestion: Check if the WordPress REST API is enabled on this site.
Solution:
- Verify the site is accessible: curl https://example.com/wp-json/
- Check if REST API is disabled by plugins or theme
- Ensure WordPress version is 4.7+ (REST API was introduced in 4.7)
SSLVerificationError: SSL certificate verification failed
SSL verification failed for https://example.com/wp-json/: certificate verify failed
Suggestion: If you trust this site, you can use --no-verify-ssl (not recommended for production).
Solution:
- For production sites: Fix SSL certificate issues on the server
- For development/testing: Use --no-verify-ssl flag (not recommended for production)
FileDownloadError: File download failed
Failed to download https://example.com/wp-content/uploads/image.jpg: Connection timeout
Suggestion: Check your internet connection and verify the URL is accessible.
Solution:
- Check internet connectivity
- Verify the media file URL is accessible
- Try downloading manually to verify the file exists
- Check if the site requires authentication for media files
MediaFileNotFoundError: Media file list not found
Media file not found: example.com/data/wp_v2_media.jsonl
Suggestion: Run 'wparc dump <domain>' first to generate the media file list.
Solution: Run wparc dump <domain> before running wparc getfiles <domain>
Common Issues and Solutions
Issue: "Connection timeout" errors
# Solution: Increase timeout
wparc dump example.com --timeout 900
Issue: "Too many requests" or rate limiting
# Solution: Reduce page size and increase retry count
wparc dump example.com --page-size 25 --retry-count 10
Issue: "SSL certificate verify failed" on valid sites
# Solution: Update certificates (macOS/Linux)
# Or temporarily disable for testing (not recommended)
wparc dump example.com --no-verify-ssl
Issue: Dump completes but getfiles fails
# Solution: Check if wp_v2_media.jsonl exists
ls -lh example.com/data/wp_v2_media.jsonl
# If missing, the site may not have media endpoints
# Try running dump again with --verbose to see what routes were processed
Issue: Out of memory errors on large sites
# Solution: Use smaller page size
wparc dump example.com --page-size 25
Issue: Some routes return 401/403 errors
# This is normal - protected routes require authentication
# These routes are automatically skipped during dump
# Use analyze command to see which routes are protected
wparc analyze example.com
Tips & Best Practices
Performance Optimization
For large sites:
- Use smaller --page-size (25-50) to reduce memory usage
- Increase --timeout for slow connections
- Run during off-peak hours to avoid impacting site performance
- Use --verbose to monitor progress
For faster downloads:
- The getfiles command uses 5 concurrent workers by default
- Ensure stable internet connection for best results
- Consider running getfiles separately if dump takes a long time
Data Management
File organization:
- Each domain creates its own directory structure
- JSONL files can be processed line-by-line (memory efficient)
- Media files preserve original WordPress directory structure
Backup strategy:
- Run regular dumps to capture content changes
- Store backups in version control or cloud storage
- Consider compressing old backups to save space
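Compressing a finished backup needs nothing beyond the standard library. The sketch below packs a <domain>/ directory into a dated tar.gz archive. The archive_backup name and the <domain>-YYYY-MM-DD.tar.gz naming scheme are assumptions for illustration, not a wparc feature.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_backup(domain_dir: str, dest_dir: str = ".") -> Path:
    """Pack a <domain>/ backup directory into <domain>-YYYY-MM-DD.tar.gz."""
    source = Path(domain_dir)
    target = Path(dest_dir) / f"{source.name}-{date.today().isoformat()}.tar.gz"
    with tarfile.open(str(target), "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the domain directory.
        archive.add(str(source), arcname=source.name)
    return target
```

Running archive_backup("example.com") after a dump leaves the original directory in place, so it can be removed or overwritten by the next run once the archive has been verified.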
Working with Custom WordPress Sites
Custom post types:
- Use analyze command to discover custom endpoints
- Custom routes are automatically included when using --all flag (default)
- Generated YAML from analyze can improve future dumps
Plugin-specific content:
- Many WordPress plugins expose their data via REST API
- Use analyze to discover plugin endpoints
- Some plugin data may require authentication (will be skipped)
Development Workflow
Local development:
# Backup production site
wparc dump production.com
# Restore to local (requires custom import script)
# Use JSONL files to import data into local WordPress
Testing:
- Use ping command to verify API accessibility
- Use analyze to understand available endpoints
- Test with --verbose to see detailed operations
Limitations
What wparc can do:
- Extract all public WordPress content
- Download publicly accessible media files
- Work with any WordPress site (4.7+)
- Discover and analyze API routes
What wparc cannot do:
- Access private/protected content (requires authentication)
- Extract database structure or settings
- Backup WordPress core files or themes
- Access content behind paywalls or membership plugins
- Extract user passwords or sensitive data
Security
- SSL verification enabled by default: All HTTPS connections verify SSL certificates
- Secure file operations: All file operations use secure context managers
- No command injection: Safe subprocess usage prevents command injection vulnerabilities
- Error handling: Proper error handling prevents information leakage
- No authentication: Only accesses publicly available data (no credentials required or stored)
Security recommendations:
- Always use --https for production sites (default)
- Only use --no-verify-ssl for development/testing
- Review downloaded content before using in production
- Keep wparc updated to latest version
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
See LICENSE file for details.
Documentation
For detailed information about WordPress REST API endpoints, see WP_API_ENDPOINTS.md.
Changelog
See CHANGELOG.md for a list of changes.