# sf-config-builder

Manage Screaming Frog `.seospiderconfig` files programmatically.
## Installation

```shell
pip install sf-config-builder
```
## Requirements

- Screaming Frog SEO Spider must be installed (it provides the JARs used for deserialization)
- Python 3.8+
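Before using the library, it can help to verify both requirements are in place. The pre-flight check below is purely illustrative and not part of the package; the function name and path argument are assumptions.

```python
import shutil
from pathlib import Path

def check_prerequisites(sf_path: str) -> list:
    """Illustrative pre-flight check (not part of this package): confirm
    a Java runtime is on PATH and that the Screaming Frog install
    directory exists before loading any configs."""
    problems = []
    if shutil.which("java") is None:
        problems.append("Java runtime not found on PATH")
    if not Path(sf_path).exists():
        problems.append(f"Screaming Frog not found at {sf_path}")
    return problems
```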
## Quick Start

```python
from sfconfig import SFConfig

# Load an existing config (auto-detects the SF installation)
config = SFConfig.load("base.seospiderconfig")

# Or specify a custom SF path
config = SFConfig.load("base.seospiderconfig", sf_path="D:/Apps/Screaming Frog SEO Spider")

# Configure for an e-commerce audit
config.max_urls = 100000
config.rendering_mode = "JAVASCRIPT"

# Enable CSS/JS crawling (recommended for JS rendering)
config.set("mCrawlConfig.mCrawlCSS", True)
config.set("mCrawlConfig.mCrawlJavaScript", True)

# Add custom extractions
config.add_extraction("Price", "//span[@class='price']")
config.add_extraction("SKU", "//span[@itemprop='sku']")
config.add_extraction("Stock", ".availability", selector_type="CSS")

# Add exclude patterns
config.add_exclude(r".*\.pdf$")
config.add_exclude(r".*/admin/.*")

# Save and run
config.save("client-audit.seospiderconfig")
config.run_crawl("https://example.com", output_folder="./results")
```
## Features

### Inspect Configs

```python
config = SFConfig.load("my.seospiderconfig")

# Get a specific field
max_urls = config.get("mCrawlConfig.mMaxUrls")

# List all fields
for field in config.fields():
    print(f"{field['path']}: {field['value']}")

# Filter by prefix
crawl_fields = config.fields(prefix="mCrawlConfig")
```
### Modify Configs

```python
# Direct field access
config.set("mCrawlConfig.mMaxUrls", 100000)

# Convenience properties
config.max_urls = 100000
config.max_depth = 10
config.rendering_mode = "JAVASCRIPT"  # STATIC | JAVASCRIPT
config.robots_mode = "IGNORE"         # RESPECT | IGNORE
config.crawl_delay = 0.5
config.user_agent = "MyBot/1.0"

# CSS/JS crawling & storage
config.set("mCrawlConfig.mCrawlCSS", True)
config.set("mCrawlConfig.mStoreCSS", True)
config.set("mCrawlConfig.mCrawlJavaScript", True)
config.set("mCrawlConfig.mStoreJavaScript", True)

# Performance / rate limiting
config.set("mPerformanceConfig.mLimitPerformance", True)
config.set("mPerformanceConfig.mUrlRequestsPerSecond", 5.0)
```
> **Note:** When switching to `JAVASCRIPT` rendering mode, Screaming Frog's GUI automatically enables CSS/JS crawling. When building configs programmatically, explicitly set `mCrawlCSS` and `mCrawlJavaScript` to `True` to ensure proper rendering.
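Since the two flags travel with the rendering mode, a tiny helper can keep them in sync. The helper below is a hypothetical sketch, not part of the package; it works with any SFConfig-like object exposing `rendering_mode` and `set()`.

```python
def enable_js_rendering(config) -> None:
    # Hypothetical helper (not in the package): mirror the GUI's behaviour
    # by enabling CSS/JS crawling whenever JAVASCRIPT rendering is chosen.
    config.rendering_mode = "JAVASCRIPT"
    config.set("mCrawlConfig.mCrawlCSS", True)
    config.set("mCrawlConfig.mCrawlJavaScript", True)
```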
### Custom Extractions

```python
# Add extraction rules
config.add_extraction(
    name="Price",
    selector="//span[@class='price']",
    selector_type="XPATH",  # XPATH | CSS | REGEX
    extract_mode="TEXT"     # TEXT | HTML_ELEMENT | INNER_HTML
)

# List extractions
for ext in config.extractions:
    print(f"{ext['name']}: {ext['selector']}")

# Remove by name
config.remove_extraction("Price")

# Clear all
config.clear_extractions()
```
### Custom Searches

```python
# Add custom search filters
config.add_custom_search(
    name="Filter 1",
    query=".*",
    mode="CONTAINS",
    data_type="REGEX",
    scope="HTML",
    case_sensitive=False
)

# Remove by name
config.remove_custom_search("Filter 1")

# Clear all
config.clear_custom_searches()
```
### Custom JavaScript

```python
# Add a custom JavaScript extraction
config.add_custom_javascript(
    name="Extractor 1",
    javascript="return document.title;",
    script_type="EXTRACTION",
    timeout_secs=10,
    content_types="text/html"
)

# Remove by name
config.remove_custom_javascript("Extractor 1")

# Clear all
config.clear_custom_javascript()
```
### Exclude/Include Patterns

```python
# Excludes (URLs matching these patterns are skipped)
config.add_exclude(r".*\.pdf$")
config.add_exclude(r".*/admin/.*")

# Includes (only URLs matching these are crawled)
config.add_include(r".*/products/.*")

# List patterns
print(config.excludes)
print(config.includes)
```
### Compare Configs

```python
from sfconfig import SFConfig

diff = SFConfig.diff("old.seospiderconfig", "new.seospiderconfig")
if diff.has_changes:
    print(f"Found {diff.change_count} differences:")
    print(diff)

# Filter by prefix
crawl_changes = diff.changes_for("mCrawlConfig")
```
### Test Extractions

```python
# Test a selector against a live URL before a full crawl
result = config.test_extraction(
    url="https://example.com/product",
    selector="//span[@class='price']",
    selector_type="XPATH"
)
if result["match_count"] > 0:
    print(f"Found: {result['matches']}")
else:
    print("Selector didn't match - fix before crawling")
```
### Run Crawls

```python
# Blocking crawl
config.run_crawl(
    url="https://example.com",
    output_folder="./results",
    export_tabs=["Internal:All", "Response Codes:All"],
    export_format="csv",
    timeout=3600
)

# Async crawl
process = config.run_crawl_async(
    url="https://example.com",
    output_folder="./results"
)
# Do other work...
process.wait()  # Block until complete
```
## Multi-Client Workflow

```python
from sfconfig import SFConfig

clients = [
    {"domain": "client1.com", "max_urls": 50000},
    {"domain": "client2.com", "max_urls": 100000},
]

for client in clients:
    config = SFConfig.load("agency-base.seospiderconfig")
    config.max_urls = client["max_urls"]
    config.add_extraction("Price", "//span[@class='price']")
    config.save(f"/tmp/{client['domain']}.seospiderconfig")
    config.run_crawl(
        url=f"https://{client['domain']}",
        output_folder=f"./results/{client['domain']}"
    )
```
## Error Handling

```python
from sfconfig import (
    SFConfig,
    SFNotFoundError,
    SFValidationError,
    SFParseError,
    SFCrawlError
)

try:
    config = SFConfig.load("my.seospiderconfig")
    config.set("mInvalidField", 123)
    config.save()
except SFNotFoundError:
    print("Install Screaming Frog first")
except SFValidationError as e:
    print(f"Invalid field: {e}")
except SFParseError as e:
    print(f"Could not parse config: {e}")
except SFCrawlError as e:
    print(f"Crawl failed: {e}")
```
## Environment Variables

| Variable | Description |
|---|---|
| `SF_PATH` | Custom path to SF's JAR directory |
| `SF_CLI_PATH` | Custom path to the SF CLI executable |
| `JAVA_HOME` | Custom Java installation path |
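One plausible resolution order for `SF_PATH` is sketched below: an explicit override wins, otherwise common install locations are probed. The default paths and function name are illustrative assumptions, not the library's actual lookup logic.

```python
import os
from pathlib import Path
from typing import Optional

# Illustrative default install locations -- assumptions, not the
# package's real search list.
DEFAULT_SF_PATHS = [
    Path("C:/Program Files/Screaming Frog SEO Spider"),
    Path("/usr/share/screamingfrogseospider"),
    Path("/Applications/Screaming Frog SEO Spider.app/Contents/app"),
]

def resolve_sf_path() -> Optional[Path]:
    """Sketch: an explicit SF_PATH wins; otherwise probe the common
    install locations above and fall back to None."""
    override = os.environ.get("SF_PATH")
    if override:
        return Path(override)
    for candidate in DEFAULT_SF_PATHS:
        if candidate.is_dir():
            return candidate
    return None
```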
## Architecture

```
User Python code
        |
        v
+------------------+
|    sfconfig      |  (Python wrapper)
|  - SFConfig      |
|  - SFDiff        |
+--------+---------+
         | subprocess.run()
         v
+------------------+
|  ConfigBuilder   |  (Java CLI, bundled ~50KB)
|      .jar        |
+--------+---------+
         | classpath includes
         v
+------------------+
|    SF's JARs     |  (from the user's local SF install, NOT bundled)
+------------------+
```
At runtime, the library builds a classpath combining:

- `ConfigBuilder.jar` (bundled with this package)
- `{SF_INSTALL_PATH}/*` (the user's local Screaming Frog JARs)

This means:

- Only our small JAR is distributed (no licensing issues)
- SF's proprietary JARs are used from the user's existing installation
- Compatibility is maintained across SF versions
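The runtime flow above can be sketched as follows. This is a hedged illustration of the mechanism, assuming hypothetical paths and CLI arguments; it is not the package's actual internals.

```python
import os
import subprocess
from pathlib import Path

def build_classpath(bundled_jar: Path, sf_install: Path) -> str:
    # Join the bundled CLI JAR with a wildcard over the user's SF install,
    # using the platform's classpath separator (':' on POSIX, ';' on Windows).
    return os.pathsep.join([str(bundled_jar), str(sf_install / "*")])

def run_config_builder(bundled_jar: Path, sf_install: Path, *args: str):
    # Shell out to Java with the combined classpath. The main-class name
    # and arguments are illustrative, not the real CLI's interface.
    return subprocess.run(
        ["java", "-cp", build_classpath(bundled_jar, sf_install),
         "ConfigBuilder", *args],
        capture_output=True, text=True, check=True,
    )
```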
## Development

### Building the Java CLI

The Java source (`ConfigBuilder.java`) is included alongside the JAR in `sfconfig/java/`. To rebuild after modifications:

```shell
cd sfconfig/java

# Compile against SF's JARs (as a compile-time dependency)
javac -cp "/path/to/Screaming Frog SEO Spider/lib/*" ConfigBuilder.java

# Package into a JAR (include the source for transparency)
jar cfe ConfigBuilder.jar ConfigBuilder *.class ConfigBuilder.java

# Clean up loose class files
rm -f *.class
```

**Important:** Only bundle `ConfigBuilder.jar`. Do NOT bundle any JARs from SF's install directory; those are proprietary and already on the user's machine.
### Installing for Development

```shell
cd sf-config-tool
pip install -e ".[dev]"
pytest tests/
```
## License

MIT