# SpiderChef 🕷️👨🍳

A recipe-based web scraping tool.
SpiderChef is a powerful, recipe-based web scraping tool that makes data extraction systematic and reproducible. By defining scraping procedures as "recipes" with sequential "steps," SpiderChef allows you to craft elegant, maintainable data extraction workflows.
## Features
- Recipe-Based Architecture: Define extraction workflows as YAML recipes
- Modular Step System: Build complex scraping logic from reusable components
- Async Support: Handle both synchronous and asynchronous extraction steps
- Type Safety: Fully typed for better development experience
- Extensible Design: Easily create custom steps for specialized extraction needs
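The sync/async split mentioned above can be handled with a single dispatcher that awaits coroutine steps and calls plain ones directly. A minimal, stdlib-only sketch of that pattern (the class and function names here are illustrative, not SpiderChef's actual API):

```python
import asyncio
import inspect
from typing import Any


# Hypothetical step base classes illustrating the sync/async split.
class SyncStep:
    def execute(self, previous_output: Any = None) -> Any:
        raise NotImplementedError


class AsyncStep:
    async def execute(self, previous_output: Any = None) -> Any:
        raise NotImplementedError


async def run_step(step: Any, previous_output: Any = None) -> Any:
    """Await async steps, call sync steps directly."""
    if inspect.iscoroutinefunction(step.execute):
        return await step.execute(previous_output)
    return step.execute(previous_output)


class Upper(SyncStep):
    def execute(self, previous_output: Any = None) -> Any:
        return str(previous_output).upper()


class Echo(AsyncStep):
    async def execute(self, previous_output: Any = None) -> Any:
        await asyncio.sleep(0)  # yield control, as a real fetch would
        return previous_output


async def main() -> Any:
    out = await run_step(Upper(), "hello")
    return await run_step(Echo(), out)


print(asyncio.run(main()))  # HELLO
```

With one dispatcher, a recipe can mix sync and async steps freely without callers caring which is which.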
## Installation

```shell
# With the CLI:
pip install "spiderchef[cli]"

# Library only:
pip install spiderchef
```
## CLI Usage

```shell
# Run a recipe
spiderchef cook recipes/example.yaml

# Create a new recipe template
spiderchef recipe new my_extraction
```
## Library Usage

### Basic Usage

Basic usage is just loading a local recipe and "cooking" it to get the output data:

```python
import asyncio

from spiderchef import Recipe

# Load a recipe from a local YAML file
recipe = Recipe.from_yaml('recipe_example.yaml')

# Run the recipe
asyncio.run(recipe.cook())
```
### Custom Usage

If you want to extend the available steps with your own custom ones, you can do so like this:

```python
import asyncio
from typing import Any

from spiderchef import STEP_REGISTRY, AsyncStep, Recipe, SyncStep


# You can define your own custom steps like so
# (.name is a reserved keyword for steps):
class HelloStep(SyncStep):
    person_name: str

    def _execute(self, recipe: Recipe, previous_output: Any = None) -> str:
        return f"Hello There {self.person_name}"


# Steps can be sync or async.
class SleepStep(AsyncStep):
    sleep_time: int = 5

    async def _execute(self, recipe: Recipe, previous_output: Any = None) -> Any:
        await asyncio.sleep(self.sleep_time)
        return previous_output


CUSTOM_STEP_REGISTRY = {**STEP_REGISTRY, "hello": HelloStep, "sleep": SleepStep}

# Override the global step registry with your own
Recipe.step_registry = CUSTOM_STEP_REGISTRY

# You can initialize a recipe manually like so, or just use a YAML recipe.
recipe = Recipe(
    base_url="https://example.com",
    name="Example",
    steps=[
        HelloStep(name="Saying Hello", person_name="George"),
        SleepStep(name="Sleeping"),
    ],
)

# Run the recipe
asyncio.run(recipe.cook())

"""Output:
2025-05-07 16:33:01 [info ] 🥣🥄🔥 Cooking 'Example' recipe!
2025-05-07 16:33:01 [info ] ➡️ 1. Saying Hello... step_class=HelloStep
2025-05-07 16:33:01 [info ] ➡️ 2. Sleeping... step_class=SleepStep
2025-05-07 16:33:06 [info ] 🍞 'Example' recipe finished output='Hello There George'
"""
```
## Example Recipe

```yaml
base_url: https://example.com
name: ProductExtractor
steps:
  - type: fetch
    name: fetch_product_page
    page_type: text
    path: /products
    params:
      category: electronics
      sort: price_asc
  - type: regex
    name: extract_product_urls
    expression: '"(\/product\/[^"]+)"'
  - type: join_base_url
    name: format_urls
```
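The recipe above boils down to: fetch a page, pull relative product paths with the regex, and prefix the base URL. A stdlib-only sketch of the last two steps, using sample HTML in place of a real fetch (the variable names are mine, not SpiderChef's):

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://example.com"

# Stand-in for the output of the 'fetch' step.
html = '<a href="/product/123">A</a> <a href="/product/456">B</a>'

# 'regex' step: capture quoted relative product paths.
paths = re.findall(r'"(/product/[^"]+)"', html)

# 'join_base_url' step: make the paths absolute.
urls = [urljoin(BASE_URL, p) for p in paths]

print(urls)  # ['https://example.com/product/123', 'https://example.com/product/456']
```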
## Why SpiderChef?
Traditional web scraping often involves writing complex, difficult-to-maintain code that mixes HTTP requests, parsing, and business logic. SpiderChef separates these concerns by:
- Breaking extraction into discrete, reusable steps
- Defining workflows as declarative recipes
- Handling common extraction patterns with built-in steps
- Making scraping procedures reproducible and maintainable
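The step-chaining idea behind those points can be pictured as a fold over an ordered step list, each step receiving the previous step's output. A toy illustration of that control flow (not the library's code; the `cook` function and stub steps are hypothetical):

```python
from typing import Any, Callable


# Hypothetical: a 'recipe' is an ordered list of callables,
# each taking the previous step's output.
def cook(steps: list[Callable[[Any], Any]], initial: Any = None) -> Any:
    output = initial
    for step in steps:
        output = step(output)
    return output


toy_recipe = [
    lambda _: "  raw page  ",    # fetch (stubbed)
    lambda text: text.strip(),   # clean
    lambda text: text.upper(),   # transform
]

print(cook(toy_recipe))  # RAW PAGE
```

Because each step only sees the previous output, steps stay independent and can be reordered or reused across recipes.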
Whether you're scraping product data, monitoring prices, or extracting research information, SpiderChef helps you build structured, reliable data extraction pipelines.
## Documentation

For full documentation, visit spiderchef.readthedocs.io.

## License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## File details

Details for the file `spiderchef-0.0.1.tar.gz`.

### File metadata

- Download URL: spiderchef-0.0.1.tar.gz
- Upload date:
- Size: 71.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.7.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7be39a999ff636ce4050ac2838b10ec2629f0c16aec55c950bf443ba2ca36801` |
| MD5 | `8ace86de5d4aff7b75afdeaca9221d1e` |
| BLAKE2b-256 | `2f1ce04aa1953ead97f47bc64eefd9a248b19ff390e6271e14868be0356e94ec` |
## File details

Details for the file `spiderchef-0.0.1-py3-none-any.whl`.

### File metadata

- Download URL: spiderchef-0.0.1-py3-none-any.whl
- Upload date:
- Size: 13.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.7.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4ad2d484fb4c4888918af36b4999360c5c06664eb508935a6211901059d05c09` |
| MD5 | `c395433a928c8ec41181127d074211a5` |
| BLAKE2b-256 | `11c36364f2a1431ed58fdab1bb720d91e3b7f49b33215fc353f2ef25da8c4037` |