
Spider Chef 🕷️👨‍🍳


                   /\
                  /  \
                 |  _ \                   _
                 | / \ \   .--,--        / \
                 |/   \ \  `.  ,.'      /   \
                 /     \ |  |___|  /\  /     \
                /|      \|  ~  ~  /  \/       \
        _______/_|_______\ (o)(o)/___/\_____   \
       /      /  |        (______)     \    \   \_
      /      /   |                      \    \
     /      /    |                       \    \
    /      /     |                        \    \
   /     _/      |                         \    \
  /             _|                          \    \_
_/                                           \_                                               

SpiderChef is a powerful, recipe-based web scraping tool that makes data extraction systematic and reproducible. By defining scraping procedures as "recipes" with sequential "steps," SpiderChef allows you to craft elegant, maintainable data extraction workflows.

Features

  • Recipe-Based Architecture: Define extraction workflows as YAML recipes
  • Modular Step System: Build complex scraping logic from reusable components
  • Async Support: Handle both synchronous and asynchronous extraction steps
  • Type Safety: Fully typed for a better development experience
  • Extensible Design: Easily create custom steps for specialized extraction needs

Installation

# If you want to use the CLI
pip install spiderchef[cli]

# If you only want the library
pip install spiderchef

CLI Usage

# Run a recipe
spiderchef cook recipes/example.yaml

# Create a new recipe template
spiderchef recipe new my_extraction

Library Usage

Basic Usage

At its simplest, you load a local recipe and "cook" it to get the output data:

import asyncio
from spiderchef import Recipe

# Load a recipe from a local YAML file
recipe = Recipe.from_yaml('recipe_example.yaml')
# Run a recipe
asyncio.run(recipe.cook())

Example Recipe

base_url: https://example.com
name: ProductExtractor
steps:
  - type: fetch
    name: fetch_product_page
    page_type: text
    path: /products
    params:
      category: electronics
      sort: price_asc
  
  - type: regex
    name: extract_product_urls
    expression: '"(\/product\/[^"]+)"'
  
  - type: join_base_url
    name: format_urls
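
Each step is handed the previous step's output, so the recipe above fetches the product listing page, extracts relative product URLs with a regex, and joins them onto base_url. As a minimal sketch (assuming the recipe is saved as recipe_example.yaml), running it looks like this:

import asyncio

from spiderchef import Recipe

# Load and cook the recipe above; each step receives the previous step's output.
recipe = Recipe.from_yaml("recipe_example.yaml")
product_urls = asyncio.run(recipe.cook())
# After join_base_url, the final output should be absolute product URLs.
print(product_urls)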

Custom Usage

If you want to extend the available steps with your own custom ones, you can do so like this:

import asyncio
from typing import Any

from spiderchef import STEP_REGISTRY, AsyncStep, Recipe, SyncStep


# You can define your own custom steps like so:
class HelloStep(SyncStep):
    # .name is a reserved keyword for steps
    person_name: str

    def _execute(self, recipe: Recipe, previous_output: Any = None) -> str:
        return f"Hello There {self.person_name}"


# Steps can be either sync or async.
class SleepStep(AsyncStep):
    sleep_time: int = 5

    async def _execute(self, recipe: Recipe, previous_output: Any = None) -> Any:
        await asyncio.sleep(self.sleep_time)
        return previous_output


CUSTOM_STEP_REGISTRY = {**STEP_REGISTRY, "hello": HelloStep, "sleep": SleepStep}

# Overrides the global step registry with your own
Recipe.step_registry = CUSTOM_STEP_REGISTRY

# You can initialise a recipe manually like this, or just load it from a YAML file.
recipe = Recipe(
    base_url="https://example.com",
    name="Example",
    steps=[
        HelloStep(name="Saying Hello", person_name="George"),
        SleepStep(
            name="Sleeping",
        ),
    ],
)

# Run a recipe
asyncio.run(recipe.cook())

"""Output:
2025-05-07 16:33:01 [info     ] 🥣🥄🔥 Cooking 'Example' recipe!
2025-05-07 16:33:01 [info     ] ➡️  1. Saying Hello...         step_class=HelloStep
2025-05-07 16:33:01 [info     ] ➡️  2. Sleeping...             step_class=SleepStep
2025-05-07 16:33:06 [info     ] 🍞 'Example' recipe finished output='Hello There George'
"""

Variable Replacement

SpiderChef supports variable replacement in your steps using the ${variable} syntax. Variables can be defined in the Recipe and will be automatically replaced when the step is executed:

import asyncio

# FetchStep backs the "fetch" step type; assuming it is exported at the
# package root alongside Recipe.
from spiderchef import FetchStep, Recipe

recipe = Recipe(
    name="Variable Example",
    base_url="https://example.com",
    # Default variables
    variables={
        "sort_order": "price_asc",
        "category": "smartphones",
    },
    steps=[
        # Variables are replaced automatically before execution
        FetchStep(
            name="Search Products",
            path="/products",
            params={
                "category": "${category}",
                "sort": "${sort_order}",
            },
        )
    ],
)

# Uses the default variables
asyncio.run(recipe.cook())

# Override a specific variable, making any recipe extendable
asyncio.run(recipe.cook(category="books"))

In YAML recipes, you can use the same syntax:

name: ProductExtractor
base_url: https://example.com
variables: # these are defaults
  category: electronics
  sort_order: price_asc
steps:
  - type: fetch
    name: fetch_product_page
    page_type: text
    path: /products
    params:
      category: ${category}
      sort: ${sort_order}

You can even save values produced mid-recipe as variables for later use with the save step:

name: ProductExtractor
base_url: https://example.com
variables:
  category: electronics
  sort_order: price_asc
steps:
  - type: fetch
    name: fetch_product_page
    path: /products
    params:
      category: ${category}
      sort: ${sort_order}
  - type: xpath
    name: extract_title
    expression: //h1
  - type: save
    variable: title
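
Once saved, the variable should be reusable in later steps through the same ${variable} syntax. As a hypothetical continuation of the recipe above (the step name and path are illustrative, not part of the library):

  - type: fetch
    name: search_related
    path: /search
    params:
      q: ${title}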

Why SpiderChef?

Traditional web scraping often involves writing complex, difficult-to-maintain code that mixes HTTP requests, parsing, and business logic. SpiderChef separates these concerns by:

  • Breaking extraction into discrete, reusable steps
  • Defining workflows as declarative recipes
  • Handling common extraction patterns with built-in steps
  • Making scraping procedures reproducible and maintainable

Whether you're scraping product data, monitoring prices, or extracting research information, SpiderChef helps you build structured, reliable data extraction pipelines.

Documentation

For full documentation, visit spiderchef.readthedocs.io.

The documentation includes:

  • Getting started guide
  • User guides for basic and advanced usage
  • API reference
  • Tutorials and examples
  • Contributing guidelines

To build the documentation locally:

make docs

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
