A recipe-based web scraping tool.

Spider Chef 🕷️👨‍🍳

                   /\
                  /  \
                 |  _ \                   _
                 | / \ \   .--,--        / \
                 |/   \ \  `.  ,.'      /   \
                 /     \ |  |___|  /\  /     \
                /|      \|  ~  ~  /  \/       \
        _______/_|_______\ (o)(o)/___/\_____   \
       /      /  |        (______)     \    \   \_
      /      /   |                      \    \
     /      /    |                       \    \
    /      /     |                        \    \
   /     _/      |                         \    \
  /             _|                          \    \_
_/                                           \
                                              \
                                               \_

SpiderChef is a powerful, recipe-based web scraping tool that makes data extraction systematic and reproducible. By defining scraping procedures as "recipes" with sequential "steps," SpiderChef allows you to craft elegant, maintainable data extraction workflows.

Features

  • Recipe-Based Architecture: Define extraction workflows as YAML recipes
  • Modular Step System: Build complex scraping logic from reusable components
  • Async Support: Handle both synchronous and asynchronous extraction steps
  • Type Safety: Fully typed for better development experience
  • Extensible Design: Easily create custom steps for specialized extraction needs

Installation

# To use the CLI (quoted so shells such as zsh do not expand the brackets)
pip install "spiderchef[cli]"

# For library usage only
pip install spiderchef

CLI Usage

# Run a recipe
spiderchef cook recipes/example.yaml

# Create a new recipe template
spiderchef recipe new my_extraction

Library Usage

Basic Usage

Basic usage is to load a local recipe and "cook" it to get the output data:

import asyncio
from spiderchef import Recipe

# Load a recipe from a local YAML file
recipe = Recipe.from_yaml('recipe_example.yaml')
# Run the recipe
asyncio.run(recipe.cook())

Custom Usage

To extend the available steps with your own custom ones, subclass SyncStep or AsyncStep and register the new classes:

import asyncio
from typing import Any

from spiderchef import STEP_REGISTRY, AsyncStep, Recipe, SyncStep


# You can define your own custom steps like so:
class HelloStep(SyncStep):
    # `name` is already reserved on steps, so use a different attribute name here
    person_name: str

    def _execute(self, recipe: Recipe, previous_output: Any = None) -> str:
        return f"Hello There {self.person_name}"


# Steps can be either sync or async.
class SleepStep(AsyncStep):
    sleep_time: int = 5

    async def _execute(self, recipe: Recipe, previous_output: Any = None) -> Any:
        await asyncio.sleep(self.sleep_time)
        return previous_output


CUSTOM_STEP_REGISTRY = {**STEP_REGISTRY, "hello": HelloStep, "sleep": SleepStep}

# Overrides the global step registry with your own
Recipe.step_registry = CUSTOM_STEP_REGISTRY

# You can initialize a recipe manually like so, or just use a YAML recipe.
recipe = Recipe(
    base_url="https://example.com",
    name="Example",
    steps=[
        HelloStep(name="Saying Hello", person_name="George"),
        SleepStep(
            name="Sleeping",
        ),
    ],
)

# Run a recipe
asyncio.run(recipe.cook())

"""Output:
2025-05-07 16:33:01 [info     ] 🥣🥄🔥 Cooking 'Example' recipe!
2025-05-07 16:33:01 [info     ] ➡️  1. Saying Hello...         step_class=HelloStep
2025-05-07 16:33:01 [info     ] ➡️  2. Sleeping...             step_class=SleepStep
2025-05-07 16:33:06 [info     ] 🍞 'Example' recipe finished output='Hello There George'
"""
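
The log above suggests that each step receives the previous step's output and that sync and async steps can be mixed in one recipe. Here is a minimal, stdlib-only sketch of that chaining idea. It does not use or reproduce SpiderChef's internals, and `say_hello`, `sleep_briefly`, and `run_steps` are hypothetical names chosen for illustration:

```python
import asyncio
import inspect
from typing import Any, Callable


# Hypothetical stand-ins for steps: plain callables that take the previous output.
def say_hello(previous_output: Any = None) -> str:
    return "Hello There George"


async def sleep_briefly(previous_output: Any = None) -> Any:
    # Short pause, then pass the input through unchanged.
    await asyncio.sleep(0.01)
    return previous_output


async def run_steps(steps: list[Callable[[Any], Any]]) -> Any:
    """Run steps in order, feeding each result into the next step."""
    output: Any = None
    for step in steps:
        result = step(output)
        # Async steps hand back an awaitable; sync steps return the value directly.
        if inspect.isawaitable(result):
            result = await result
        output = result
    return output


final_output = asyncio.run(run_steps([say_hello, sleep_briefly]))
print(final_output)
```

Because the second step returns its input untouched, the final output is whatever the first step produced, which matches the `output=` value in the log.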

Example Recipe

base_url: https://example.com
name: ProductExtractor
steps:
  - type: fetch
    name: fetch_product_page
    page_type: text
    path: /products
    params:
      category: electronics
      sort: price_asc
  
  - type: regex
    name: extract_product_urls
    expression: '"(\/product\/[^"]+)"'
  
  - type: join_base_url
    name: format_urls

Why SpiderChef?

Traditional web scraping often involves writing complex, difficult-to-maintain code that mixes HTTP requests, parsing, and business logic. SpiderChef separates these concerns by:

  • Breaking extraction into discrete, reusable steps
  • Defining workflows as declarative recipes
  • Handling common extraction patterns with built-in steps
  • Making scraping procedures reproducible and maintainable
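
The idea of declarative recipes built from reusable steps can be illustrated with a small registry pattern: a recipe is plain data, and a lookup table maps each step's `type` to a class. The sketch below is a simplified stand-in written with dataclasses only. `UpperStep`, `SuffixStep`, `REGISTRY`, and `build_steps` are hypothetical and are not part of SpiderChef:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class UpperStep:
    name: str

    def execute(self, previous_output: Any = None) -> Any:
        return str(previous_output).upper()


@dataclass
class SuffixStep:
    name: str
    suffix: str = "!"

    def execute(self, previous_output: Any = None) -> Any:
        return f"{previous_output}{self.suffix}"


# Hypothetical registry mapping a recipe's `type` key to a step class.
REGISTRY: dict[str, type] = {"upper": UpperStep, "suffix": SuffixStep}


def build_steps(step_configs: list[dict[str, Any]]) -> list[Any]:
    """Instantiate one step object per config entry, looked up by `type`."""
    steps = []
    for config in step_configs:
        options = dict(config)  # copy so the recipe data is not mutated
        step_class = REGISTRY[options.pop("type")]
        steps.append(step_class(**options))
    return steps


# The recipe is plain data, in the same shape a YAML file would load into.
recipe = {
    "name": "Shout",
    "steps": [
        {"type": "upper", "name": "uppercase"},
        {"type": "suffix", "name": "add_suffix", "suffix": "!!"},
    ],
}

output: Any = "hello"
for step in build_steps(recipe["steps"]):
    output = step.execute(output)
print(output)
```

Adding a new kind of step then only requires a new class and one registry entry; existing recipes stay unchanged.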

Whether you're scraping product data, monitoring prices, or extracting research information, SpiderChef helps you build structured, reliable data extraction pipelines.

Documentation

For full documentation, visit spiderchef.readthedocs.io.

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Download the file for your platform.

Source Distribution

spiderchef-0.0.2.tar.gz (115.9 kB)

Uploaded Source

Built Distribution

spiderchef-0.0.2-py3-none-any.whl (14.2 kB)

Uploaded Python 3

File details

Details for the file spiderchef-0.0.2.tar.gz.

File metadata

  • Download URL: spiderchef-0.0.2.tar.gz
  • Upload date:
  • Size: 115.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.7.3

File hashes

Hashes for spiderchef-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       45a712a0bf3ec2f64d0e5b7e37ab2fb296b426eb3f8cde05ddb0d5a9f71b4c3a
MD5          351be21b09a67e8e8bb05be840fe4651
BLAKE2b-256  f53a32e85ec1bf8d04ede9d63b6e600447286a42612afa0dd6646b4a7d1e294c

File details

Details for the file spiderchef-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: spiderchef-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 14.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.7.3

File hashes

Hashes for spiderchef-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       67cfa018647a80062e167f2b169a15e4dcb3fdf7385dd0264296e90ad9f0d5ae
MD5          953150801957531cd0a028146213df34
BLAKE2b-256  6ac48e728f18c3a5b1690eb7538d8f89d7451a69e5997d26fd0d8bff9ddf6d6f
