
Project description

Sneakpeek


Sneakpeek is a framework that helps you quickly and conveniently develop scrapers. It's the best choice for scrapers with specific, complex scraping logic that needs to run on a regular basis.

Demo

Here's a demo project that uses the Sneakpeek framework.

You can also run the demo using Docker:

docker run -it --rm -p 8080:8080 flulemon/sneakpeek-demo

Once it has started, head over to http://localhost:8080 to play around with it.

Quick start

So you want to create a new scraper. First, make sure you have Sneakpeek installed:

pip install sneakpeek-py

The next step is to implement the scraper logic (the so-called scraper handler):

# file: demo_scraper.py

import json
import logging

from pydantic import BaseModel

from sneakpeek.scraper_context import ScraperContext
from sneakpeek.scraper_handler import ScraperHandler


# This defines the model of the handler parameters, which are defined
# in the scraper config and then passed to the handler
class DemoScraperParams(BaseModel):
    url: str

# This is the class that actually implements the scraper logic
# Note that you need to inherit the implementation from
# the `sneakpeek.scraper_handler.ScraperHandler`
class DemoScraper(ScraperHandler):
    # You can have any dependencies you want and pass them
    # in the server configuration
    def __init__(self) -> None:
        self._logger = logging.getLogger(__name__)

    # Each handler must define its name so it later
    # can be referenced in scrapers' configuration
    @property
    def name(self) -> str:
        return "demo_scraper"

    # Some example function that processes the response
    # and extracts valuable information
    async def process_page(self, response: str):
        ...

    # This function is called by the worker to execute the logic
    # The only argument that is passed is `sneakpeek.scraper_context.ScraperContext`
    # It implements a basic async HTTP client and also provides the parameters
    # that are defined in the scraper config
    async def run(self, context: ScraperContext) -> str:
        params = DemoScraperParams.parse_obj(context.params)
        # Perform GET request to the URL defined in the scraper config
        response = await context.get(params.url)
        response_body = await response.text()

        # Perform some business logic on a response
        result = await self.process_page(response_body)

        # Return meaningful job summary - must return a string
        return json.dumps({
            "processed_urls": 1,
            "found_results": len(result),
        })
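
In the example above, process_page is deliberately left as a stub. Here's a minimal sketch of what it could do (purely illustrative, the link-extraction regex and the standalone file are not part of Sneakpeek): pull hyperlinks out of the response body and return them as a list, so that the len(result) call in run has something to count.

# file: process_page_sketch.py - illustrative only, not part of Sneakpeek
import asyncio
import re


async def process_page(response: str) -> list[str]:
    # Extract all hyperlinks from the raw HTML. A real scraper would
    # likely use a proper HTML parser instead of a regular expression.
    return re.findall(r'href="([^"]+)"', response)


if __name__ == "__main__":
    html = '<a href="https://example.com">example</a>'
    print(asyncio.run(process_page(html)))  # ['https://example.com']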

Now that we have some scraper logic, let's make it run periodically. To do so, let's configure the SneakpeekServer:

# file: main.py

from sneakpeek.models import Scraper, ScraperJobPriority, ScraperSchedule
from sneakpeek.storage.in_memory_storage import (
    InMemoryLeaseStorage,
    InMemoryScraperJobsStorage,
    InMemoryScrapersStorage,
)
from sneakpeek.logging import configure_logging
from sneakpeek.plugins.requests_logging_plugin import RequestsLoggingPlugin
from sneakpeek.scraper_config import ScraperConfig
from sneakpeek.server import SneakpeekServer

from demo_scraper import DemoScraper

# For now let's use a static list of scrapers,
# but this could just as well be a dynamic list
# stored in some SQL DB (a dynamic variant is
# sketched at the end of this section)
scrapers = [
    Scraper(
        # Unique ID of the scraper
        id=1,
        # Name of the scraper
        name="Demo Scraper",
        # How frequently the scraper should be executed
        schedule=ScraperSchedule.EVERY_MINUTE,
        # Our handler name
        handler="demo_scraper",
        # Scraper config, note that params must be successfully
        # deserialized into the `DemoScraperParams` class
        config=ScraperConfig(params={"url": "http://google.com"}),
        # Priority of the periodic scraper jobs.
        # Note that manually invoked jobs are always
        # scheduled with `UTMOST` priority
        schedule_priority=ScraperJobPriority.UTMOST,
    )
]

# Define a storage for the list of scrapers
scrapers_storage = InMemoryScrapersStorage(scrapers)

# Define a jobs storage to use
jobs_storage = InMemoryScraperJobsStorage()

# Define a lease storage for the scheduler to ensure
# that at any point in time there's only one active scheduler.
# This prevents the same scraper from being executed concurrently
lease_storage = InMemoryLeaseStorage()

# Configure server
server = SneakpeekServer.create(
    # List of implemented scraper handlers
    handlers=[DemoScraper()],
    scrapers_storage=scrapers_storage,
    jobs_storage=jobs_storage,
    lease_storage=lease_storage,

    # List of plugins which will be invoked before a request
    # is dispatched or after a response is received.
    # In the example we use `sneakpeek.plugins.requests_logging_plugin.RequestsLoggingPlugin`
    # which logs all requests and responses being made
    plugins=[RequestsLoggingPlugin()],
)

if __name__ == "__main__":
    configure_logging()
    # Run server (spawns scheduler, API and worker)
    # open http://localhost:8080 and explore UI
    server.serve()

Now, the only thing left is to actually run the server:

python main.py

That's it! Now you can open http://localhost:8080 and explore the UI to see how your scraper is being automatically scheduled and executed.
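
As the comment in main.py notes, the scrapers list doesn't have to be static. Here's a minimal sketch of building it programmatically (the URLs below are placeholders; the fields are the same ones used in the example above):

from sneakpeek.models import Scraper, ScraperJobPriority, ScraperSchedule
from sneakpeek.scraper_config import ScraperConfig

# Placeholder URLs - in practice these could come from a database or a file
urls = ["https://example.com/page-1", "https://example.com/page-2"]

scrapers = [
    Scraper(
        id=index,
        name=f"Demo Scraper #{index}",
        schedule=ScraperSchedule.EVERY_MINUTE,
        handler="demo_scraper",
        config=ScraperConfig(params={"url": url}),
        schedule_priority=ScraperJobPriority.UTMOST,
    )
    for index, url in enumerate(urls, start=1)
]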

Local handler testing

You can easily test a handler without running a full-featured server. Here's how to do that for the DemoScraper we developed in the Quick start.

Add an import at the beginning of the file:

from sneakpeek.runner import LocalRunner

And add the following lines to the end of the file:

if __name__ == "__main__":
    LocalRunner.run(
        DemoScraper(),
        ScraperConfig(
            params=DemoScraperParams(
                url="http://google.com",
            ).dict(),
        ),
        plugins=[
            RequestsLoggingPlugin(),
        ],
    )

LocalRunner.run takes the following arguments:

  1. An instance of your scraper handler
  2. Scraper config
  3. [Optional] List of plugins that will be used in the handler (see the full list of plugins here)

Now you can run your handler as an ordinary Python script. Given it's in the demo_scraper.py file, you can use:

python demo_scraper.py

Documentation

For the full documentation, please visit sneakpeek-py.readthedocs.io

Contributing

Please take a look at our contributing guidelines if you're interested in helping!

Future plans

  • Support for developing and executing scrapers right in the browser
  • Plugins for headful and headless browser engines (Selenium and Playwright)
  • SQL storage implementation
  • Advanced monitoring for the scrapers' health



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sneakpeek_py-0.2.2.tar.gz (5.6 MB)

Built Distribution

sneakpeek_py-0.2.2-py3-none-any.whl (5.8 MB)

File details

Details for the file sneakpeek_py-0.2.2.tar.gz.

File metadata

  • Download URL: sneakpeek_py-0.2.2.tar.gz
  • Upload date:
  • Size: 5.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for sneakpeek_py-0.2.2.tar.gz
  • SHA256: 52708b1dc5ae46b23aa031fbda96bb9b70e49cf33bd979b3fddf7064f9a8c2d6
  • MD5: 4ab199db9c0a438a8a1772b2df62db50
  • BLAKE2b-256: 5b3da7444dc1e995fad5a00930708862d7e5ba953b9cfe743b124c3e826b00bc

See more details on using hashes here.
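
If you want to check a downloaded archive against the digests listed above, here's a minimal sketch using only the standard library (it assumes the archive is in the current working directory):

# Verify the downloaded sdist against the published SHA256 digest
import hashlib

EXPECTED_SHA256 = "52708b1dc5ae46b23aa031fbda96bb9b70e49cf33bd979b3fddf7064f9a8c2d6"

with open("sneakpeek_py-0.2.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual == EXPECTED_SHA256 else "MISMATCH")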

File details

Details for the file sneakpeek_py-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: sneakpeek_py-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 5.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for sneakpeek_py-0.2.2-py3-none-any.whl
  • SHA256: 98c0fdd209387916a3ffd8734e2b09548379b0ebd5b6fd12e17d0c30b63ebb38
  • MD5: 212a349d84598ec608f6dc75dfc06078
  • BLAKE2b-256: 8016e0adb6bd5df767f0677d2fd49454f84bf3113bf4307a386870b3c461b140

See more details on using hashes here.
