Declarative Scraping Library

LST

Declarative web scraping library built on top of monapy.

Describe scraping as a chain of data transformations — not control flow.

Installation

pip install lst

Quick Example

from lst import Fetch, Scan

# Define the scraper chain
parser = (
    Fetch()
    << Scan('.pagination a')  # Feedback loop: found links are sent back to Fetch
    >> Scan('.article a')     # Extract article links
    >> Fetch()                # Fetch each article page
    >> Scan('.title').text()  # Extract title text
)

for title in parser('https://example.com'):
    print(title)

Configuration and Argument Priority

lst passes configuration through the entire chain as keyword arguments (**kwargs). Options can be set at two levels:

  • Global Arguments: Passed when you call the parser (e.g., parser(url, timeout=10)). These flow through all steps.
  • Step Arguments: Passed directly to a step's constructor (e.g., Fetch(timeout=5)).
  • Priority: Step-specific arguments have higher priority. A constructor argument overrides a global argument if both are provided.
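
The priority rule above can be pictured as a dictionary merge in which step-level values are applied last. This is only a sketch of the rule, not lst's actual code; the helper name resolve_options is hypothetical.

```python
def resolve_options(global_kwargs, step_kwargs):
    """Merge options so that step-level values win over global ones.

    Hypothetical helper: it illustrates the documented priority rule only.
    """
    # Later entries overwrite earlier ones, so step_kwargs takes precedence.
    return {**global_kwargs, **step_kwargs}


# Global arguments, as in parser(url, timeout=10, verify=False)
global_kwargs = {"timeout": 10, "verify": False}
# Step arguments, as in Fetch(timeout=5)
step_kwargs = {"timeout": 5}

print(resolve_options(global_kwargs, step_kwargs))
# {'timeout': 5, 'verify': False}
```

Here timeout comes from the step, while verify falls through from the global call because the step does not set it.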

Core Components

Fetch

Performs HTTP requests and handles link extraction.

  • Inputs: Accepts a URL string or a bs4.Tag (it automatically extracts href from <a> tags).
  • Arguments: Supports all standard requests.request arguments (headers, proxies, params, etc.), except for url which is provided by the chain.
  • User Agent: The user_agent parameter can be a string or a callable. A callable may either accept the current URL as its single argument or take no arguments at all.
  • Outputs: Yields requests.Response objects.
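
One plausible way to support both callable shapes for user_agent is to inspect the callable's signature before invoking it. The sketch below uses only the standard library; resolve_user_agent and per_host_agent are hypothetical names, and lst may resolve the value differently.

```python
import inspect


def resolve_user_agent(user_agent, url):
    """Return a User-Agent string from a plain string or a callable.

    Hypothetical helper, not part of lst.
    """
    if not callable(user_agent):
        return user_agent
    # Pass the URL only if the callable declares at least one parameter.
    params = inspect.signature(user_agent).parameters
    return user_agent(url) if params else user_agent()


def per_host_agent(url):
    # Example callable that picks an agent based on the target URL.
    return "mobile-bot/1.0" if "m.example.com" in url else "desktop-bot/1.0"


print(resolve_user_agent("static-bot/1.0", "https://example.com"))
print(resolve_user_agent(lambda: "no-arg-bot/1.0", "https://example.com"))
print(resolve_user_agent(per_host_agent, "https://m.example.com/news"))
```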

Error Handling (on_error)

You can handle request failures using on_error (in the constructor) or fetch_on_error (as a global argument). The handler receives (exception, request, session).

A handler has exactly two options:

  • Recover: Return a requests.Response instance. You can use the provided request and session to retry or return a fallback.
  • Abort: Raise an exception to stop execution.
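
The two outcomes can be sketched as a pair of handlers with the (exception, request, session) signature described above. The sketch assumes that request is an object session.send accepts (for a requests.Session, a prepared request); that detail is not stated in this document, so treat it as an assumption.

```python
def retry_once(exception, request, session):
    """Recover: resend the same request one more time.

    Assumption: `request` can be passed to `session.send`. If the retry
    fails too, its exception simply propagates and aborts the chain.
    """
    return session.send(request)


def abort(exception, request, session):
    """Abort: stop execution by raising, keeping the original error as the cause."""
    raise RuntimeError(f"giving up after: {exception}") from exception
```

Following the text above, such a handler would be supplied as Fetch(on_error=retry_once) for a single step, or as parser(url, fetch_on_error=retry_once) for the whole chain.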

Scan

Parses HTML content using BeautifulSoup.

  • Inputs: Accepts requests.Response, bs4.Tag, str, or bytes.
  • Outputs: By default, it yields bs4.Tag objects.
  • Selectors: Supports CSS selectors or custom filter functions.

Transformations and Types

Each Scan accepts at most one terminal transformation. Applying a transformation changes the type of value passed to the next step:

  • No transformation: Produces bs4.Tag instances.
  • .text(): Produces a string (inner text of the element). Supports separator and strip arguments.
  • .attr(name): Produces the value of the specified attribute.

Note: Once a transformation is applied, the chain passes strings/values. You cannot follow up with another Scan that expects HTML tags.

Operators

  • >> (Forward Bind): Passes produced values to the next step.
  • << (Feedback Bind): Sends values back to a previous step, enabling recursion and pagination.
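
To make the two operators concrete, here is a toy model built on Python's __rshift__ and __lshift__ hooks. It is not lst or monapy code: the Step class, the in-memory PAGES "site", and the seen set used to skip repeated values are all assumptions made for illustration.

```python
class Step:
    """Toy stand-in for a chain step: maps one input value to many outputs."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, value):
        yield from self.fn(value)

    def __rshift__(self, other):
        # Forward bind: every value produced by self is passed to `other`.
        def forward(value):
            for produced in self(value):
                yield from other(produced)
        return Step(forward)

    def __lshift__(self, other):
        # Feedback bind: values produced by `other` are sent back into self.
        def feedback(value):
            pending, seen = [value], {value}
            while pending:
                for produced in self(pending.pop(0)):
                    yield produced
                    for fed_back in other(produced):
                        if fed_back not in seen:  # assumed de-duplication
                            seen.add(fed_back)
                            pending.append(fed_back)
        return Step(feedback)


# A fake three-page site standing in for HTTP responses.
PAGES = {
    "page1": {"next": ["page2"], "articles": ["a", "b"]},
    "page2": {"next": ["page3"], "articles": ["c"]},
    "page3": {"next": [], "articles": ["d"]},
}

fetch = Step(lambda name: [PAGES[name]])         # "download" a page
next_links = Step(lambda page: page["next"])     # pagination links
articles = Step(lambda page: page["articles"])   # article links

chain = fetch << next_links >> articles
print(list(chain("page1")))  # ['a', 'b', 'c', 'd']
```

Because << and >> share the same precedence and associate left to right, fetch << next_links is grouped first. The loop therefore sits around the fetch step, and each fetched page also continues forward to articles.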

Under the Hood: The Iterative Model

The library's design is based on the monapy execution model, which dictates how data moves through the chain:

  • Iterator-Based Execution: Everything in lst works on iterators. Each step is a generator that receives a value and yields new values.

  • Lazy Evaluation: The chain is "lazy." No requests are made or HTML parsed until you actually iterate over the parser (e.g., in a for loop).

  • Memory Efficiency: Because it uses generators, lst processes items one at a time. This lets you scrape large amounts of data without holding all intermediate results in memory at once.

  • Continuous Flow: In a feedback loop (<<), execution continues automatically until no more new values are produced by any step in the cycle.
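
Lazy, one-at-a-time execution is ordinary Python generator behavior and can be observed without lst. In this sketch, fake_fetch and extract_titles are hypothetical stand-ins for a request step and a parsing step; the log list records when each one actually runs.

```python
from itertools import islice

log = []


def fake_fetch(urls):
    for url in urls:
        log.append(f"fetch {url}")  # stands in for an HTTP request
        yield f"<title>{url}</title>"


def extract_titles(pages):
    for page in pages:
        log.append("parse")  # stands in for HTML parsing
        yield page.removeprefix("<title>").removesuffix("</title>")


urls = (f"https://example.com/{i}" for i in range(1_000_000))
pipeline = extract_titles(fake_fetch(urls))

print(len(log))  # 0: building the pipeline does no work

first_two = list(islice(pipeline, 2))
print(first_two)  # ['https://example.com/0', 'https://example.com/1']
print(log)        # fetch, parse, fetch, parse: only two items were processed
```

Although the input describes a million URLs, only the two that were consumed are ever "fetched", and each one is parsed before the next is requested.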

For more details on the underlying principles, refer to the monapy documentation.

License

MIT
