Declarative Scraping Library
LST
Declarative web scraping library built on top of monapy.
Describe scraping as a chain of data transformations — not control flow.
Installation
pip install lst
Quick Example
from lst import Fetch, Scan

# Define the scraper chain
parser = (
    Fetch()
    << Scan('.pagination a')   # Feedback loop: found links are sent back to Fetch
    >> Scan('.article a')      # Extract article links
    >> Fetch()                 # Fetch each article page
    >> Scan('.title').text()   # Extract title text
)

for title in parser('https://example.com'):
    print(title)
Configuration and Argument Priority
lst uses a flexible configuration system that passes through the entire chain via **kwargs.
- Global Arguments: Passed when you call the parser (e.g., parser(url, timeout=10)). These flow through all steps.
- Step Arguments: Passed directly to a step's constructor (e.g., Fetch(timeout=5)).
- Priority: Step-specific arguments have higher priority. A constructor argument overrides a global argument if both are provided.
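The priority rule can be pictured as a dictionary merge in which step-level values win. The following is a minimal sketch of that idea, not lst's actual code; the helper name effective_kwargs and the sample values are assumptions made for illustration.

```python
def effective_kwargs(global_kwargs, step_kwargs):
    """Hypothetical helper: merge two keyword-argument dicts.

    Later entries in a dict display win, so step-level values
    override global ones when both define the same key.
    """
    return {**global_kwargs, **step_kwargs}


# Arguments as they might be given to parser(url, timeout=10, headers=...)
global_kwargs = {"timeout": 10, "headers": {"Accept": "text/html"}}
# Arguments as they might be given to Fetch(timeout=5)
step_kwargs = {"timeout": 5}

merged = effective_kwargs(global_kwargs, step_kwargs)
print(merged)  # {'timeout': 5, 'headers': {'Accept': 'text/html'}}
```

The global headers still reach the step, while the constructor's timeout replaces the global one.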
Core Components
Fetch
Performs HTTP requests and handles link extraction.
- Inputs: Accepts a URL string or a bs4.Tag (it automatically extracts href from <a> tags).
- Arguments: Supports all standard requests.request arguments (headers, proxies, params, etc.), except for url, which is provided by the chain.
- User Agent: The user_agent parameter can be a string or a callable. If it is a callable, it can either take the current url as an argument or take no arguments.
- Outputs: Yields requests.Response objects.
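One way a step could support both callable forms of user_agent is to inspect the callable's signature before calling it. This is only a sketch of that technique under stated assumptions: resolve_user_agent is a hypothetical helper, and lst may resolve the value differently.

```python
import inspect


def resolve_user_agent(user_agent, url):
    """Hypothetical helper: turn a string or callable into a User-Agent string."""
    if not callable(user_agent):
        return user_agent  # plain string: use as-is

    # Check whether the callable can accept one positional argument.
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.VAR_POSITIONAL,
    )
    params = inspect.signature(user_agent).parameters.values()
    accepts_url = any(p.kind in positional_kinds for p in params)

    return user_agent(url) if accepts_url else user_agent()


print(resolve_user_agent("my-bot/1.0", "https://example.com"))           # my-bot/1.0
print(resolve_user_agent(lambda: "static-agent", "https://example.com"))  # static-agent
print(resolve_user_agent(lambda url: f"bot for {url}", "https://example.com"))
# bot for https://example.com
```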
Error Handling (on_error)
You can handle request failures using on_error (in the constructor) or fetch_on_error (as a global argument). The handler receives (exception, request, session).
There are only two ways to handle an error:
- Recover: Return a requests.Response instance. You can use the provided request and session to retry or return a fallback.
- Abort: Raise an exception to stop execution.
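A handler that covers both outcomes might look like the sketch below. It assumes that request is a prepared request that session.send accepts, as with requests.Session.send; the source does not state the exact type, so treat that call and the 30-second timeout as assumptions. The name retry_once is illustrative.

```python
def retry_once(exception, request, session):
    """Illustrative on_error handler: retry the request a single time.

    Recover: return the response of the second attempt.
    Abort:   re-raise if the second attempt fails as well.
    """
    try:
        # Assumption: `request` can be re-sent through the same session.
        return session.send(request, timeout=30)
    except Exception as retry_error:
        # Keep the original failure attached as the cause.
        raise retry_error from exception


# Possible wiring, following the two options described above:
#   Fetch(on_error=retry_once)
#   parser('https://example.com', fetch_on_error=retry_once)
```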
Scan
Parses HTML content using BeautifulSoup.
- Inputs: Accepts requests.Response, bs4.Tag, str, or bytes.
- Outputs: By default, it yields bs4.Tag objects.
- Selectors: Supports CSS selectors or custom filter functions.
Transformations and Types
Scan supports one terminal transformation. Applying a transformation changes what is passed to the next step:
- No transformation: Produces bs4.Tag instances.
- .text(): Produces a string (inner text of the element). Supports separator and strip arguments.
- .attr(name): Produces the value of the specified attribute.
Note: Once a transformation is applied, the chain passes strings/values. You cannot follow up with another Scan that expects HTML tags.
Operators
- >> (Forward Bind): Passes produced values to the next step.
- << (Feedback Bind): Sends values back to a previous step, enabling recursion and pagination.
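To make the two operators concrete, here is a toy model built on plain generators. It is not the monapy or lst implementation: the Step class, the queue-based loop, and the seen set used to skip repeated values are all assumptions chosen to mirror the behavior described above.

```python
from collections import deque


class Step:
    """Toy step: wraps a function mapping one value to an iterable of values."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return iter(self.func(value))

    def __rshift__(self, other):
        # Forward bind: each value produced by self is passed to `other`.
        def forward(value):
            for produced in self(value):
                yield from other(produced)
        return Step(forward)

    def __lshift__(self, other):
        # Feedback bind: values produced by `other` are sent back into self.
        def feedback(value):
            pending = deque([value])
            seen = {value}  # assumption: repeated values are skipped
            while pending:
                current = pending.popleft()
                for produced in self(current):
                    yield produced
                    for fed_back in other(produced):
                        if fed_back not in seen:
                            seen.add(fed_back)
                            pending.append(fed_back)
        return Step(feedback)


# A fake three-page site: each page number lists the pages it links to.
next_links = {1: [2], 2: [3], 3: []}

fetch = Step(lambda page: [page])              # stand-in for Fetch()
pagination = Step(lambda page: next_links[page])
times_ten = Step(lambda page: [page * 10])     # stand-in for a later step

chain = fetch << pagination >> times_ten
print(list(chain(1)))  # [10, 20, 30]
```

Because << and >> share the same precedence and associate left to right in Python, the chain reads as (fetch << pagination) >> times_ten, which matches the order of steps in the Quick Example.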
Under the Hood: The Iterative Model
The library's design is based on the monapy execution model, which dictates how data moves through the chain:
- Iterator-Based Execution: Everything in lst works on iterators. Each step is a generator that receives a value and yields new values.
- Lazy Evaluation: The chain is "lazy." No requests are made or HTML parsed until you actually iterate over the parser (e.g., in a for loop).
- Memory Efficiency: Because it uses generators, lst processes items one by one. This allows you to scrape vast amounts of data without high memory consumption.
- Continuous Flow: In a feedback loop (<<), execution continues automatically until no more new values are produced by any step in the cycle.
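The lazy, one-item-at-a-time behavior comes from ordinary Python generators and can be observed without any network access. The sketch below uses made-up stand-ins (fetch, titles, and the requested list) rather than lst steps.

```python
requested = []  # records every simulated request


def fetch(urls):
    for url in urls:
        requested.append(url)      # stand-in for an HTTP request
        yield f"<h1>{url}</h1>"


def titles(pages):
    for page in pages:
        yield page.upper()         # stand-in for parsing


# Building the pipeline performs no work yet.
pipeline = titles(fetch(["/a", "/b", "/c"]))
print(requested)         # []

# Pulling one item triggers exactly one simulated request.
first = next(pipeline)
print(first, requested)  # <H1>/A</H1> ['/a']
```

Only the item currently being processed is held in memory; the remaining URLs are not touched until the consumer asks for the next value.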
For more details on the underlying principles, refer to the monapy documentation.
License
MIT