python-dataservice

Lightweight async data gathering for Python

These details have not been verified by PyPI

Project links

Documentation

Project description

DataService is a lightweight data gathering library for Python.

Designed for simplicity, it’s built upon common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Async implementation, sync interface.
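
The idea of an async implementation behind a sync interface can be sketched with the standard library alone. This is a minimal illustration, not DataService's actual implementation: `fetch` and `SyncFacade` are hypothetical names, and the "network call" is simulated.

```python
import asyncio
from collections.abc import Iterable, Iterator


async def fetch(item: int) -> dict:
    """Hypothetical stand-in for an async network call."""
    await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
    return {"item": item, "square": item * item}


class SyncFacade:
    """Hypothetical iterable: synchronous to the caller, asynchronous inside."""

    def __init__(self, items: Iterable[int]) -> None:
        self._items = list(items)

    async def _gather(self) -> list[dict]:
        # Run all coroutines concurrently; gather keeps results in input order.
        return await asyncio.gather(*(fetch(i) for i in self._items))

    def __iter__(self) -> Iterator[dict]:
        # asyncio.run starts an event loop, blocks until the work is done, then closes it.
        return iter(asyncio.run(self._gather()))


# The caller never touches async/await.
results = tuple(SyncFacade([1, 2, 3]))
```

One consequence of this design is that `asyncio.run` cannot be called from code that is already running inside an event loop, so a facade like this one is meant for plain synchronous scripts.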

Installation

You can install DataService via pip:

pip install python-dataservice

Note that this initial version requires Python 3.12 or higher. I aim to support older versions of Python in future releases.

How to use DataService

To start, create a DataService instance with an Iterable of Request objects. This gives you an Iterator of data objects, which you can iterate over or convert to a list, a tuple, a pd.DataFrame or any other data structure of your choice.

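# Import path is an assumption: the package is taken to expose these names as `dataservice`.
from dataservice import DataService, HttpXClient, Request
# parse_books_page is the callback defined in the example further down.
start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
data_service = DataService(start_requests)
data = tuple(data_service)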

A Request is a Pydantic model that includes the URL to fetch, a reference to the client callable, and a callback function for parsing the Response object.

The client can be any Python callable that accepts a Request object and returns a Response object. DataService provides an HttpXClient class, which is based on the httpx library, but you are free to use your own custom async client.
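
As a rough sketch of what a custom async client could look like, the example below serves canned pages instead of doing network I/O, which can be handy for testing callbacks offline. The `Request` and `Response` dataclasses here are simplified stand-ins, not the library's own models (whose exact fields are not shown in this description), and `StaticClient` is a hypothetical name.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Request:
    """Simplified stand-in for the library's Request model."""
    url: str


@dataclass
class Response:
    """Simplified stand-in for the library's Response model."""
    request: Request
    text: str


class StaticClient:
    """Hypothetical async client: a callable that maps a Request to a Response."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def __call__(self, request: Request) -> Response:
        await asyncio.sleep(0)  # placeholder for real async I/O
        return Response(request=request, text=self._pages[request.url])


client = StaticClient({"https://example.com/": "<html>hello</html>"})
response = asyncio.run(client(Request(url="https://example.com/")))
```

The only contract the sketch relies on is the one stated above: the client is awaited with a Request and hands back a Response that keeps a reference to that Request.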

The callback function processes a Response object and returns either data or additional Request objects.

In this trivial example we request the Books to Scrape homepage and count the books listed on the page.

Example parse_books_page function:

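# Import path is an assumption: `Response` is taken to be importable from `dataservice`.
from dataservice import Response


def parse_books_page(response: Response):
    """Return the page URL, its title and the number of book entries found."""
    # Each book on the page is rendered as an <article class="product_pod"> element.
    articles = response.html.find_all("article", {"class": "product_pod"})
    return {
        "url": response.request.url,
        "title": response.html.title.get_text(strip=True),
        "articles": len(articles),
    }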

This function takes a Response object, which has an html attribute (a BeautifulSoup object built from the HTML content). It parses the HTML and returns the extracted data as a dict.

The callback function can return or yield either data (dict or pydantic.BaseModel) or more Request objects.
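
To illustrate the pattern of a callback that yields both data and follow-up requests, here is a self-contained sketch with a toy, sequential dispatcher. Everything in it is an assumption made for illustration: the `Request` and `Response` dataclasses are simplified stand-ins, `SITE` is a fake two-page website, and `crawl` is a hypothetical loop, not the scheduler DataService uses.

```python
from collections import deque
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Request:
    """Simplified stand-in for the library's Request model."""
    url: str
    callback: Callable


@dataclass
class Response:
    """Simplified stand-in: `links` plays the role of parsed page content."""
    request: Request
    links: list[str]


# Fake website: each page maps to the links found on it.
SITE = {"page-1": ["page-2"], "page-2": []}


def parse_page(response: Response):
    # Yield one data item for the page itself...
    yield {"url": response.request.url, "links": len(response.links)}
    # ...and one new Request per link, handled by the same callback.
    for link in response.links:
        yield Request(url=link, callback=parse_page)


def crawl(start_requests: Iterable[Request]) -> Iterator[dict]:
    """Toy dispatcher: queue Requests, pass data items through to the caller."""
    queue = deque(start_requests)
    while queue:
        request = queue.popleft()
        response = Response(request=request, links=SITE[request.url])
        for item in request.callback(response):
            if isinstance(item, Request):
                queue.append(item)
            else:
                yield item


data = list(crawl([Request(url="page-1", callback=parse_page)]))
```

Here `data` ends up holding one dict per page visited. The callback in this sketch is always a generator; a dispatcher that also accepts plain return values would need to wrap those before iterating.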

If you have used Scrapy before, you will find this pattern familiar.

For more examples and advanced usage, check out the examples section.

For a detailed API reference, check out the modules section.

Project details

Release history

Version	Release date
0.0.15 (this version)	Sep 20, 2024
0.0.14	Sep 17, 2024
0.0.13	Sep 17, 2024
0.0.12	Sep 17, 2024
0.0.11	Sep 17, 2024
0.0.1	Sep 17, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python_dataservice-0.0.15.tar.gz (13.7 kB)

Uploaded Sep 20, 2024 Source

Built Distribution

python_dataservice-0.0.15-py3-none-any.whl (16.4 kB)

Uploaded Sep 20, 2024 Python 3

Hashes for python_dataservice-0.0.15.tar.gz
Algorithm	Hash digest
SHA256	`101a8aa2f673adc2b14d33fdde41b4bda1e988bd24a94b01079dca00313b84c5`
MD5	`3267202ed4df1571a78c7dc1a07cfc0a`
BLAKE2b-256	`47daa3fd2fdc44369a0607c3ee28c6635db1ead050876740717909d5fbaf9103`

Hashes for python_dataservice-0.0.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`885ceeefaede00c727b5294c3e59de3aa5414734380f883d1d0b2ae247a90d92`
MD5	`599670a78075f47c014114546756e20d`
BLAKE2b-256	`ea4f75c339fe3eb1546070b2f11c54044ffedc92008a09b5683b65b08eea6923`