DataService

Lightweight - async - data gathering for Python.

DataService is a lightweight web scraping and general purpose data gathering library for Python.

Designed for simplicity, it’s built upon common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Dual synchronous and asynchronous support.

Installation

You can install DataService via pip:

pip install python-dataservice

Please note that DataService requires Python 3.11 or higher.

If you want to use PlaywrightClient, you will also need to install the browser binaries that Playwright drives:

python -m playwright install

or simply:

playwright install

How to use DataService

To start, create a DataService instance with an Iterable of Request objects. This gives you an Iterator of data objects that you can loop over or convert to a list, a tuple, a pd.DataFrame or any other data structure of your choice.

start_requests = [
    Request(
        url="https://books.toscrape.com/index.html",
        callback=parse_books_page,
        client=HttpXClient(),
    )
]
data_service = DataService(start_requests)
data = tuple(data_service)
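Because the result is a plain iterator, ordinary standard-library tools apply to it. The sketch below does not use DataService itself: `fake_results` is a hypothetical stand-in generator so that the example runs offline. It shows taking the first few items lazily and then collecting the remainder into a dict keyed by URL.

```python
from itertools import islice


def fake_results():
    """Hypothetical stand-in for a DataService iterator; yields dicts lazily."""
    for page in range(1, 6):
        yield {"url": f"https://example.com/page-{page}", "articles": 20}


results = fake_results()

# Take only the first two items; the remaining ones are not produced yet.
first_two = list(islice(results, 2))

# Consume what is left into a dict keyed by URL.
by_url = {item["url"]: item for item in results}
```

An iterator is single-pass, so items taken by `islice` do not appear again in `by_url`. Materialise the results once (for example with `tuple`) if you need to traverse them several times.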

A Request is a Pydantic model that includes the URL to fetch, a reference to the client callable, and a callback function for parsing the Response object.

The client can be any Python callable that accepts a Request object and returns a Response object. DataService provides an HttpXClient class, which is based on the httpx library, but you are free to use your own custom async client.
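As a rough illustration of the "any callable" contract, here is a sketch of an async client that serves pre-recorded HTML, which can be handy for testing callbacks offline. `FakeRequest`, `FakeResponse` and `CannedClient` are hypothetical stand-ins defined here so the snippet is self-contained; the real Request and Response models carry more fields, so treat this as the shape of the idea rather than a drop-in client.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Stand-in for the library's Request; only the URL is modelled."""
    url: str


@dataclass
class FakeResponse:
    """Stand-in for the library's Response; only URL and body text."""
    url: str
    text: str


class CannedClient:
    """Async callable that returns stored HTML instead of using the network."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    async def __call__(self, request: FakeRequest) -> FakeResponse:
        await asyncio.sleep(0)  # hand control back to the event loop once
        return FakeResponse(url=request.url, text=self.pages[request.url])


client = CannedClient({"https://example.com": "<title>Example</title>"})
response = asyncio.run(client(FakeRequest(url="https://example.com")))
```

Implementing `__call__` on a class, rather than writing a bare function, leaves room for per-client state such as headers, rate limits or, as here, a table of canned pages.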

The callback function processes a Response object and returns either data or additional Request objects.

In this trivial example we request the Books to Scrape homepage and count the books listed on the page.

Example parse_books_page function:

def parse_books_page(response: Response):
    articles = response.html.find_all("article", {"class": "product_pod"})
    return {
        "url": response.url,
        "title": response.html.title.get_text(strip=True),
        "articles": len(articles),
    }

This function takes a Response object, which has an html attribute (a BeautifulSoup object of the HTML content). The function parses the HTML content and returns data.

The callback function can return or yield either data (dict or pydantic.BaseModel) or more Request objects.

If you have used Scrapy before, you will find this pattern familiar.
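The pattern described above, where a callback yields both data and follow-up requests, can be sketched with a toy synchronous driver. Everything here is a hypothetical stand-in (`Req`, `Resp`, `fake_fetch`, `run`) written so the example runs without the network. It is not how DataService schedules work internally; it only shows how yielded items can be sorted into data and further requests.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Req:
    """Stand-in for Request: a URL plus the callback for its response."""
    url: str
    callback: Callable


@dataclass
class Resp:
    """Stand-in for Response: a URL and a page number instead of HTML."""
    url: str
    page: int


def parse_page(response: Resp):
    # Yield one data item for this page...
    yield {"url": response.url, "page": response.page}
    # ...and, until page 3, a follow-up request for the next page.
    if response.page < 3:
        next_url = f"https://example.com/page-{response.page + 1}"
        yield Req(url=next_url, callback=parse_page)


def fake_fetch(request: Req) -> Resp:
    """Hypothetical client: reads the page number from the URL, no network."""
    return Resp(url=request.url, page=int(request.url.rsplit("-", 1)[1]))


def run(start: list[Req]) -> Iterator[dict]:
    """Toy driver: data items are yielded, Req objects are queued."""
    queue = deque(start)
    while queue:
        request = queue.popleft()
        for item in request.callback(fake_fetch(request)):
            if isinstance(item, Req):
                queue.append(item)
            else:
                yield item


data = list(run([Req(url="https://example.com/page-1", callback=parse_page)]))
```

Starting from a single request for page 1, the callback chains through to page 3, so `data` ends up with one dict per page.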

For more examples and advanced usage, check out the examples section.

For a detailed API reference, check out the modules section.
