A lightweight static scraping library in pure Python

Harvester: An easy-to-use Web Scraping tool.

Harvester is a lightweight, pure Python library designed for straightforward web scraping without external dependencies.

Features

  • Pure Python: No third-party dependencies required.
  • Model-Field structure: Define scraping targets using a clear, class-based approach.
  • Flexible parsing: Use Python's standard libraries to parse and extract data.
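To give a feel for what parsing with only the standard library can look like, here is a minimal sketch built on `html.parser`. It is not Harvester's implementation: the names `TextBySelector` and `select_text` are hypothetical, and the sketch only understands simple `tag.class` selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextBySelector(HTMLParser):
    """Collect the text of the first element matching a 'tag.class' selector.

    Hypothetical helper for illustration; not part of Harvester's API.
    """

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.css_class = selector.partition(".")
        self._depth = 0       # greater than 0 while inside the matched element
        self._chunks = []     # text fragments found inside the match
        self._done = False    # True once the first match has been closed

    def handle_starttag(self, tag, attrs):
        if self._done or tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside the match
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.css_class or self.css_class in classes):
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks).strip()


def select_text(html, selector):
    """Return the stripped text of the first element matching the selector."""
    parser = TextBySelector(selector)
    parser.feed(html)
    return parser.text


html = '<h1 class="product-name">Example Product</h1><span class="product-price">100</span>'
print(select_text(html, "h1.product-name"))    # Example Product
print(select_text(html, "span.product-price")) # 100
```

The sketch trades generality for brevity: it assumes reasonably well-formed HTML and stops at the first match.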

Installation

Installing via pip:

pip install harvester

Or directly from the source code:

pip install git+https://github.com/blazaid/harvester

Requirements

Harvester requires Python 3.8 or later and has no mandatory external dependencies. Some features can make use of the optional chardet library; if chardet is not installed, those features are skipped with a warning.
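The optional-dependency behaviour described above is commonly implemented with a guarded import. The following is a sketch of that general pattern, not Harvester's actual code; the function name `decode_bytes` and the UTF-8 fallback are assumptions made for illustration.

```python
import warnings

try:
    import chardet  # optional third-party package
except ImportError:
    chardet = None


def decode_bytes(raw, fallback="utf-8"):
    """Decode raw bytes, guessing the encoding with chardet when it is available.

    Hypothetical helper for illustration; not part of Harvester's API.
    """
    if chardet is None:
        # Without chardet, skip detection and warn instead of failing.
        warnings.warn(
            "chardet is not installed; skipping encoding detection",
            RuntimeWarning,
        )
        encoding = fallback
    else:
        # chardet.detect returns a dict whose "encoding" entry may be None.
        encoding = chardet.detect(raw).get("encoding") or fallback
    return raw.decode(encoding, errors="replace")
```

Calling `decode_bytes(b"hello")` returns a `str` whether or not chardet is installed; only the warning differs.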

Usage

Define your data models by subclassing Model and specifying fields:

from harvester import Model, StringField, IntegerField

class Product(Model):
    name = StringField()
    price = IntegerField()

Parse the HTML content and extract data using the model:

from harvester import parse_html

html_content = """
<html>
<body>
    <h1 class="product-name">Example Product</h1>
    <span class="product-price">100</span>
</body>
</html>
"""

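# Map each model field to the selector that locates its value in the HTML.
mapping = {
    "name": "h1.product-name",
    "price": "span.product-price"
}

# Build a Product instance from the HTML using the mapping above.
product = parse_html(html_content, Product, mapping=mapping)
print(product.to_dict())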

This will output:

{'name': 'Example Product', 'price': 100}
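Note that the price comes back as the integer 100 rather than the string "100". The sketch below shows one way a Model-Field design can perform that kind of conversion in pure Python. It is an illustration under assumptions, not Harvester's source: the classes `Record`, `TextField`, and `IntField` are hypothetical stand-ins.

```python
class BaseField:
    """Converts raw extracted text into a Python value."""

    def convert(self, raw):
        return raw


class TextField(BaseField):
    def convert(self, raw):
        return raw.strip()


class IntField(BaseField):
    def convert(self, raw):
        return int(raw.strip())


class Record:
    """Hypothetical base class that collects the fields declared on subclasses."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Gather the field objects declared in the subclass body, in order.
        cls._fields = {
            name: value
            for name, value in vars(cls).items()
            if isinstance(value, BaseField)
        }

    def __init__(self, **raw_values):
        # Run each raw string through its field's converter.
        for name, field in self._fields.items():
            setattr(self, name, field.convert(raw_values[name]))

    def to_dict(self):
        return {name: getattr(self, name) for name in self._fields}


class Item(Record):
    name = TextField()
    price = IntField()


item = Item(name=" Example Product ", price="100")
print(item.to_dict())  # {'name': 'Example Product', 'price': 100}
```

Keeping conversion logic in the field classes means a model only has to declare what it wants, and each field decides how to turn text into a value.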

Documentation

Comprehensive documentation is forthcoming and will be available on Read the Docs. In the meantime, the source code is the best place to find information.

Contributing

Contributions are welcome! Please review the issues for current topics and feel free to submit pull requests. Also make sure to read the contributing guidelines to get started.

License

Harvester is licensed under the GNU General Public License v3.0. See the LICENSE file for detailed information.


Download files

Source Distribution

harvester-0.5.2.tar.gz (55.0 kB)

Uploaded Source

Built Distribution

harvester-0.5.2-py3-none-any.whl (45.1 kB)

Uploaded Python 3

File details

Details for the file harvester-0.5.2.tar.gz.

File metadata

  • Download URL: harvester-0.5.2.tar.gz
  • Upload date:
  • Size: 55.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for harvester-0.5.2.tar.gz
Algorithm    Hash digest
SHA256       22631342adc949784832a64b0e09d35f5092022f52b334ea9f5bb09a05b78eb1
MD5          3d7bb8daedd09f6df71f57a285c42c33
BLAKE2b-256  cebc0f1728eea12ec54fe771fdddfb7d2a8fe18b604bce7befd21f5e89e4589e

See more details on using hashes here.

File details

Details for the file harvester-0.5.2-py3-none-any.whl.

File metadata

  • Download URL: harvester-0.5.2-py3-none-any.whl
  • Upload date:
  • Size: 45.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for harvester-0.5.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       855c6246c6d53758928677f9fd8f78423c03e762b2ee8ac286ca8baf5992e2d8
MD5          ce39a3d5b8e75f100adb24d53ecc2f0c
BLAKE2b-256  54c49521247054297fd50d6bc3cf15abbdf20bcf9ece23636656a7b855738556
