HTML data extraction library

These details have not been verified by PyPI

Project links

repository

Project description

Pickaxe

Pickaxe is a Python package for structured data extraction from HTML documents. It provides a simple API for parsing HTML documents and automatically extracting structured data from them.

Features

Written in Rust: Pickaxe is written in Rust, which makes it fast and memory-efficient.
Robust: Pickaxe uses the html5ever and selectors crates for browser-grade HTML parsing and CSS selector matching.
Data Maps: Pickaxe can automatically generate CSS selectors for structured data extraction using Data Maps.
CSS Selectors & XPath: Pickaxe supports both CSS selectors and (simple) XPath expressions for querying HTML documents.

Quick Start

Installation

pip install python-pickaxe

Basic Usage

from pickaxe import HtmlDocument

# Parse an HTML document
document = HtmlDocument.from_str("<html><body><h1>Hello, World!</h1></body></html>")

# Access elements using CSS selectors or XPath expressions
heading = document.find("h1")
print(heading.inner_text)  # Output: Hello, World!

heading = document.find_xpath("//h1")
print(heading.inner_text)  # Output: Hello, World!

Data Maps

Data Maps let Pickaxe automatically find the best (most concise) CSS selectors for an HTML document, based on sample values that you provide.

import asyncio

from httpx import AsyncClient
# DataMap and StructuredData are assumed to be importable from the top-level package
from pickaxe import Attribute, DataMap, HtmlDocument, StructuredData, generate_data_map


async def fetch_document(client: AsyncClient, url: str) -> HtmlDocument:
    # Download a page and parse it into an HtmlDocument
    response = await client.get(url)
    return HtmlDocument.from_str(response.text)


async def main() -> None:
    async with AsyncClient() as client:
        # We first generate a data map using a sample HTML document, and
        # examples of the data we want to extract
        sample = await fetch_document(client, "https://quotes.toscrape.com/author/Albert-Einstein/")

        # In this example we want to extract the name and birth date of the authors.
        # The HTML documents that are used as examples must have a corresponding
        # sample in each attribute, even if the expected value is None.
        # Note: You can specify the number of iterations to run the algorithm, but typically 1-3 is enough.
        data_map = generate_data_map(
            [sample],
            [
                Attribute("name", ["Albert Einstein"]),
                Attribute("birth_date", ["March 14, 1879"]),
            ],
        )

        # From this data map we can extract data from other HTML documents
        document = await fetch_document(client, "https://quotes.toscrape.com/author/J-K-Rowling/")
        data = data_map.extract(document)
        print(data.to_dict())  # Output: {'name': 'J.K. Rowling', 'birth_date': 'July 31, 1965'}

    # You can serialize and deserialize the data map to JSON.
    data_map_json = data_map.to_json()
    data_map = DataMap.from_json(data_map_json)

    # The result of `extract()` is a `StructuredData` object, which can be converted to a dictionary or JSON.
    print(data.to_dict())
    data_json = data.to_json()
    print(data_json)
    print(StructuredData.from_json(data_json).to_dict())


asyncio.run(main())
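
To build intuition for what "most concise selector" means, here is a toy sketch that uses only the Python standard library. It is not Pickaxe's algorithm and does not use its API: `ElementCollector`, `shortest_unique_selector`, and the hand-written HTML snippet are all hypothetical names and data chosen for illustration. The sketch looks for the element whose text equals a sample value, then picks the shortest `tag`, `.class`, or `tag.class` selector that matches that element and nothing else.

```python
from html.parser import HTMLParser

# A few common void elements; they never contain text, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementCollector(HTMLParser):
    """Record the tag, classes, and accumulated text of every element."""

    def __init__(self):
        super().__init__()
        self.elements = []  # every element seen, in document order
        self._stack = []    # elements that are currently open

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        element = {"tag": tag, "classes": classes, "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        open_tags = [element["tag"] for element in self._stack]
        if tag in open_tags:
            # Close the innermost matching element and anything left open inside it
            index = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self._stack[index:]

    def handle_data(self, data):
        # Text belongs to every element that is currently open
        for element in self._stack:
            element["text"] += data


def candidate_selectors(element):
    """Yield the simple selectors this toy considers for one element."""
    yield element["tag"]
    for cls in element["classes"]:
        yield f".{cls}"
        yield f"{element['tag']}.{cls}"


def selector_matches(selector, element):
    tag, _, cls = selector.partition(".")
    return (not tag or element["tag"] == tag) and (not cls or cls in element["classes"])


def shortest_unique_selector(html, sample):
    """Return the shortest candidate selector that matches only the sample's element."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    best = None
    for target in collector.elements:
        if target["text"].strip() != sample:
            continue
        for selector in candidate_selectors(target):
            hits = [e for e in collector.elements if selector_matches(selector, e)]
            if len(hits) == 1 and (best is None or len(selector) < len(best)):
                best = selector
    return best


# Hand-written snippet for illustration only
SAMPLE_HTML = """
<div class="author-details">
  <h3 class="author-title">Albert Einstein</h3>
  <p>Born: <span class="author-born-date">March 14, 1879</span>
     <span class="author-born-location">in Ulm, Germany</span></p>
</div>
"""

print(shortest_unique_selector(SAMPLE_HTML, "Albert Einstein"))  # h3
print(shortest_unique_selector(SAMPLE_HTML, "March 14, 1879"))   # .author-born-date
```

A real data map has to generalize across several documents and handle nested or repeated structures, which this sketch deliberately ignores.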

License

This project is licensed under the MIT License.

Support & Feedback

If you encounter any issues or have feedback, please open an issue. We'd love to hear from you!

Made with ❤️ by Emergent Methods

Project details

Release history

0.5.5: Jun 25, 2025
0.5.4: May 20, 2025
0.5.3: May 20, 2025
0.5.2: May 20, 2025
0.5.1: May 20, 2025
0.5.0: Apr 21, 2025
0.4.2: Feb 15, 2025
0.4.1: Jan 25, 2025
0.4.0: Jan 18, 2025
0.3.3: Jan 12, 2025 (this version)
0.3.2: Jan 12, 2025
0.3.0: Jan 12, 2025
0.2.2: Dec 28, 2024
0.2.1: Dec 28, 2024
0.2.0: Dec 26, 2024
0.1.1: Dec 25, 2024
0.0.0: Dec 25, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distribution

python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl (1.9 MB)

Uploaded Jan 12, 2025. CPython 3.9+, manylinux: glibc 2.34+ x86-64

File details

Details for the file python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl.

File metadata

Download URL: python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl
Upload date: Jan 12, 2025
Size: 1.9 MB
Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.18
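
The tags above are encoded in the wheel's file name. As a minimal sketch, the following splits a wheel file name into its components, following the standard `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl` layout. The `WheelName` type and `parse_wheel_filename` function are hypothetical helpers written for this example, not part of Pickaxe or pip.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]  # optional build tag, usually absent
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)


wheel = parse_wheel_filename("python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl")
print(wheel.python_tag)    # cp39
print(wheel.abi_tag)       # abi3
print(wheel.platform_tag)  # manylinux_2_34_x86_64
```

Here `cp39` with `abi3` corresponds to the "CPython 3.9+" tag, and `manylinux_2_34_x86_64` to "glibc 2.34+ x86-64".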

File hashes

Hashes for python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`a5536fa8ac613bb4af4a199f9b32b651d9502d8f8dcd124fc92beeb7f04a2bd3`
MD5	`207844d0746f48e2b4df277c4537e71c`
BLAKE2b-256	`21130339d3c769b214c9af4f19e5d6817f4ce917aef5938c1959206cdf4afd05`
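
One way to check a downloaded wheel against the SHA256 digest listed above is sketched below, using only the standard library. The `file_sha256` and `verify_file` helpers are hypothetical names for this example, and the script assumes the wheel has been saved to the current directory under its original file name.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest copied from the table above
EXPECTED_SHA256 = "a5536fa8ac613bb4af4a199f9b32b651d9502d8f8dcd124fc92beeb7f04a2bd3"


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return hmac.compare_digest(file_sha256(path), expected_hex.lower())


if __name__ == "__main__":
    wheel = Path("python_pickaxe-0.3.3-cp39-abi3-manylinux_2_34_x86_64.whl")
    if not wheel.exists():
        print(f"{wheel} not found; download it first")
    elif verify_file(wheel, EXPECTED_SHA256):
        print("SHA256 matches")
    else:
        print("SHA256 mismatch; do not install this file")
```

pip can perform the same check at install time through its hash-checking mode, which is usually preferable to a manual script.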

python-pickaxe 0.3.3
