
HTML data extraction library


Pickaxe


Pickaxe is a Python package for structured data extraction from HTML documents. It provides a simple, intuitive API for parsing HTML documents and extracting structured data from them.

Features

  • Written in Rust: Pickaxe is written in Rust, which makes it fast and memory-efficient.
  • Robust: Pickaxe uses the html5ever and selectors crates for browser-grade HTML parsing and CSS selector matching.
  • CSS Selectors & XPath: Pickaxe supports both CSS selectors and (simple) XPath expressions for querying HTML documents.

Quick Start

Python

Installation

pip install python-pickaxe

Basic Usage

from pickaxe import HtmlDocument

# Parse an HTML document
document = HtmlDocument.from_str("<html><body><h1>Hello, World!</h1></body></html>")

# Access elements using CSS selectors or XPath expressions
heading = document.find("h1")
print(heading.inner_text)  # Output: Hello, World!

heading = document.find_xpath("//h1")
print(heading.inner_text)  # Output: Hello, World!
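
The two lookups above select the same element: the CSS selector `h1` and the XPath expression `//h1` both match `h1` elements anywhere in the document. As a rough illustration of that equivalence, the sketch below uses only Python's standard-library `xml.etree.ElementTree` instead of Pickaxe, so it runs without installing anything. It is not Pickaxe's API. The helper names `by_tag` and `by_xpath` are hypothetical, and ElementTree accepts only well-formed XML and a limited XPath subset, which is why the expression is written `.//h1`.

```python
import xml.etree.ElementTree as ET

# The same markup as in the example above; it happens to be well-formed XML,
# which ElementTree requires (it is not an HTML5 parser).
MARKUP = "<html><body><h1>Hello, World!</h1></body></html>"


def by_tag(root, tag):
    """Hypothetical helper: first element with the given tag name,
    comparable to the CSS type selector `h1`."""
    # Element.iter walks the root and all of its descendants in document order.
    return next(root.iter(tag), None)


def by_xpath(root, expression):
    """Hypothetical helper: first match for an ElementTree-style XPath expression."""
    return root.find(expression)


root = ET.fromstring(MARKUP)
css_like = by_tag(root, "h1")
xpath_like = by_xpath(root, ".//h1")

# Both lookups resolve to the very same element object.
print(css_like is xpath_like)  # True
print(xpath_like.text)         # Hello, World!
```

Real-world HTML is rarely well-formed XML, which is where a browser-grade parser such as the one Pickaxe builds on becomes necessary.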

Rust

Installation

cargo add rust-pickaxe

Basic Usage

use pickaxe::HtmlDocument;

fn main() {
    // Parse an HTML document
    let document = HtmlDocument::from_str("<html><body><h1>Hello, World!</h1></body></html>").unwrap();

    // Access elements using CSS selectors or XPath expressions
    let heading = document.find("h1").unwrap();
    println!("{}", heading.inner_text());  // Output: Hello, World!

    let heading = document.find_xpath("//h1").unwrap();
    println!("{}", heading.inner_text());  // Output: Hello, World!
}

License

This project is licensed under the MIT License.

Support & Feedback

If you encounter any issues or have feedback, please open an issue. We'd love to hear from you!

Made with ❤️ by Emergent Methods
