HTML data extraction library
Pickaxe
Pickaxe is a Python package for structured data extraction from HTML documents. It provides a simple and intuitive API for parsing HTML documents and extracting structured data from them.
Features
- Written in Rust: Pickaxe is written in Rust, which makes it fast and memory-efficient.
- Robust: Pickaxe uses the html5ever and selectors crates for browser-grade HTML parsing and CSS selector matching.
- CSS Selectors & XPath: Pickaxe supports both CSS selectors and (simple) XPath expressions for querying HTML documents.
Quick Start
Python
Installation
pip install python-pickaxe
Basic Usage
from pickaxe import HtmlDocument
# Parse an HTML document
document = HtmlDocument.from_str("<html><body><h1>Hello, World!</h1></body></html>")
# Access elements using CSS selectors or XPath expressions
heading = document.find("h1")
print(heading.inner_text) # Output: Hello, World!
heading = document.find_xpath("//h1")
print(heading.inner_text) # Output: Hello, World!
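The calls above return one element at a time. To turn a page into a structured record, they can be wrapped in a small helper. The following is a minimal sketch, not part of Pickaxe: `extract_fields` is a hypothetical name, and it relies only on `find` and `inner_text` as shown above. It also assumes that `find` returns `None` when nothing matches, which this README does not confirm.

```python
def extract_fields(document, selectors):
    """Build a dict mapping field names to the text of the first match.

    `document` is any object with a `find(css_selector)` method whose result
    exposes `inner_text`, such as the HtmlDocument shown above.
    ASSUMPTION: `find` returns None when no element matches the selector.
    """
    record = {}
    for name, selector in selectors.items():
        element = document.find(selector)
        # Missing elements become None so the record keeps a stable shape.
        record[name] = None if element is None else element.inner_text
    return record
```

With the package installed, this could be called as `extract_fields(HtmlDocument.from_str(html), {"title": "h1"})`. If `find` raises an error instead of returning `None`, the lookup would need a `try`/`except` around it.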
Rust
Installation
cargo add rust-pickaxe
Basic Usage
use pickaxe::HtmlDocument;
fn main() {
    // Parse an HTML document
    let document = HtmlDocument::from_str("<html><body><h1>Hello, World!</h1></body></html>").unwrap();
    // Access elements using CSS selectors or XPath expressions
    let heading = document.find("h1").unwrap();
    println!("{}", heading.inner_text()); // Output: Hello, World!
    let heading = document.find_xpath("//h1").unwrap();
    println!("{}", heading.inner_text()); // Output: Hello, World!
}
License
This project is licensed under the MIT License.
Support & Feedback
If you encounter any issues or have feedback, please open an issue. We'd love to hear from you!
Made with ❤️ by Emergent Methods
File details
Details for the file python_pickaxe-0.5.5-cp39-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: python_pickaxe-0.5.5-cp39-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.8 MB
- Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f4518227fde0a1bb7bb4db9e539a5081179afb45131540a48d11a66bbb3afc1 |
| MD5 | d8a0f592a25e967f52e53a3ad73949e0 |
| BLAKE2b-256 | 382f41902d0e53cfd1c31a4fbbf8dcc2b26989c64b64a0b6bbbf0d98ff9c6876 |