Extract and transform HTML page content with composable CLI tools
pagepull
Extract structured data from HTML pages via the command line.
pagepull wraps BeautifulSoup behind a simple CLI, turning common DOM extraction tasks into one-liners. Think of it as jq for HTML.
Install
pip install pagepull
Or with pipx for an isolated install:
pipx install pagepull
Quick Start
# Extract the content div from a WordPress page
pagepull div entry-content https://example.com/about
# Same thing, as markdown
pagepull div entry-content --markdown https://example.com/about
# List images and check for missing alt text
pagepull images --alt https://example.com/about
# Pull meta tags
pagepull meta --title --description https://example.com/about
# Use any CSS selector
pagepull select "nav.primary a" page.html
Input
pagepull accepts three input types:
# Local file
pagepull div content page.html
# URL (fetched automatically)
pagepull div content https://example.com/page
# stdin
curl -s https://example.com | pagepull div content
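The three input types can be told apart from the source argument alone. The sketch below is illustrative only and is not taken from pagepull's source: `classify_source` is a hypothetical helper, and it assumes that a missing source argument means "read from stdin" and that only `http` and `https` schemes count as URLs.

```python
from urllib.parse import urlparse


def classify_source(arg):
    """Return 'stdin', 'url', or 'file' for a source argument.

    Hypothetical helper: assumes a missing argument means stdin
    and that only http/https schemes are fetched as URLs.
    """
    if arg is None:
        return "stdin"
    scheme = urlparse(arg).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    # Anything else is treated as a path on the local filesystem.
    return "file"
```

Under these assumptions, `classify_source("page.html")` yields `"file"` and `classify_source("https://example.com/page")` yields `"url"`.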
Commands
div — Extract a div by class or id
pagepull div entry-content page.html
pagepull div sidebar --by id page.html
pagepull div entry-content --strip script,style --markdown page.html
images — List images with metadata
pagepull images page.html
pagepull images --alt --dimensions page.html
pagepull images --json page.html
Use --alt to show alt text (missing alt text is flagged as [MISSING]) and --dimensions to show width and height.
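To make the [MISSING] flag concrete, here is a rough standard-library sketch of an alt-text audit. It is not pagepull's implementation (which is built on BeautifulSoup); `ImageAudit` and `audit_images` are hypothetical names, and the sketch assumes that only an absent `alt` attribute counts as missing, while an empty `alt=""` is reported as an empty string.

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Collect src and alt for every <img> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        if "alt" in attr:
            # A valueless attribute (<img alt>) is parsed as None.
            alt = attr["alt"] or ""
        else:
            # Assumption: only an absent attribute is flagged.
            alt = "[MISSING]"
        self.images.append({"src": attr.get("src") or "", "alt": alt})


def audit_images(html):
    """Return a list of {'src', 'alt'} dicts, one per image."""
    parser = ImageAudit()
    parser.feed(html)
    return parser.images
```

Whether pagepull itself treats an empty `alt=""` as missing is not stated here, so check its actual output before relying on that distinction.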
meta — Extract meta tags
pagepull meta page.html # all meta tags
pagepull meta --title --description page.html # specific tags
pagepull meta --og page.html # Open Graph tags
links — List all links
pagepull links page.html
pagepull links --external-only page.html
pagepull links --csv page.html
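One plausible reading of --external-only is "links whose host differs from the page's host". The sketch below shows that idea with the standard library only. It is an assumption about the behavior, not pagepull's code, and `LinkCollector` and `external_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, base_url):
    """Return absolute http(s) links that point to a different host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    result = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL first.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != base_host:
            result.append(absolute)
    return result
```

In this sketch, relative links and non-HTTP schemes such as `mailto:` are never reported as external.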
headings — Heading hierarchy
pagepull headings page.html
h1: Welcome to Our Site
h2: About Us
h2: Services
h3: Web Design
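Output of this shape can be produced by walking the document in order and recording each heading tag with its text. A minimal standard-library sketch follows; `HeadingOutline` and `outline` are hypothetical names, and this is not pagepull's implementation.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingOutline(HTMLParser):
    """Record (tag, text) for each heading in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently open, if any
        self._parts = []      # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((tag, text))
            self._current = None


def outline(html):
    """Return lines such as 'h2: About Us', one per heading."""
    parser = HeadingOutline()
    parser.feed(html)
    return [f"{tag}: {text}" for tag, text in parser.headings]
```

Inline markup inside a heading (for example an `<em>` element) contributes its text but not its tags.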
text — Visible text only
pagepull text page.html
pagepull text --selector "div.entry-content" page.html
select — Raw CSS selector
pagepull select "nav a" page.html
pagepull select "img[alt='']" --json page.html
pagepull select "h2 + p" --text page.html
strip — Remove elements
pagepull strip script noscript style page.html
table — Extract HTML tables
pagepull table --csv page.html
pagepull table --index 0 --json page.html
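As a rough illustration of what table extraction to CSV involves, the sketch below collects rows and cells with the standard library and writes them with the `csv` module. `TableParser` and `table_to_csv` are hypothetical names, nested tables are not handled, and pagepull's real output format may differ.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html, index=0):
    """Return the table at position `index` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.tables[index])
    return buffer.getvalue()
```

The `index` parameter mirrors the zero-based --index flag shown above. Cells containing commas are quoted by the `csv` writer.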
Global Flags
| Flag | Description |
|---|---|
| --selector <css> | Scope any command to a CSS selector first |
| --json | Structured JSON output |
| --csv | CSV output (where applicable) |
| --markdown | Convert HTML to markdown |
| --quiet | Suppress headers and labels |
Scoping with --selector
Any command can be scoped to a portion of the page:
# Images only within the article
pagepull images --alt --selector "article" page.html
# Links only in the footer
pagepull links --selector "footer" page.html
# Text from a specific section
pagepull text --selector "div.entry-content" page.html
Pairing with sitewalker
pagepull handles one page. sitewalker crawls sites. Together they cover site-wide extraction:
# Audit alt text across an entire site
sitewalker -p https://example.com | xargs -I{} pagepull images --alt --json {}
# Extract every page title
sitewalker -p https://example.com | xargs -I{} pagepull meta --title {}
# Pull article content as markdown
sitewalker -p https://example.com | xargs -I{} pagepull div content --markdown {}
Development
git clone git@github.com:cadentdev/pagepull.git
cd pagepull
poetry install
poetry run pytest
Requirements
- Python 3.11+
- Dependencies: beautifulsoup4, requests, markdownify
License
MIT