Extract and transform HTML page content with composable CLI tools

These details have not been verified by PyPI

Project description

pagepull

Extract structured data from HTML pages via the command line.

pagepull wraps BeautifulSoup behind a simple CLI, turning common DOM extraction tasks into one-liners. Think of it as jq for HTML.

Install

pip install pagepull

Or with pipx, for an isolated install:

pipx install pagepull

Quick Start

# Extract the content div from a WordPress page
pagepull div entry-content https://example.com/about

# Same thing, as markdown
pagepull div entry-content --markdown https://example.com/about

# List images and check for missing alt text
pagepull images --alt https://example.com/about

# Pull meta tags
pagepull meta --title --description https://example.com/about

# Use any CSS selector
pagepull select "nav.primary a" page.html

Input

pagepull accepts three input types:

# Local file
pagepull div content page.html

# URL (fetched automatically)
pagepull div content https://example.com/page

# stdin
curl -s https://example.com | pagepull div content
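
As a rough illustration of how a tool might tell these three input types apart, here is a minimal sketch. It is not pagepull's actual code: `classify_source` and `load_html` are hypothetical names, treating `-` or a missing argument as stdin is an assumption, and the sketch uses the standard-library `urllib` rather than the `requests` dependency listed under Requirements.

```python
import sys
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def classify_source(source):
    """Return 'stdin', 'url', or 'file' for a command-line source argument."""
    # Assumption: no argument (or "-") means the HTML arrives on stdin.
    if source is None or source == "-":
        return "stdin"
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    return "file"


def load_html(source=None):
    """Read HTML text from stdin, a URL, or a local file (hypothetical helper)."""
    kind = classify_source(source)
    if kind == "stdin":
        return sys.stdin.read()
    if kind == "url":
        with urlopen(source) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8", errors="replace")
```

Checking the URL scheme first keeps a relative path such as `page.html` from ever being mistaken for a URL.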

Commands

`div` — Extract a div by class or id

pagepull div entry-content page.html
pagepull div sidebar --by id page.html
pagepull div entry-content --strip script,style --markdown page.html
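
To make the idea of extracting a div by class or id concrete, the sketch below pulls the text of the first matching `<div>` using only the standard-library `html.parser`. It is an illustration under stated assumptions, not pagepull's implementation (which, per the description above, wraps BeautifulSoup); `DivTextExtractor` and `extract_div_text` are hypothetical names, and this version returns text only, without the `--strip` or `--markdown` behavior.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> matching a class or id."""

    def __init__(self, value, by="class"):
        super().__init__()
        self.value = value
        self.by = by
        self.depth = 0      # how many <div> levels deep we are inside the match
        self.done = False   # set once the matching div has closed
        self.parts = []

    def _matches(self, attrs):
        attributes = dict(attrs)
        if self.by == "id":
            return attributes.get("id") == self.value
        # A class attribute can hold several space-separated class names.
        return self.value in (attributes.get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1          # nested div inside the match
        elif self._matches(attrs):
            self.depth = 1           # entering the matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_div_text(html, value, by="class"):
    parser = DivTextExtractor(value, by)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Tracking nesting depth is what lets the extractor include nested divs and still stop at the correct closing tag.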

`images` — List images with metadata

pagepull images page.html
pagepull images --alt --dimensions page.html
pagepull images --json page.html

Use `--alt` to show alt text (images with no alt text are flagged as [MISSING]) and `--dimensions` to show width and height.
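
The alt-text check can be sketched in a few lines of standard-library Python. This is only an illustration: `ImageCollector` and `list_images` are hypothetical names, and treating an empty `alt=""` the same as an absent attribute is an assumption, since the README does not say how pagepull handles that case.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect src, alt, width, and height for every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        self.images.append({
            "src": attributes.get("src", ""),
            # Assumption: an absent or empty alt attribute counts as missing.
            "alt": alt if alt else "[MISSING]",
            "width": attributes.get("width"),
            "height": attributes.get("height"),
        })


def list_images(html):
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

Each image becomes a plain dictionary, so the result could be passed straight to `json.dumps` for JSON-style output.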

`meta` — Extract meta tags

pagepull meta page.html                         # all meta tags
pagepull meta --title --description page.html   # specific tags
pagepull meta --og page.html                    # Open Graph tags
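
For readers curious what meta extraction involves, here is a minimal standard-library sketch. `MetaCollector` and `extract_meta` are hypothetical names, and the shape of the returned dictionary is an assumption for illustration; it does not describe pagepull's real output format.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the <title> text and name/property -> content pairs from <meta>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Standard tags use name="..."; Open Graph tags use property="og:...".
            key = attributes.get("name") or attributes.get("property")
            if key and attributes.get("content") is not None:
                self.meta[key] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    og = {k: v for k, v in parser.meta.items() if k.startswith("og:")}
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description"),
        "og": og,
    }
```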

`links` — List all links

pagepull links page.html
pagepull links --external-only page.html
pagepull links --csv page.html
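
One plausible way to decide whether a link is external is to resolve it against the page URL and compare hosts. The sketch below does that with the standard library; `LinkCollector`, `list_links`, and the `base_url` parameter are hypothetical, and the host comparison is an assumption about what "external" means rather than pagepull's documented rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_links(html, base_url, external_only=False):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc.lower()
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        absolute = urljoin(base_url, href)
        is_external = urlparse(absolute).netloc.lower() != base_host
        if external_only and not is_external:
            continue
        links.append(absolute)
    return links
```

Note that this simple comparison treats subdomains and non-HTTP links such as `mailto:` as external; a real tool may want finer rules.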

`headings` — Heading hierarchy

pagepull headings page.html

Example output:

h1: Welcome to Our Site
  h2: About Us
  h2: Services
    h3: Web Design
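
The indented outline above can be reproduced with a short standard-library sketch that indents each heading by two spaces per level. `HeadingCollector` and `format_headings` are hypothetical names used for illustration; this is not how pagepull itself is implemented.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Join text from inline children and collapse whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def format_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(
        f"{'  ' * (level - 1)}h{level}: {text}" for level, text in parser.headings
    )
```

An outline like this makes skipped levels (for example an h3 directly under an h1) easy to spot.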

`text` — Visible text only

pagepull text page.html
pagepull text --selector "div.entry-content" page.html

`select` — Raw CSS selector

pagepull select "nav a" page.html
pagepull select "img[alt='']" --json page.html
pagepull select "h2 + p" --text page.html

`strip` — Remove elements

pagepull strip script noscript style page.html

`table` — Extract HTML tables

pagepull table --csv page.html
pagepull table --index 0 --json page.html
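
As a sketch of what converting an HTML table to CSV involves, the following standard-library example collects rows and cells and writes them with the `csv` module. `TableCollector` and `table_to_csv` are hypothetical names; the zero-based `index` parameter mirrors the `--index 0` flag shown above, but the sketch ignores nested tables and `colspan`/`rowspan`, and it is not pagepull's own code.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # row currently being built, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html, index=0):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    # The csv module quotes cells that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerows(parser.tables[index])
    return buffer.getvalue()
```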

Global Flags

Flag	Description
`--selector <css>`	Scope any command to a CSS selector first
`--json`	Structured JSON output
`--csv`	CSV output (where applicable)
`--markdown`	Convert HTML to markdown
`--quiet`	Suppress headers and labels

Scoping with `--selector`

Any command can be scoped to a portion of the page:

# Images only within the article
pagepull images --alt --selector "article" page.html

# Links only in the footer
pagepull links --selector "footer" page.html

# Text from a specific section
pagepull text --selector "div.entry-content" page.html

Pairing with sitewalker

pagepull handles one page at a time, while sitewalker crawls whole sites. Together they cover site-wide extraction:

# Audit alt text across an entire site
sitewalker -p https://example.com | xargs -I{} pagepull images --alt --json {}

# Extract every page title
sitewalker -p https://example.com | xargs -I{} pagepull meta --title {}

# Pull article content as markdown
sitewalker -p https://example.com | xargs -I{} pagepull div content --markdown {}

Development

git clone git@github.com:cadentdev/pagepull.git
cd pagepull
poetry install
poetry run pytest

Requirements

Python 3.11+
Dependencies: beautifulsoup4, requests, markdownify

License

MIT

Project details

Release history

0.2.0: Apr 9, 2026
0.1.0 (this version): Apr 6, 2026

Download files

Source Distribution

pagepull-0.1.0.tar.gz (5.0 kB)

Uploaded Apr 6, 2026 Source

Built Distribution

pagepull-0.1.0-py3-none-any.whl (7.8 kB)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file pagepull-0.1.0.tar.gz.

File metadata

Download URL: pagepull-0.1.0.tar.gz
Upload date: Apr 6, 2026
Size: 5.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for pagepull-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d80cb67ec00c8856645e8168206b03972dc2b675af2be8559d97134de276337b`
MD5	`02edec9332d08235fa4c3ce7fbb9c00f`
BLAKE2b-256	`d07f5378cca8c1ec4986cac5f60b80f40db07110f41c3466c84e4afdcbcb647e`

File details

Details for the file pagepull-0.1.0-py3-none-any.whl.

File metadata

Download URL: pagepull-0.1.0-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 7.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for pagepull-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fd805bcccd482ef40b97c7b0c95cbf469b08a08dd4a4c558d2bb9f0bb6061ee7`
MD5	`2fa38e9a3d3a21f4875c38ffeb80bd12`
BLAKE2b-256	`277fb5c7f396a8fe1185c79fedb3168a2b5f7c1f98810aa4a6f6a8c9dc3780b7`
