Extract and transform HTML page content with composable CLI tools

pagepull

Extract structured data from HTML pages via the command line.

pagepull wraps BeautifulSoup behind a simple CLI, turning common DOM extraction tasks into one-liners. Think of it as jq for HTML.

Install

pip install pagepull

Or with pipx, for an isolated install:

pipx install pagepull

Quick Start

# Extract the content div from a WordPress page
pagepull div entry-content https://example.com/about

# Same thing, as markdown
pagepull div entry-content --markdown https://example.com/about

# List images and check for missing alt text
pagepull images --alt https://example.com/about

# Pull meta tags
pagepull meta --title --description https://example.com/about

# Use any CSS selector
pagepull select "nav.primary a" page.html

Input

pagepull accepts three input types:

# Local file
pagepull div content page.html

# URL (fetched automatically)
pagepull div content https://example.com/page

# stdin
curl -s https://example.com | pagepull div content
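As a rough illustration of how a CLI could tell these three inputs apart, here is a minimal Python sketch. It is not pagepull's actual code: `classify_source` is a hypothetical helper, and the rule it applies (an http/https scheme means URL, no argument means stdin, anything else is a file path) is an assumption based only on the examples above.

```python
from urllib.parse import urlparse


def classify_source(arg):
    """Guess the input type for a positional argument.

    Hypothetical helper for illustration; not part of pagepull.
    """
    if arg is None:
        # No positional argument: assume HTML is piped in on stdin.
        return "stdin"
    if urlparse(arg).scheme.lower() in ("http", "https"):
        return "url"
    # Everything else is treated as a local file path.
    return "file"


for source in ("page.html", "https://example.com/page", None):
    print(source, "->", classify_source(source))
```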

Commands

div — Extract a div by class or id

pagepull div entry-content page.html
pagepull div sidebar --by id page.html
pagepull div entry-content --strip script,style --markdown page.html

images — List images with metadata

pagepull images page.html
pagepull images --alt --dimensions page.html
pagepull images --json page.html

Flags: --alt shows alt text (a missing alt attribute is flagged as [MISSING]); --dimensions shows width and height.
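The alt-text check can be approximated in a few lines of Python. This sketch uses the standard library's `html.parser` rather than BeautifulSoup so it runs without dependencies; it is not pagepull's implementation. `ImageAudit` and `audit_images` are hypothetical names, and treating an empty `alt=""` as present (rather than missing) is an assumption.

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" not in attributes:
            # No alt attribute at all: flag it.
            alt = "[MISSING]"
        else:
            # alt="" (or a bare alt) is kept as an empty string.
            alt = attributes["alt"] or ""
        self.images.append((attributes.get("src") or "", alt))


def audit_images(html):
    parser = ImageAudit()
    parser.feed(html)
    parser.close()
    return parser.images


sample = '<img src="a.png" alt="Logo"><img src="b.png"><img src="c.png" alt="">'
for src, alt in audit_images(sample):
    print(src, alt)
```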

meta — Extract meta tags

pagepull meta page.html                         # all meta tags
pagepull meta --title --description page.html   # specific tags
pagepull meta --og page.html                    # Open Graph tags
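For readers curious what meta extraction involves, the following standard-library sketch collects the page title plus every `<meta>` tag that has a `name` or `property` and a `content` attribute. It is an illustration under assumptions, not pagepull's code: `MetaCollector` and `collect_meta` are hypothetical, and selecting Open Graph tags by the `og:` prefix is a guess at what `--og` does.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Gather <title> text and <meta name|property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attributes.get("name") or attributes.get("property")
            if key and attributes.get("content") is not None:
                self.meta[key] = attributes["content"]

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.meta["title"] = "".join(self._title_parts).strip()


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta


def open_graph(meta):
    # Assumption: Open Graph tags are those whose key starts with "og:".
    return {key: value for key, value in meta.items() if key.startswith("og:")}
```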

links — List all links

pagepull links page.html
pagepull links --external-only page.html
pagepull links --csv page.html
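The README does not define "external". One plausible reading is "an http(s) link whose host differs from the page's host", and the sketch below implements that reading with the standard library. `LinkCollector`, `external_links`, and the `base_url` parameter are hypothetical; pagepull may use a different rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, base_url):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc.lower()
    external = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != base_host:
            external.append(absolute)
    return external
```

Non-web schemes such as `mailto:` are skipped here, which is a design choice of this sketch rather than documented behavior.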

headings — Heading hierarchy

pagepull headings page.html
h1: Welcome to Our Site
  h2: About Us
  h2: Services
    h3: Web Design
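The indented outline above can be reproduced with a short standard-library sketch. It assumes indentation is two spaces per heading level (h1 at column 0, h2 at 2, and so on), which matches the sample output but is not confirmed by the README. `HeadingOutline` and `outline` are hypothetical names, not pagepull's API.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingOutline(HTMLParser):
    """Build one indented 'hN: text' line per heading."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = None  # heading tag currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        if self._current:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            level = int(tag[1])
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._parts).split())
            self.lines.append("  " * (level - 1) + f"{tag}: {text}")
            self._current = None


def outline(html):
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(outline("<h1>Welcome to Our Site</h1><h2>About Us</h2>"))
```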

text — Visible text only

pagepull text page.html
pagepull text --selector "div.entry-content" page.html
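"Visible text" generally means text that is not inside script or style elements. A minimal sketch of that idea, using only the standard library, follows. The set of skipped tags and the names `VisibleText` and `visible_text` are assumptions for illustration; pagepull's own rules may differ.

```python
from html.parser import HTMLParser

# Assumption: these elements never contribute visible text.
HIDDEN_TAGS = {"script", "style", "noscript"}


class VisibleText(HTMLParser):
    """Collect text that is not nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth and data.strip():
            # Normalize internal whitespace in each text run.
            self.chunks.append(" ".join(data.split()))


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```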

select — Raw CSS selector

pagepull select "nav a" page.html
pagepull select "img[alt='']" --json page.html
pagepull select "h2 + p" --text page.html

strip — Remove elements

pagepull strip script noscript style page.html

table — Extract HTML tables

pagepull table --csv page.html
pagepull table --index 0 --json page.html
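To show what table-to-CSV conversion involves, here is a standard-library sketch. It is not pagepull's implementation: `TableCollector` and `table_to_csv` are hypothetical, the zero-based `index` mirrors the `--index 0` example but is otherwise an assumption, and nested tables and `colspan`/`rowspan` are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text parts of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html, index=0):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    # csv.writer quotes cells that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerows(parser.tables[index])
    return buffer.getvalue()
```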

Global Flags

Flag               Description
--selector <css>   Scope any command to a CSS selector first
--json             Structured JSON output
--csv              CSV output (where applicable)
--markdown         Convert HTML to markdown
--quiet            Suppress headers and labels

Scoping with --selector

Any command can be scoped to a portion of the page:

# Images only within the article
pagepull images --alt --selector "article" page.html

# Links only in the footer
pagepull links --selector "footer" page.html

# Text from a specific section
pagepull text --selector "div.entry-content" page.html

Pairing with sitewalker

pagepull handles one page. sitewalker crawls sites. Together they cover site-wide extraction:

# Audit alt text across an entire site
sitewalker -p https://example.com | xargs -I{} pagepull images --alt --json {}

# Extract every page title
sitewalker -p https://example.com | xargs -I{} pagepull meta --title {}

# Pull article content as markdown
sitewalker -p https://example.com | xargs -I{} pagepull div content --markdown {}

Development

git clone git@github.com:cadentdev/pagepull.git
cd pagepull
poetry install
poetry run pytest

Requirements

  • Python 3.11+
  • Dependencies: beautifulsoup4, requests, markdownify

License

MIT

Download files

Source Distribution

pagepull-0.1.0.tar.gz (5.0 kB view details)

Uploaded Source

Built Distribution

pagepull-0.1.0-py3-none-any.whl (7.8 kB view details)

Uploaded Python 3

File details

Details for the file pagepull-0.1.0.tar.gz.

File metadata

  • Download URL: pagepull-0.1.0.tar.gz
  • Upload date:
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for pagepull-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d80cb67ec00c8856645e8168206b03972dc2b675af2be8559d97134de276337b
MD5 02edec9332d08235fa4c3ce7fbb9c00f
BLAKE2b-256 d07f5378cca8c1ec4986cac5f60b80f40db07110f41c3466c84e4afdcbcb647e

File details

Details for the file pagepull-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pagepull-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for pagepull-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fd805bcccd482ef40b97c7b0c95cbf469b08a08dd4a4c558d2bb9f0bb6061ee7
MD5 2fa38e9a3d3a21f4875c38ffeb80bd12
BLAKE2b-256 277fb5c7f396a8fe1185c79fedb3168a2b5f7c1f98810aa4a6f6a8c9dc3780b7
