NetPull

Web content extraction with screenshots and structured data

NetPull is a Python package for extracting web content: full-page screenshots, HTML, and structured data (headings, paragraphs, links, tables, forms, metadata). It is built on Playwright for cross-browser automation.

Features

📸 Full-page screenshots - Capture entire webpages including lazy-loaded content
📄 HTML extraction - Save cleaned HTML without scripts/styles
🔍 Structured data - Extract titles, headings, paragraphs, and links
📊 Advanced extraction - Tables, forms, images, and metadata (OpenGraph, Twitter Cards)
🌐 Multi-browser - Firefox, Chrome, and WebKit support
🍪 Cookie consent - Automatic handling of cookie popups
⚡ Async support - Both synchronous and asynchronous APIs
🔄 Batch processing - Extract multiple URLs with concurrency control
🎯 CLI & Library - Use as a command-line tool or a Python library

Installation

pip install netpull

After installation, install Playwright browsers:

playwright install firefox
# or
playwright install chrome webkit

Quick Start

Command Line

# Extract single URL
netpull https://example.com

# Extract with all features
netpull https://example.com --extract-all

# Batch extraction from file
netpull -f urls.txt --browser chrome --concurrency 5

# Custom output directory
netpull https://example.com -o ./my-output --filename-pattern "{domain}_{date}"

Python Library

from netpull import extract_webpage

# Basic extraction
result = extract_webpage('https://example.com')
print(result.screenshot_path)  # ./extracted/example_com_20260104_143022.png
print(result.structured_data['title'])  # Page title

Async Usage

import asyncio
from netpull import extract_webpage_async

async def main():
    result = await extract_webpage_async('https://example.com')
    print(result.structured_data)

asyncio.run(main())

Usage Examples

Advanced Configuration

from netpull import extract_webpage, ExtractionConfig, BrowserConfig
from pathlib import Path

# Configure browser
browser_config = BrowserConfig(
    browser_type='chrome',
    headless=True,
    timeout=60000  # 60 seconds
)

# Configure extraction
extraction_config = ExtractionConfig(
    output_dir=Path('./output'),
    filename_pattern='{domain}_{timestamp}',
    extract_images=True,
    extract_tables=True,
    extract_forms=True,
    extract_metadata=True,
    scroll_to_bottom=True,
    handle_cookie_consent=True
)

result = extract_webpage(
    'https://example.com',
    extraction_config=extraction_config,
    browser_config=browser_config
)

Batch Processing

from netpull import extract_batch

urls = [
    'https://example.com',
    'https://github.com',
    'https://python.org'
]

def progress(current, total, url):
    print(f"[{current}/{total}] Processing {url}")

results = extract_batch(urls, progress_callback=progress)

# Check results
for result in results:
    if result.success:
        print(f"✓ {result.url}: {result.screenshot_path}")
    else:
        print(f"✗ {result.url}: {result.error}")
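Once a batch finishes, it can help to separate successes from failures before deciding what to retry. The following is a minimal sketch, not part of NetPull: `FakeResult` is a hypothetical stand-in that carries only the `url`, `success`, and `error` attributes used above, and `split_results` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in with the three result fields used here."""
    url: str
    success: bool
    error: Optional[str] = None


def split_results(results):
    """Return (succeeded_urls, {failed_url: error}) for a batch."""
    succeeded = [r.url for r in results if r.success]
    failed = {r.url: r.error for r in results if not r.success}
    return succeeded, failed


batch = [
    FakeResult("https://example.com", True),
    FakeResult("https://example.invalid", False, "navigation timeout"),
]
ok, failed = split_results(batch)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for url, error in failed.items():
    print(f"retry candidate: {url} ({error})")
```

Because the helper only reads attributes, the same function should work on the list returned by extract_batch, assuming its results expose those attributes as documented in the Result Object section.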

Extract Specific Content

from netpull import extract_webpage, ExtractionConfig

config = ExtractionConfig(
    extract_images=True,
    extract_tables=True,
    extract_metadata=True
)

result = extract_webpage('https://example.com', extraction_config=config)

# Access extracted data
print(f"Found {len(result.images)} images")
print(f"Found {len(result.tables)} tables")
print(f"OpenGraph title: {result.metadata['opengraph'].get('og:title', 'N/A')}")

CLI Reference

Basic Usage

netpull URL [URL ...]
netpull -f FILE
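
The layout of FILE is not specified here. When driving the library from Python instead, a URL list can be loaded as in the sketch below, which assumes one URL per line and treats blank lines and lines starting with # as skippable. Both the format and the `read_urls` helper are assumptions for illustration, not documented NetPull behavior.

```python
import tempfile
from pathlib import Path


def read_urls(path):
    """Return non-empty, non-comment lines from a text file (assumed format)."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


# Demonstrate with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    url_file = Path(tmp) / "urls.txt"
    url_file.write_text(
        "https://example.com\n\n# skipped\nhttps://python.org\n",
        encoding="utf-8",
    )
    urls = read_urls(url_file)

print(urls)
```

The resulting list can then be passed to extract_batch as shown in the Batch Processing example.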

Browser Options

--browser {firefox,chrome,webkit} - Browser to use (default: firefox)
--headless / --no-headless - Headless mode (default: headless)
--timeout SECONDS - Navigation timeout (default: 30)
--user-agent STRING - Custom user agent

Output Options

-o DIR / --output-dir DIR - Output directory (default: ./extracted)
--filename-pattern PATTERN - Filename pattern (default: {domain}_{timestamp})
--output-format {text,json} - CLI output format

Extraction Options

--extract-images - Extract images
--extract-tables - Extract tables
--extract-forms - Extract forms
--extract-metadata - Extract metadata
--extract-all - Enable all extraction features
--no-screenshot - Disable screenshot
--no-html - Disable HTML saving

Navigation Options

--wait-for-selector SELECTOR - Wait for CSS selector
--wait-for-timeout MS - Wait timeout (default: 1000)
--no-scroll - Disable scroll to bottom
--no-cookie-consent - Disable cookie consent handling

Batch Options

--concurrency N - Max concurrent extractions (default: 3)
--delay SECONDS - Delay between requests (default: 0)
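
To illustrate what a concurrency limit means in practice, here is a minimal sketch using asyncio.Semaphore. It is not NetPull's implementation: `run_limited` and `fake_extract` are hypothetical names, and holding the slot during the delay is just one possible reading of "delay between requests".

```python
import asyncio


async def run_limited(urls, worker, concurrency=3, delay=0.0):
    """Run worker(url) for each URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            result = await worker(url)
            if delay:
                await asyncio.sleep(delay)  # pause before freeing the slot
            return result

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_extract(url):
    """Stand-in for a real extraction call; just echoes the URL."""
    await asyncio.sleep(0.01)
    return f"done:{url}"


results = asyncio.run(
    run_limited(["a", "b", "c", "d"], fake_extract, concurrency=2)
)
print(results)
```

Lower concurrency values trade throughput for lower memory use, since each in-flight extraction keeps a browser page open.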

Other Options

--retry N - Retry count on failure (default: 0)
--retry-delay SECONDS - Delay between retries (default: 5)
-v / --verbose - Verbose output
--version - Show version
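
The retry options describe a common pattern: attempt an operation, and on failure wait a fixed delay before trying again, up to N extra times. A minimal sketch of that pattern follows. `with_retries` and `flaky` are hypothetical names, and NetPull's own retry logic may differ (for example in which errors it retries).

```python
import time


def with_retries(action, retries=0, retry_delay=5.0, sleep=time.sleep):
    """Call action(); on exception, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(retry_delay)


calls = []


def flaky():
    """Fails twice, then succeeds (simulated transient error)."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


print(with_retries(flaky, retries=2, retry_delay=0))  # prints "ok"
```

With the default of 0 retries, the first failure is raised immediately, which matches the documented default of --retry.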

Filename Patterns

Use tokens in --filename-pattern:

{domain} - Domain name (example_com)
{timestamp} - Current timestamp (20260104_143022)
{url_hash} - MD5 hash of URL (first 8 chars)
{date} - Current date (2026-01-04)
{time} - Current time (143022)

Example:

netpull https://example.com --filename-pattern "{domain}_{date}"
# Output: example_com_2026-01-04.png
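
The token expansion can be approximated in plain Python. This is a sketch under assumptions, not NetPull's code: `render_filename` is a hypothetical helper, and the domain normalization (dots replaced by underscores) and the date/time formats are inferred from the examples in the token list above.

```python
import hashlib
from datetime import datetime
from urllib.parse import urlparse


def render_filename(pattern: str, url: str, now: datetime) -> str:
    """Expand NetPull-style tokens in `pattern` (hypothetical helper)."""
    # Assumption: the domain is the host with dots turned into underscores.
    domain = urlparse(url).netloc.replace(".", "_").replace(":", "_")
    tokens = {
        "domain": domain,
        "timestamp": now.strftime("%Y%m%d_%H%M%S"),
        "url_hash": hashlib.md5(url.encode("utf-8")).hexdigest()[:8],
        "date": now.strftime("%Y-%m-%d"),
        "time": now.strftime("%H%M%S"),
    }
    return pattern.format(**tokens)


fixed = datetime(2026, 1, 4, 14, 30, 22)
print(render_filename("{domain}_{date}", "https://example.com", fixed))
# example_com_2026-01-04
print(render_filename("{domain}_{timestamp}", "https://example.com", fixed))
# example_com_20260104_143022
```

Patterns built only from {domain} and {date} produce the same name for every run on the same day, so including {timestamp} or {url_hash} is a reasonable way to avoid collisions in batch runs.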

Configuration Classes

BrowserConfig

from netpull import BrowserConfig

config = BrowserConfig(
    browser_type='firefox',  # 'firefox', 'chrome', or 'webkit'
    headless=True,           # Run without GUI
    timeout=30000,           # Navigation timeout (ms)
    viewport_width=1920,     # Browser width
    viewport_height=1080,    # Browser height
    user_agent=None          # Custom user agent
)

ExtractionConfig

from netpull import ExtractionConfig
from pathlib import Path

config = ExtractionConfig(
    # Output
    output_dir=Path('./extracted'),
    filename_pattern='{domain}_{timestamp}',

    # Extraction toggles
    extract_screenshot=True,
    extract_html=True,
    extract_images=False,
    extract_tables=False,
    extract_forms=False,
    extract_metadata=False,

    # Navigation
    wait_for_networkidle=True,
    wait_for_timeout=1000,
    wait_for_selector=None,
    scroll_to_bottom=True,

    # Cookie consent
    handle_cookie_consent=True,
    cookie_consent_timeout=2000,

    # Batch
    batch_concurrency=3,
    batch_delay=0
)

Result Object

The ExtractionResult object contains:

result = extract_webpage('https://example.com')

result.url              # URL that was extracted
result.success          # True if successful, False otherwise
result.error            # Error message if failed
result.screenshot_path  # Path to screenshot (Path object)
result.html_path        # Path to HTML file (Path object)
result.structured_data  # Dict with title, main_content, links
result.images           # List of image data
result.tables           # List of table data
result.forms            # List of form data
result.metadata         # Dict with opengraph, twitter, json_ld

# Convert to dictionary for JSON
result.to_dict()

# String representation
print(result)
# ✓ https://example.com
#   Screenshot: ./extracted/example_com_20260104_143022.png
#   HTML: ./extracted/example_com_20260104_143022.html
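
When writing results to JSON, note that Path objects are not JSON-serializable by default. Whether to_dict() already converts paths to strings is not stated here, so the sketch below uses a hand-built dictionary as a stand-in (its keys are assumptions based on the attribute list above) and passes default=str as a safety net.

```python
import json
from pathlib import Path

# Stand-in for the dictionary result.to_dict() might return; the real
# keys and value types are assumptions based on the attribute list above.
data = {
    "url": "https://example.com",
    "success": True,
    "error": None,
    "screenshot_path": Path("extracted/example_com_20260104_143022.png"),
}

# default=str converts any value json cannot encode natively (such as Path).
text = json.dumps(data, indent=2, default=str)
print(text)
```

If to_dict() already returns plain strings, default=str is simply never invoked, so the call is safe either way.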

Troubleshooting

Playwright browsers not installed

playwright install firefox
# or install all browsers
playwright install

Timeout errors

Increase timeout in browser config:

browser_config = BrowserConfig(timeout=60000)  # 60 seconds

Or via CLI:

netpull https://example.com --timeout 60
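
Note the unit difference: the CLI flag takes seconds, while BrowserConfig.timeout takes milliseconds. If both values come from one settings source, a small conversion helper such as the hypothetical one below (not part of NetPull) can keep them consistent.

```python
def seconds_to_ms(seconds: float) -> int:
    """Convert a timeout in seconds (CLI style) to milliseconds (config style)."""
    if seconds < 0:
        raise ValueError("timeout must be non-negative")
    return round(seconds * 1000)


print(seconds_to_ms(60))   # 60000
print(seconds_to_ms(0.5))  # 500
```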

Cookie consent not handled

The package tries 11 common selectors. For sites with unusual cookie popups, disable auto-handling:

netpull https://example.com --no-cookie-consent

Memory issues with large batches

Reduce concurrency:

netpull -f urls.txt --concurrency 1

Development

Install for development

git clone https://github.com/netpull/netpull.git
cd netpull
pip install -e ".[dev]"
playwright install firefox

Run tests

pytest tests/ -v

License

MIT License - see the LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

Built with:

Playwright - Browser automation
BeautifulSoup - HTML parsing

NetPull - Simple, powerful web content extraction

This version

0.1.0

Jan 4, 2026

Download files

Source Distribution

netpull-0.1.0.tar.gz (21.5 kB)

Uploaded Jan 4, 2026 Source

Built Distribution

netpull-0.1.0-py3-none-any.whl (23.2 kB)

Uploaded Jan 4, 2026 Python 3

File details

Details for the file netpull-0.1.0.tar.gz.

File metadata

Download URL: netpull-0.1.0.tar.gz
Upload date: Jan 4, 2026
Size: 21.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for netpull-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`12c4f2d189566098d9fdd43aadef2fb1a4952a87f8ef7073bbec52c5a584f585`
MD5	`23e6084e639be3270a7e809d3ac748a3`
BLAKE2b-256	`349b90ffddb66da0396415610e41729c8baba07664271bac50f983b4c348ec5d`

File details

Details for the file netpull-0.1.0-py3-none-any.whl.

File metadata

Download URL: netpull-0.1.0-py3-none-any.whl
Upload date: Jan 4, 2026
Size: 23.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for netpull-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aded1c13943817ea28909a64c9c8d1eb41b1ef2fab143d726e276f992e4b41bf`
MD5	`09f895c91c7f8fd4a0a45802aec8dbc4`
BLAKE2b-256	`cf9b42f6c15e493a12dd27fb1c69dec46fa76815c6dd27b9974330bb419927e1`
