NetPull

Web content extraction with screenshots and structured data

NetPull is a Python package for extracting web content, including full-page screenshots, HTML, and structured data (headings, paragraphs, links, tables, forms, metadata). It is built on Playwright for reliable cross-browser automation.

Features

  • 📸 Full-page screenshots - Capture entire webpages including lazy-loaded content
  • 📄 HTML extraction - Save cleaned HTML without scripts/styles
  • 🔍 Structured data - Extract titles, headings, paragraphs, and links
  • 📊 Advanced extraction - Tables, forms, images, and metadata (OpenGraph, Twitter Cards)
  • 🌐 Multi-browser - Firefox, Chrome, and WebKit support
  • 🍪 Cookie consent - Automatic handling of cookie popups
  • Async support - Both synchronous and asynchronous APIs
  • 🔄 Batch processing - Extract multiple URLs with concurrency control
  • 🎯 CLI & Library - Use as command-line tool or Python library

Installation

pip install netpull

After installation, install Playwright browsers:

playwright install firefox
# or
playwright install chrome webkit

Quick Start

Command Line

# Extract single URL
netpull https://example.com

# Extract with all features
netpull https://example.com --extract-all

# Batch extraction from file
netpull -f urls.txt --browser chrome --concurrency 5

# Custom output directory
netpull https://example.com -o ./my-output --filename-pattern "{domain}_{date}"

Python Library

from netpull import extract_webpage

# Basic extraction
result = extract_webpage('https://example.com')
print(result.screenshot_path)  # ./extracted/example_com_20260104_143022.png
print(result.structured_data['title'])  # Page title

Async Usage

import asyncio
from netpull import extract_webpage_async

async def main():
    result = await extract_webpage_async('https://example.com')
    print(result.structured_data)

asyncio.run(main())

Usage Examples

Advanced Configuration

from netpull import extract_webpage, ExtractionConfig, BrowserConfig
from pathlib import Path

# Configure browser
browser_config = BrowserConfig(
    browser_type='chrome',
    headless=True,
    timeout=60000  # 60 seconds
)

# Configure extraction
extraction_config = ExtractionConfig(
    output_dir=Path('./output'),
    filename_pattern='{domain}_{timestamp}',
    extract_images=True,
    extract_tables=True,
    extract_forms=True,
    extract_metadata=True,
    scroll_to_bottom=True,
    handle_cookie_consent=True
)

result = extract_webpage(
    'https://example.com',
    extraction_config=extraction_config,
    browser_config=browser_config
)

Batch Processing

from netpull import extract_batch

urls = [
    'https://example.com',
    'https://github.com',
    'https://python.org'
]

def progress(current, total, url):
    print(f"[{current}/{total}] Processing {url}")

results = extract_batch(urls, progress_callback=progress)

# Check results
for result in results:
    if result.success:
        print(f"✓ {result.url}: {result.screenshot_path}")
    else:
        print(f"✗ {result.url}: {result.error}")

Extract Specific Content

from netpull import extract_webpage, ExtractionConfig

config = ExtractionConfig(
    extract_images=True,
    extract_tables=True,
    extract_metadata=True
)

result = extract_webpage('https://example.com', extraction_config=config)

# Access extracted data
print(f"Found {len(result.images)} images")
print(f"Found {len(result.tables)} tables")
print(f"OpenGraph title: {result.metadata['opengraph'].get('og:title', 'N/A')}")
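To make the shape of the `opengraph` dictionary more concrete, here is a minimal standard-library sketch of how `og:` meta tags could be collected from raw HTML. `OpenGraphParser` and `parse_opengraph` are hypothetical names used for illustration only; NetPull's own extraction code is not shown in this document and may work differently.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.opengraph = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        # Keep only OpenGraph properties that actually carry a value.
        if prop.startswith("og:") and content is not None:
            self.opengraph[prop] = content


def parse_opengraph(html: str) -> dict:
    """Return a mapping such as {'og:title': '...'} for the given HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.opengraph


if __name__ == "__main__":
    page = '<head><meta property="og:title" content="Example Domain"></head>'
    data = parse_opengraph(page)
    print(data.get("og:title", "N/A"))
```

Using `.get('og:title', 'N/A')`, as in the example above, avoids a `KeyError` on pages that do not declare the tag.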

CLI Reference

Basic Usage

netpull URL [URL ...]
netpull -f FILE

Browser Options

  • --browser {firefox,chrome,webkit} - Browser to use (default: firefox)
  • --headless / --no-headless - Headless mode (default: headless)
  • --timeout SECONDS - Navigation timeout (default: 30)
  • --user-agent STRING - Custom user agent

Output Options

  • -o DIR / --output-dir DIR - Output directory (default: ./extracted)
  • --filename-pattern PATTERN - Filename pattern (default: {domain}_{timestamp})
  • --output-format {text,json} - CLI output format

Extraction Options

  • --extract-images - Extract images
  • --extract-tables - Extract tables
  • --extract-forms - Extract forms
  • --extract-metadata - Extract metadata
  • --extract-all - Enable all extraction features
  • --no-screenshot - Disable screenshot
  • --no-html - Disable HTML saving

Navigation Options

  • --wait-for-selector SELECTOR - Wait for CSS selector
  • --wait-for-timeout MS - Wait timeout (default: 1000)
  • --no-scroll - Disable scroll to bottom
  • --no-cookie-consent - Disable cookie consent handling

Batch Options

  • --concurrency N - Max concurrent extractions (default: 3)
  • --delay SECONDS - Delay between requests (default: 0)
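
As an illustration of what a concurrency limit like `--concurrency N` means, the sketch below caps the number of in-flight tasks with `asyncio.Semaphore`. It is not NetPull's implementation: `run_bounded` and `fake_extract` are hypothetical stand-ins, and the real extraction call is replaced by a short sleep.

```python
import asyncio


async def run_bounded(items, worker, concurrency=3):
    """Run worker(item) for each item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_extract(url):
    """Stand-in for a real page extraction (assumption: it is awaitable)."""
    await asyncio.sleep(0.01)
    return f"done: {url}"


if __name__ == "__main__":
    urls = ["https://example.com", "https://github.com", "https://python.org"]
    for line in asyncio.run(run_bounded(urls, fake_extract, concurrency=2)):
        print(line)
```

A lower limit trades throughput for lower memory use, since each concurrent extraction keeps its own page open.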

Other Options

  • --retry N - Retry count on failure (default: 0)
  • --retry-delay SECONDS - Delay between retries (default: 5)
  • -v / --verbose - Verbose output
  • --version - Show version
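
For readers who want the same behaviour as `--retry` and `--retry-delay` around their own calls, a retry loop can be sketched as follows. `with_retries` is a hypothetical helper, not part of NetPull, and it assumes that a failure is signalled by an exception.

```python
import time


def with_retries(func, retries=0, retry_delay=5.0, sleep=time.sleep):
    """Call func(); if it raises, retry up to `retries` more times.

    `retry_delay` is the pause in seconds between attempts. The `sleep`
    parameter exists only so the pause can be replaced in tests.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(retry_delay)
```

With `retries=0` (the CLI default listed above) the function is called once and any error propagates immediately.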

Filename Patterns

Use tokens in --filename-pattern:

  • {domain} - Domain name (example_com)
  • {timestamp} - Current timestamp (20260104_143022)
  • {url_hash} - MD5 hash of URL (first 8 chars)
  • {date} - Current date (2026-01-04)
  • {time} - Current time (143022)

Example:

netpull https://example.com --filename-pattern "{domain}_{date}"
# Output: example_com_2026-01-04.png
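
The token expansion can be pictured with a small standard-library sketch. `render_filename` is a hypothetical helper written to reproduce the example values listed above; how NetPull derives `{domain}` for hosts with subdomains or ports is an assumption here.

```python
import hashlib
from datetime import datetime
from urllib.parse import urlparse


def render_filename(pattern: str, url: str, now: datetime) -> str:
    """Expand filename tokens (illustrative, not NetPull's own code)."""
    host = urlparse(url).hostname or "unknown"
    tokens = {
        "domain": host.replace(".", "_"),             # example.com -> example_com
        "timestamp": now.strftime("%Y%m%d_%H%M%S"),   # 20260104_143022
        "url_hash": hashlib.md5(url.encode("utf-8")).hexdigest()[:8],
        "date": now.strftime("%Y-%m-%d"),             # 2026-01-04
        "time": now.strftime("%H%M%S"),               # 143022
    }
    return pattern.format(**tokens)


if __name__ == "__main__":
    moment = datetime(2026, 1, 4, 14, 30, 22)
    print(render_filename("{domain}_{date}", "https://example.com", moment))
    # prints: example_com_2026-01-04
```

Including `{url_hash}` or `{timestamp}` in the pattern helps avoid collisions when several pages from the same domain are extracted in one batch.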

Configuration Classes

BrowserConfig

from netpull import BrowserConfig

config = BrowserConfig(
    browser_type='firefox',  # 'firefox', 'chrome', or 'webkit'
    headless=True,           # Run without GUI
    timeout=30000,           # Navigation timeout (ms)
    viewport_width=1920,     # Browser width
    viewport_height=1080,    # Browser height
    user_agent=None          # Custom user agent
)

ExtractionConfig

from netpull import ExtractionConfig
from pathlib import Path

config = ExtractionConfig(
    # Output
    output_dir=Path('./extracted'),
    filename_pattern='{domain}_{timestamp}',

    # Extraction toggles
    extract_screenshot=True,
    extract_html=True,
    extract_images=False,
    extract_tables=False,
    extract_forms=False,
    extract_metadata=False,

    # Navigation
    wait_for_networkidle=True,
    wait_for_timeout=1000,
    wait_for_selector=None,
    scroll_to_bottom=True,

    # Cookie consent
    handle_cookie_consent=True,
    cookie_consent_timeout=2000,

    # Batch
    batch_concurrency=3,
    batch_delay=0
)

Result Object

The ExtractionResult object contains:

from netpull import extract_webpage

result = extract_webpage('https://example.com')

result.url              # URL that was extracted
result.success          # True if successful, False otherwise
result.error            # Error message if failed
result.screenshot_path  # Path to screenshot (Path object)
result.html_path        # Path to HTML file (Path object)
result.structured_data  # Dict with title, main_content, links
result.images           # List of image data
result.tables           # List of table data
result.forms            # List of form data
result.metadata         # Dict with opengraph, twitter, json_ld

# Convert to dictionary for JSON
result.to_dict()

# String representation
print(result)
# ✓ https://example.com
#   Screenshot: ./extracted/example_com_20260104_143022.png
#   HTML: ./extracted/example_com_20260104_143022.html
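
When writing the output of `to_dict()` to a JSON file, values such as `Path` objects may need converting first. The sketch below uses a plain dictionary as a stand-in, because the exact keys and value types returned by `to_dict()` are not documented here; `to_json` is a hypothetical helper.

```python
import json
from pathlib import Path

# Stand-in for result.to_dict(); the real keys and value types may differ.
sample = {
    "url": "https://example.com",
    "success": True,
    "screenshot_path": Path("extracted/example_com_20260104_143022.png"),
}


def to_json(data: dict) -> str:
    """Serialize a result dictionary, converting non-JSON types via str()."""
    return json.dumps(data, default=str, indent=2)


if __name__ == "__main__":
    print(to_json(sample))
```

If `to_dict()` already returns only JSON-compatible values, `default=str` is simply never invoked.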

Troubleshooting

Playwright browsers not installed

playwright install firefox
# or install all browsers
playwright install

Timeout errors

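Increase the navigation timeout in the browser config (the value is in milliseconds):

from netpull import BrowserConfig

browser_config = BrowserConfig(timeout=60000)  # 60 seconds

Or via the CLI, where --timeout takes seconds: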

netpull https://example.com --timeout 60

Cookie consent not handled

NetPull tries 11 common cookie-consent selectors. For sites whose cookie popups do not match any of them, disable automatic handling:

netpull https://example.com --no-cookie-consent

Memory issues with large batches

Reduce concurrency:

netpull -f urls.txt --concurrency 1

Development

Install for development

git clone https://github.com/netpull/netpull.git
cd netpull
pip install -e ".[dev]"
playwright install firefox

Run tests

pytest tests/ -v

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

Built with:


NetPull - Simple, powerful web content extraction
