Web content extraction with screenshots and structured data
NetPull
NetPull is a Python package for extracting web content including full-page screenshots, HTML, and structured data (headings, paragraphs, links, tables, forms, metadata). Built on Playwright for reliable cross-browser automation.
Features
- 📸 Full-page screenshots - Capture entire webpages including lazy-loaded content
- 📄 HTML extraction - Save cleaned HTML without scripts/styles
- 🔍 Structured data - Extract titles, headings, paragraphs, and links
- 📊 Advanced extraction - Tables, forms, images, and metadata (OpenGraph, Twitter Cards)
- 🌐 Multi-browser - Firefox, Chrome, and WebKit support
- 🍪 Cookie consent - Automatic handling of cookie popups
- ⚡ Async support - Both synchronous and asynchronous APIs
- 🔄 Batch processing - Extract multiple URLs with concurrency control
- 🎯 CLI & Library - Use as command-line tool or Python library
Installation
pip install netpull
After installation, install Playwright browsers:
playwright install firefox
# or
playwright install chrome webkit
Quick Start
Command Line
# Extract single URL
netpull https://example.com
# Extract with all features
netpull https://example.com --extract-all
# Batch extraction from file
netpull -f urls.txt --browser chrome --concurrency 5
# Custom output directory
netpull https://example.com -o ./my-output --filename-pattern "{domain}_{date}"
Python Library
from netpull import extract_webpage
# Basic extraction
result = extract_webpage('https://example.com')
print(result.screenshot_path) # ./extracted/example_com_20260104_143022.png
print(result.structured_data['title']) # Page title
Async Usage
import asyncio
from netpull import extract_webpage_async
async def main():
    result = await extract_webpage_async('https://example.com')
    print(result.structured_data)
asyncio.run(main())
Usage Examples
Advanced Configuration
from netpull import extract_webpage, ExtractionConfig, BrowserConfig
from pathlib import Path
# Configure browser
browser_config = BrowserConfig(
    browser_type='chrome',
    headless=True,
    timeout=60000  # 60 seconds
)
# Configure extraction
extraction_config = ExtractionConfig(
    output_dir=Path('./output'),
    filename_pattern='{domain}_{timestamp}',
    extract_images=True,
    extract_tables=True,
    extract_forms=True,
    extract_metadata=True,
    scroll_to_bottom=True,
    handle_cookie_consent=True
)
result = extract_webpage(
    'https://example.com',
    extraction_config=extraction_config,
    browser_config=browser_config
)
Batch Processing
from netpull import extract_batch
urls = [
    'https://example.com',
    'https://github.com',
    'https://python.org'
]
def progress(current, total, url):
    print(f"[{current}/{total}] Processing {url}")
results = extract_batch(urls, progress_callback=progress)
# Check results
for result in results:
    if result.success:
        print(f"✓ {result.url}: {result.screenshot_path}")
    else:
        print(f"✗ {result.url}: {result.error}")
Extract Specific Content
from netpull import extract_webpage, ExtractionConfig
config = ExtractionConfig(
    extract_images=True,
    extract_tables=True,
    extract_metadata=True
)
result = extract_webpage('https://example.com', extraction_config=config)
# Access extracted data
print(f"Found {len(result.images)} images")
print(f"Found {len(result.tables)} tables")
print(f"OpenGraph title: {result.metadata['opengraph'].get('og:title', 'N/A')}")
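To illustrate the shape of the metadata dictionary used above, here is a minimal sketch that collects OpenGraph and Twitter Card tags with the standard library's HTML parser. It is not NetPull's implementation; `MetaTagCollector` and `collect_metadata` are hypothetical names, and the real package may parse pages differently (for example with BeautifulSoup).

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect OpenGraph and Twitter Card <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {"opengraph": {}, "twitter": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # OpenGraph uses the `property` attribute; Twitter Cards usually use `name`.
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content is None:
            return
        if key.startswith("og:"):
            self.metadata["opengraph"][key] = content
        elif key.startswith("twitter:"):
            self.metadata["twitter"][key] = content


def collect_metadata(html):
    """Hypothetical helper: parse `html` and return the collected tags."""
    parser = MetaTagCollector()
    parser.feed(html)
    return parser.metadata


sample = """
<html><head>
<meta property="og:title" content="Example Domain">
<meta name="twitter:card" content="summary">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""
metadata = collect_metadata(sample)
print(metadata["opengraph"].get("og:title", "N/A"))
```

Using `.get()` with a default, as in the example above, keeps the lookup safe on pages that do not declare a given tag.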
CLI Reference
Basic Usage
netpull URL [URL ...]
netpull -f FILE
Browser Options
- --browser {firefox,chrome,webkit} - Browser to use (default: firefox)
- --headless/--no-headless - Headless mode (default: headless)
- --timeout SECONDS - Navigation timeout (default: 30)
- --user-agent STRING - Custom user agent
Output Options
- -o DIR/--output-dir DIR - Output directory (default: ./extracted)
- --filename-pattern PATTERN - Filename pattern (default: {domain}_{timestamp})
- --output-format {text,json} - CLI output format
Extraction Options
- --extract-images - Extract images
- --extract-tables - Extract tables
- --extract-forms - Extract forms
- --extract-metadata - Extract metadata
- --extract-all - Enable all extraction features
- --no-screenshot - Disable screenshot
- --no-html - Disable HTML saving
Navigation Options
- --wait-for-selector SELECTOR - Wait for CSS selector
- --wait-for-timeout MS - Wait timeout (default: 1000)
- --no-scroll - Disable scroll to bottom
- --no-cookie-consent - Disable cookie consent handling
Batch Options
- --concurrency N - Max concurrent extractions (default: 3)
- --delay SECONDS - Delay between requests (default: 0)
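For readers curious how a concurrency cap and a per-request delay can interact, the following is a minimal sketch using `asyncio.Semaphore`. It is an illustration under assumptions, not NetPull's code: `run_batch` and `fake_extract` are hypothetical names, and where NetPull actually applies the delay is not documented here.

```python
import asyncio


async def run_batch(urls, worker, concurrency=3, delay=0.0):
    """Run worker(url) for each URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            result = await worker(url)
            if delay:
                # Hold the slot briefly so requests are spaced out.
                await asyncio.sleep(delay)
            return result

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_extract(url):
        # Stand-in for a real page extraction; tracks concurrent calls.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return url.upper()

    urls = [f"https://example.com/{i}" for i in range(10)]
    results = await run_batch(urls, fake_extract, concurrency=3)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```

A lower cap trades throughput for memory, since each in-flight extraction typically holds a browser page open.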
Other Options
- --retry N - Retry count on failure (default: 0)
- --retry-delay SECONDS - Delay between retries (default: 5)
- -v/--verbose - Verbose output
- --version - Show version
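The retry options can be pictured as a simple loop with a fixed pause between attempts. The sketch below assumes that behaviour; `with_retries` is a hypothetical helper, and NetPull may decide which errors are retryable in its own way.

```python
import time


def with_retries(func, retries=0, retry_delay=5.0, sleep=time.sleep):
    """Call func(); on an exception, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(retry_delay)


calls = []


def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("navigation timed out")
    return "ok"


outcome = with_retries(flaky, retries=2, retry_delay=0)
print(outcome, len(calls))
```

With `retries=0`, the default listed above, the first failure is raised immediately.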
Filename Patterns
Use tokens in --filename-pattern:
- {domain} - Domain name (example_com)
- {timestamp} - Current timestamp (20260104_143022)
- {url_hash} - MD5 hash of URL (first 8 chars)
- {date} - Current date (2026-01-04)
- {time} - Current time (143022)
Example:
netpull https://example.com --filename-pattern "{domain}_{date}"
# Output: example_com_2026-01-04.png
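As a rough illustration of how these tokens could expand, here is a sketch built on `str.format`. It is not NetPull's implementation: `render_filename` is a hypothetical function, and the domain-sanitizing rule (non-alphanumeric characters become underscores) is an assumption inferred from the `example_com` example.

```python
import hashlib
import re
from datetime import datetime
from urllib.parse import urlparse


def render_filename(pattern, url, now):
    """Expand the documented tokens in `pattern` for a given URL and time."""
    host = urlparse(url).netloc
    tokens = {
        # Assumed rule: runs of non-alphanumeric characters become "_".
        "domain": re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_"),
        "timestamp": now.strftime("%Y%m%d_%H%M%S"),
        "url_hash": hashlib.md5(url.encode("utf-8")).hexdigest()[:8],
        "date": now.strftime("%Y-%m-%d"),
        "time": now.strftime("%H%M%S"),
    }
    return pattern.format(**tokens)


when = datetime(2026, 1, 4, 14, 30, 22)
print(render_filename("{domain}_{date}", "https://example.com", when))
print(render_filename("{domain}_{timestamp}", "https://example.com", when))
```

Including `{url_hash}` in a pattern is one way to avoid collisions when several URLs on the same domain are extracted within the same second.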
Configuration Classes
BrowserConfig
from netpull import BrowserConfig
config = BrowserConfig(
    browser_type='firefox',  # 'firefox', 'chrome', or 'webkit'
    headless=True,           # Run without GUI
    timeout=30000,           # Navigation timeout (ms)
    viewport_width=1920,     # Browser width
    viewport_height=1080,    # Browser height
    user_agent=None          # Custom user agent
)
ExtractionConfig
from netpull import ExtractionConfig
from pathlib import Path
config = ExtractionConfig(
    # Output
    output_dir=Path('./extracted'),
    filename_pattern='{domain}_{timestamp}',
    # Extraction toggles
    extract_screenshot=True,
    extract_html=True,
    extract_images=False,
    extract_tables=False,
    extract_forms=False,
    extract_metadata=False,
    # Navigation
    wait_for_networkidle=True,
    wait_for_timeout=1000,
    wait_for_selector=None,
    scroll_to_bottom=True,
    # Cookie consent
    handle_cookie_consent=True,
    cookie_consent_timeout=2000,
    # Batch
    batch_concurrency=3,
    batch_delay=0
)
Result Object
The ExtractionResult object contains:
result = extract_webpage('https://example.com')
result.url # URL that was extracted
result.success # True if successful, False otherwise
result.error # Error message if failed
result.screenshot_path # Path to screenshot (Path object)
result.html_path # Path to HTML file (Path object)
result.structured_data # Dict with title, main_content, links
result.images # List of image data
result.tables # List of table data
result.forms # List of form data
result.metadata # Dict with opengraph, twitter, json_ld
# Convert to dictionary for JSON
result.to_dict()
# String representation
print(result)
# ✓ https://example.com
# Screenshot: ./extracted/example_com_20260104_143022.png
# HTML: ./extracted/example_com_20260104_143022.html
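If the dictionary from `to_dict()` is written to disk as JSON, path values may need converting first. The sketch below uses a stand-in dictionary rather than a real result, since the exact keys and value types returned by `to_dict()` are not documented here; treat both as assumptions.

```python
import json
from pathlib import Path

# Stand-in for result.to_dict(); the real keys and value types may differ.
result_dict = {
    "url": "https://example.com",
    "success": True,
    "error": None,
    "screenshot_path": Path("extracted/example_com_20260104_143022.png"),
    "structured_data": {"title": "Example Domain"},
}

# default=str converts values json cannot encode natively (such as Path).
payload = json.dumps(result_dict, indent=2, default=str)
restored = json.loads(payload)
print(restored["screenshot_path"])
```

If `to_dict()` already returns plain strings, `default=str` is simply never invoked, so passing it is harmless.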
Troubleshooting
Playwright browsers not installed
playwright install firefox
# or install all browsers
playwright install
Timeout errors
Increase the timeout in the browser config (value in milliseconds):
browser_config = BrowserConfig(timeout=60000) # 60 seconds
Or via the CLI (value in seconds):
netpull https://example.com --timeout 60
Cookie consent not handled
The package tries 11 common selectors. For sites with unusual cookie popups, disable auto-handling:
netpull https://example.com --no-cookie-consent
Memory issues with large batches
Reduce concurrency:
netpull -f urls.txt --concurrency 1
Development
Install for development
git clone https://github.com/netpull/netpull.git
cd netpull
pip install -e ".[dev]"
playwright install firefox
Run tests
pytest tests/ -v
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Credits
Built with:
- Playwright - Browser automation
- BeautifulSoup - HTML parsing
NetPull - Simple, powerful web content extraction
File details
Details for the file netpull-0.1.0.tar.gz.
File metadata
- Download URL: netpull-0.1.0.tar.gz
- Upload date:
- Size: 21.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 12c4f2d189566098d9fdd43aadef2fb1a4952a87f8ef7073bbec52c5a584f585 |
| MD5 | 23e6084e639be3270a7e809d3ac748a3 |
| BLAKE2b-256 | 349b90ffddb66da0396415610e41729c8baba07664271bac50f983b4c348ec5d |
File details
Details for the file netpull-0.1.0-py3-none-any.whl.
File metadata
- Download URL: netpull-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aded1c13943817ea28909a64c9c8d1eb41b1ef2fab143d726e276f992e4b41bf |
| MD5 | 09f895c91c7f8fd4a0a45802aec8dbc4 |
| BLAKE2b-256 | cf9b42f6c15e493a12dd27fb1c69dec46fa76815c6dd27b9974330bb419927e1 |