WebScraper-Plus

A versatile web scraping library for extracting content from websites with robust error handling and customizable output options.

Features

Text Extraction: Extract and clean text content from web pages
Link Extraction: Find and save all hyperlinks from a page
Document Download: Download documents like PDF, DOCX, CSV, Excel files
Document Text Extraction: Extract text from downloaded documents
Image Download: Download images from web pages
OCR Capability: Extract text from images using Tesseract OCR
Configurable Output: Choose between flat or nested directory structure
Robust Error Handling: Gracefully handle network issues, missing dependencies, and more
Command-line Interface: Easy to use CLI for quick scraping tasks
Python API: Clean API for integration into your Python projects

Installation

Basic Installation

pip install webscraper-plus

With Document Support

pip install "webscraper-plus[pdf,docx,excel]"

With Image OCR Support

pip install "webscraper-plus[ocr]"

Full Installation (All Features)

pip install "webscraper-plus[all]"

Development Installation

pip install "webscraper-plus[dev]"

Dependencies

Core: requests, BeautifulSoup4, spacy, chardet
PDF: PyPDF2
DOCX: python-docx
Excel: openpyxl
OCR: pytesseract, Pillow (and Tesseract OCR system package)
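
To see which optional features are usable in the current environment, the check below may help. It is a minimal sketch, not part of WebScraper-Plus: the names OPTIONAL_MODULES, modules_available, and available_extras are hypothetical, and the module names are the usual import names of the packages listed above (python-docx imports as docx, Pillow as PIL).

```python
from importlib.util import find_spec

# Extra name -> top-level modules it needs. The extras mirror the install
# commands above; this mapping is an assumption, not the package's own table.
OPTIONAL_MODULES = {
    "pdf": ["PyPDF2"],
    "docx": ["docx"],
    "excel": ["openpyxl"],
    "ocr": ["pytesseract", "PIL"],
}


def modules_available(module_names):
    """Return True if every named top-level module can be located."""
    return all(find_spec(name) is not None for name in module_names)


def available_extras(optional_modules=OPTIONAL_MODULES):
    """Map each extra to whether its Python dependencies are importable."""
    return {
        extra: modules_available(modules)
        for extra, modules in optional_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'available' if ok else 'missing'}")
```

The check only looks modules up without importing them, so it is cheap to run before a scrape.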

Tesseract OCR

For OCR functionality, you need to install Tesseract OCR on your system:

Windows: Download installer from GitHub
macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Fedora/RHEL: sudo dnf install tesseract
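
After installing, you can confirm that the Tesseract executable is on your PATH. The following is a small sketch using only the standard library; find_tesseract and tesseract_version are hypothetical helper names, not functions provided by WebScraper-Plus.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the Tesseract executable, or None if absent."""
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```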

Command-line Usage

# Basic usage (extracts all content types)
webscraper-plus --url https://example.com --output ./scraped_data

# Extract only text and links
webscraper-plus --url https://example.com --text --links

# Extract documents and images with flat structure
webscraper-plus --url https://example.com --docs --images --flat

# Extract all content types with verbose logging
webscraper-plus --url https://example.com --all --verbose

Command-line Options

  -h, --help       show this help message and exit
  --version        show program's version number and exit
  --url URL        URL to scrape
  --output OUTPUT  Output directory for scraped content
  --text           Extract webpage text
  --links          Extract hyperlinks
  --docs           Download documents
  --images         Download images
  --flat           Use flat directory structure
  --all            Enable all extraction options
  --verbose        Enable verbose logging
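
If you want to drive the CLI from a Python script, for example in a scheduled job, the options above can be assembled into an argument list. This is a sketch under the assumption that the webscraper-plus command is installed and on your PATH; build_command and run_scrape are hypothetical helpers, and only the flags listed above are used.

```python
import subprocess


def build_command(url, output=None, text=False, links=False, docs=False,
                  images=False, flat=False, all_types=False, verbose=False):
    """Assemble the argument list for the webscraper-plus CLI."""
    command = ["webscraper-plus", "--url", url]
    if output is not None:
        command += ["--output", output]
    flags = [
        ("--text", text),
        ("--links", links),
        ("--docs", docs),
        ("--images", images),
        ("--flat", flat),
        ("--all", all_types),
        ("--verbose", verbose),
    ]
    # Append only the switches that were requested, in a stable order.
    command += [flag for flag, enabled in flags if enabled]
    return command


def run_scrape(url, **options):
    """Run the CLI and return its exit code (requires the CLI to be installed)."""
    return subprocess.run(build_command(url, **options), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.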

Python Usage

Basic Usage

from webscraper import WebScraper

# Create scraper (extracts everything)
scraper = WebScraper(
    url="https://example.com",
    base_output_dir="./scraped_data",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Run scraper and get results
results = scraper.scrape()
print(results)

Selective Extraction

from webscraper import WebScraper

# Create scraper with selective extraction
scraper = WebScraper(
    url="https://example.com",
    extract_text=True,     # Only extract text
    extract_links=False,
    extract_documents=False,
    extract_images=False,
    flat_structure=True    # Use flat directory structure
)

# Run scraper
results = scraper.scrape()
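
To scrape several pages in one run, a loop along these lines may be useful. It is a sketch, not part of the library: scrape_all is a hypothetical helper, and it assumes that scrape() can raise an exception on failure, which this description does not state. The scraper class is passed in as an argument, so the helper relies only on the url and base_output_dir parameters shown above.

```python
def scrape_all(urls, make_scraper, base_output_dir="./scraped_data"):
    """Scrape each URL in turn, collecting results and errors separately."""
    results, errors = {}, {}
    for url in urls:
        try:
            scraper = make_scraper(url=url, base_output_dir=base_output_dir)
            results[url] = scraper.scrape()
        except Exception as exc:
            # Record the failure and continue with the remaining URLs.
            errors[url] = exc
    return results, errors
```

With the package installed, a call could look like scrape_all(["https://example.com"], WebScraper).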

Output Structure

By default, WebScraper-Plus uses a nested directory structure:

output_directory/
├── domain_timestamp/
│   ├── webpage_text.txt
│   ├── links/
│   │   └── links.txt
│   ├── documents/
│   │   ├── document1.pdf
│   │   ├── document1.txt (extracted text)
│   │   └── ...
│   └── images/
│       ├── image1.jpg
│       ├── image1.txt (OCR text)
│       └── ...
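
Given the nested layout above, a run directory can be inspected with pathlib. The sketch below counts the files in each subdirectory; summarize_run is a hypothetical helper, and the file and folder names are taken from the tree shown above.

```python
from pathlib import Path


def summarize_run(run_dir):
    """Count the files in each subdirectory of one nested scrape run."""
    run_dir = Path(run_dir)
    summary = {"text": (run_dir / "webpage_text.txt").is_file()}
    for name in ("links", "documents", "images"):
        folder = run_dir / name
        if folder.is_dir():
            summary[name] = sum(1 for path in folder.iterdir() if path.is_file())
        else:
            # A missing folder means that content type was not extracted.
            summary[name] = 0
    return summary
```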

With flat_structure=True, all files are saved in a single directory with prefixes:

output_directory/
├── domain_timestamp/
│   ├── webpage_text_main.txt
│   ├── links_extracted.txt
│   ├── doc_document1.pdf
│   ├── doc_document1.txt
│   ├── img_image1.jpg
│   ├── img_ocr_image1.txt
│   └── ...
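
In the flat layout the file name prefix is the only indication of what a file contains. The following sketch groups file names by prefix; the prefixes are read off the listing above, while PREFIXES, classify, and group_by_kind are hypothetical names and the category labels are chosen for illustration.

```python
# Longer prefixes come first so "img_ocr_" is not mistaken for "img_".
PREFIXES = [
    ("img_ocr_", "ocr_text"),
    ("img_", "image"),
    ("doc_", "document"),
    ("links_", "links"),
    ("webpage_text_", "page_text"),
]


def classify(filename):
    """Return the content category implied by a flat-layout file name."""
    for prefix, kind in PREFIXES:
        if filename.startswith(prefix):
            return kind
    return "other"


def group_by_kind(filenames):
    """Group file names into a dict keyed by category."""
    groups = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups
```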

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Examples

The package includes several examples to help you get started:

Basic Usage Example

The examples/basic_usage.py script demonstrates how to use WebScraper with all extraction options enabled:

from webscraper import WebScraper

# Create WebScraper instance with all extraction options enabled
scraper = WebScraper(
    url="https://en.wikipedia.org/wiki/Web_scraping",
    base_output_dir="./scraped_output",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Execute scraping
results = scraper.scrape()

Selective Scraping Example

The examples/selective_scraping.py script shows how to use WebScraper with specific extraction options:

from webscraper import WebScraper

# Example 1: Extract only text and links with flat directory structure
text_links_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=True,
    extract_links=True,
    extract_documents=False,
    extract_images=False,
    flat_structure=True
)

# Example 2: Extract only images with nested directory structure
images_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=False,
    extract_links=False,
    extract_documents=False,
    extract_images=True,
    flat_structure=False
)

Running Examples

To run the examples, clone the repository and run:

# Basic usage example
python examples/basic_usage.py

# Selective scraping example
python examples/selective_scraping.py

Project details

This version

0.1.0

May 9, 2025

Download files

Source Distribution

webscraper_plus-0.1.0.tar.gz (18.6 kB)

Uploaded May 9, 2025 Source

Built Distribution

webscraper_plus-0.1.0-py3-none-any.whl (16.8 kB)

Uploaded May 9, 2025 Python 3

File details

Details for the file webscraper_plus-0.1.0.tar.gz.

File metadata

Download URL: webscraper_plus-0.1.0.tar.gz
Upload date: May 9, 2025
Size: 18.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for webscraper_plus-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`752eab23d52ea337f159cc2b1ca5676460c390b55f006a73cc43a244e97bfc72`
MD5	`e78ef5363a3034edc6811beebb1e215c`
BLAKE2b-256	`3d4c52c36756c1711aa635491996cc3a4202078c60bbd1599fb12167591e01fd`

File details

Details for the file webscraper_plus-0.1.0-py3-none-any.whl.

File metadata

Download URL: webscraper_plus-0.1.0-py3-none-any.whl
Upload date: May 9, 2025
Size: 16.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for webscraper_plus-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c9890d2f7a1d450db698483e5fb99ed964a529bfe2930b642130dee853d0e274`
MD5	`c3bee7e8c48cc0dda178788112cde3dc`
BLAKE2b-256	`dd3cd9eac8ab74f757539959116acdf42152306cdef96c87238dde6147d3c156`
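
A downloaded file can be checked against the SHA256 digests listed above with the standard hashlib module. This is a minimal sketch; sha256_of and matches_digest are hypothetical helper names, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_digest("webscraper_plus-0.1.0.tar.gz", "752eab23...") with the full digest from the table should return True for an intact download.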
