WebScraper-Plus

A versatile web scraping library for extracting content from websites with robust error handling and customizable output options.

Features

  • Text Extraction: Extract and clean text content from web pages
  • Link Extraction: Find and save all hyperlinks from a page
  • Document Download: Download documents such as PDF, DOCX, CSV, and Excel files
  • Document Text Extraction: Extract text from downloaded documents
  • Image Download: Download images from web pages
  • OCR Capability: Extract text from images using Tesseract OCR
  • Configurable Output: Choose between flat or nested directory structure
  • Robust Error Handling: Gracefully handle network issues, missing dependencies, and more
  • Command-line Interface: Easy-to-use CLI for quick scraping tasks
  • Python API: Clean API for integration into your Python projects
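To make the link-extraction feature concrete, here is a minimal standard-library sketch of what collecting hyperlinks from a page involves. It is not WebScraper-Plus's implementation (the package lists BeautifulSoup4 as a dependency); the names `LinkCollector` and `extract_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all hyperlinks found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = '<p><a href="/docs">Docs</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(sample, "https://example.com/index.html"))
```

Relative links are resolved against the page URL so that a saved link list remains usable outside the context of the original page.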

Installation

Basic Installation

pip install webscraper-plus

With Document Support

pip install "webscraper-plus[pdf,docx,excel]"

With Image OCR Support

pip install "webscraper-plus[ocr]"

Full Installation (All Features)

pip install "webscraper-plus[all]"

Development Installation

pip install "webscraper-plus[dev]"

Dependencies

  • Core: requests, BeautifulSoup4, spacy, chardet
  • PDF: PyPDF2
  • DOCX: python-docx
  • Excel: openpyxl
  • OCR: pytesseract, Pillow (plus the Tesseract OCR system package)
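Because the document and OCR dependencies are optional, it can help to check which ones are importable before scraping. The sketch below uses only the standard library. The mapping `OPTIONAL_MODULES` and the function `available_features` are hypothetical names, not part of WebScraper-Plus. The module names are the import names of the packages listed above (python-docx imports as `docx`, Pillow as `PIL`).

```python
from importlib.util import find_spec

# Feature name -> importable module names (assumed grouping, mirroring the
# dependency list above; not read from the package itself).
OPTIONAL_MODULES = {
    "pdf": ["PyPDF2"],
    "docx": ["docx"],               # python-docx installs the "docx" module
    "excel": ["openpyxl"],
    "ocr": ["pytesseract", "PIL"],  # Pillow installs the "PIL" package
}


def available_features(modules=OPTIONAL_MODULES):
    """Return {feature: True/False} depending on whether its modules import."""
    return {
        feature: all(find_spec(name) is not None for name in names)
        for feature, names in modules.items()
    }


print(available_features())
```

A feature reported as `False` points to the matching extra from the Installation section.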

Tesseract OCR

For OCR functionality, you need to install Tesseract OCR on your system:

  • Windows: Download installer from GitHub
  • macOS: brew install tesseract
  • Ubuntu/Debian: sudo apt-get install tesseract-ocr
  • Fedora/RHEL: sudo dnf install tesseract
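After installing Tesseract, a quick way to confirm that the executable is reachable from Python is to look it up on `PATH`. This is a small standard-library sketch; `tesseract_path` is a hypothetical helper name. It assumes the binary is called `tesseract`, as in the install commands above. On some systems (Windows in particular) the installer may not add it to `PATH` automatically.

```python
import shutil


def tesseract_path():
    """Return the full path of the tesseract executable, or None if absent."""
    return shutil.which("tesseract")


path = tesseract_path()
if path is None:
    print("Tesseract not found on PATH; OCR steps will likely be unavailable.")
else:
    print(f"Tesseract found at {path}")
```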

Command-line Usage

# Basic usage (extracts all content types)
webscraper-plus --url https://example.com --output ./scraped_data

# Extract only text and links
webscraper-plus --url https://example.com --text --links

# Extract documents and images with flat structure
webscraper-plus --url https://example.com --docs --images --flat

# Extract all content types with verbose logging
webscraper-plus --url https://example.com --all --verbose
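When the CLI is driven from a Python script (for example, to scrape several URLs in a loop), the argument list can be assembled programmatically. The sketch below uses only the flags shown in this README. `build_command` is a hypothetical helper, and it assumes the `webscraper-plus` executable is on `PATH`.

```python
import shlex


def build_command(url, output=None, text=False, links=False, docs=False,
                  images=False, flat=False, verbose=False):
    """Build an argv list for the webscraper-plus CLI (illustrative helper)."""
    cmd = ["webscraper-plus", "--url", url]
    if output:
        cmd += ["--output", output]
    flags = {
        "--text": text,
        "--links": links,
        "--docs": docs,
        "--images": images,
        "--flat": flat,
        "--verbose": verbose,
    }
    # Append only the switches that were requested, in a stable order.
    cmd += [flag for flag, enabled in flags.items() if enabled]
    return cmd


cmd = build_command("https://example.com", output="./scraped_data",
                    text=True, links=True)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with URLs that contain characters such as `&` or `?`.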

Command-line Options

  -h, --help       show this help message and exit
  --version        show program's version number and exit
  --url URL        URL to scrape
  --output OUTPUT  Output directory for scraped content
  --text           Extract webpage text
  --links          Extract hyperlinks
  --docs           Download documents
  --images         Download images
  --flat           Use flat directory structure
  --all            Enable all extraction options
  --verbose        Enable verbose logging
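The option list above maps naturally onto Python's `argparse`. The following is a reconstruction from the help text only, intended to show how the flags relate to each other. It is not the package's actual parser; the version string and the choice to leave `--url` optional are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(
        prog="webscraper-plus",
        description="Extract content from a web page.",
    )
    # Version string is an assumption for this sketch.
    parser.add_argument("--version", action="version", version="%(prog)s 0.1.0")
    parser.add_argument("--url", help="URL to scrape")
    parser.add_argument("--output", help="Output directory for scraped content")
    switches = [
        ("--text", "Extract webpage text"),
        ("--links", "Extract hyperlinks"),
        ("--docs", "Download documents"),
        ("--images", "Download images"),
        ("--flat", "Use flat directory structure"),
        ("--all", "Enable all extraction options"),
        ("--verbose", "Enable verbose logging"),
    ]
    for flag, help_text in switches:
        # Each switch is a boolean that defaults to False when omitted.
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


args = build_parser().parse_args(
    ["--url", "https://example.com", "--docs", "--images", "--flat"]
)
print(args.docs, args.images, args.flat, args.text)
```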

Python Usage

Basic Usage

from webscraper import WebScraper

# Create scraper (extracts everything)
scraper = WebScraper(
    url="https://example.com",
    base_output_dir="./scraped_data",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Run scraper and get results
results = scraper.scrape()
print(results)

Selective Extraction

from webscraper import WebScraper

# Create scraper with selective extraction
scraper = WebScraper(
    url="https://example.com",
    extract_text=True,     # Only extract text
    extract_links=False,
    extract_documents=False,
    extract_images=False,
    flat_structure=True    # Use flat directory structure
)

# Run scraper
results = scraper.scrape()

Output Structure

By default, WebScraper-Plus uses a nested directory structure:

output_directory/
├── domain_timestamp/
│   ├── webpage_text.txt
│   ├── links/
│   │   └── links.txt
│   ├── documents/
│   │   ├── document1.pdf
│   │   ├── document1.txt (extracted text)
│   │   └── ...
│   └── images/
│       ├── image1.jpg
│       ├── image1.txt (OCR text)
│       └── ...
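Once a run has finished, the nested layout can be inspected with `pathlib`. The sketch below groups file names by the subfolder they sit in. `summarize_run` is a hypothetical helper, and the demo directory name `example.com_20240101` is made up, since the exact `domain_timestamp` format is not specified here.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def summarize_run(run_dir):
    """Group file names under a run directory by their parent folder."""
    run_dir = Path(run_dir)
    summary = defaultdict(list)
    for path in sorted(run_dir.rglob("*")):
        if path.is_file():
            # Files directly inside run_dir are grouped under ".".
            parent = path.parent.relative_to(run_dir)
            summary[str(parent)].append(path.name)
    return dict(summary)


# Demo on a throwaway directory that imitates part of the nested layout.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "example.com_20240101"  # hypothetical folder name
    (run / "links").mkdir(parents=True)
    (run / "webpage_text.txt").write_text("hello")
    (run / "links" / "links.txt").write_text("https://example.com")
    print(summarize_run(run))
```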

With flat_structure=True, all files are saved in a single directory with prefixes:

output_directory/
├── domain_timestamp/
│   ├── webpage_text_main.txt
│   ├── links_extracted.txt
│   ├── doc_document1.pdf
│   ├── doc_document1.txt
│   ├── img_image1.jpg
│   ├── img_ocr_image1.txt
│   └── ...
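In the flat layout the file-name prefix is the only indication of where a file came from. A small classifier based on the prefixes visible in the tree above might look like this. `PREFIXES` and `classify` are hypothetical names, and the prefix list is inferred from the example tree rather than taken from the package.

```python
# Order matters: "img_ocr_" must be tested before the shorter "img_".
PREFIXES = [
    ("img_ocr_", "ocr_text"),
    ("img_", "image"),
    ("doc_", "document"),
    ("links_", "links"),
    ("webpage_text", "page_text"),
]


def classify(filename):
    """Map a flat-layout file name to a content kind based on its prefix."""
    for prefix, kind in PREFIXES:
        if filename.startswith(prefix):
            return kind
    return "other"


for name in ["webpage_text_main.txt", "links_extracted.txt",
             "doc_document1.pdf", "img_image1.jpg", "img_ocr_image1.txt"]:
    print(f"{name} -> {classify(name)}")
```

Note that `doc_document1.txt` (text extracted from a document) shares the `doc_` prefix with the document itself, so telling the two apart would additionally require looking at the file extension.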

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Examples

The package includes several examples to help you get started:

Basic Usage Example

The examples/basic_usage.py script demonstrates how to use WebScraper with all extraction options enabled:

from webscraper import WebScraper

# Create WebScraper instance with all extraction options enabled
scraper = WebScraper(
    url="https://en.wikipedia.org/wiki/Web_scraping",
    base_output_dir="./scraped_output",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Execute scraping
results = scraper.scrape()

Selective Scraping Example

The examples/selective_scraping.py script shows how to use WebScraper with specific extraction options:

from webscraper import WebScraper

# Example 1: Extract only text and links with flat directory structure
text_links_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=True,
    extract_links=True,
    extract_documents=False,
    extract_images=False,
    flat_structure=True
)

# Example 2: Extract only images with nested directory structure
images_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=False,
    extract_links=False,
    extract_documents=False,
    extract_images=True,
    flat_structure=False
)

Running Examples

To run the examples, clone the repository and run:

# Basic usage example
python examples/basic_usage.py

# Selective scraping example
python examples/selective_scraping.py
