A versatile web scraping library for extracting content from websites
WebScraper-Plus
A versatile web scraping library for extracting content from websites with robust error handling and customizable output options.
Features
- Text Extraction: Extract and clean text content from web pages
- Link Extraction: Find and save all hyperlinks from a page
- Document Download: Download documents like PDF, DOCX, CSV, Excel files
- Document Text Extraction: Extract text from downloaded documents
- Image Download: Download images from web pages
- OCR Capability: Extract text from images using Tesseract OCR
- Configurable Output: Choose between flat or nested directory structure
- Robust Error Handling: Gracefully handle network issues, missing dependencies, and more
- Command-line Interface: Easy to use CLI for quick scraping tasks
- Python API: Clean API for integration into your Python projects
Installation
Basic Installation
pip install webscraper-plus
With Document Support
pip install "webscraper-plus[pdf,docx,excel]"
With Image OCR Support
pip install "webscraper-plus[ocr]"
Full Installation (All Features)
pip install "webscraper-plus[all]"
Development Installation
pip install "webscraper-plus[dev]"
Dependencies
- Core: requests, BeautifulSoup4, spacy, chardet
- PDF: PyPDF2
- DOCX: python-docx
- Excel: openpyxl
- OCR: pytesseract, Pillow (and Tesseract OCR system package)
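Since most of these packages are optional, it can be useful to check which ones are importable before enabling a feature. The following is a minimal sketch using only the standard library. The `OPTIONAL_MODULES` mapping and the `available_features` helper are illustrative names and are not part of WebScraper-Plus; the module names are the import names of the packages listed above (`docx` for python-docx, `PIL` for Pillow).

```python
from importlib.util import find_spec

# Feature name -> top-level modules installed by the packages listed above.
# This mapping is an illustration, not something WebScraper-Plus exposes.
OPTIONAL_MODULES = {
    "pdf": ["PyPDF2"],
    "docx": ["docx"],
    "excel": ["openpyxl"],
    "ocr": ["pytesseract", "PIL"],
}


def available_features(modules=OPTIONAL_MODULES):
    """Map each feature to True if all of its modules can be found."""
    return {
        feature: all(find_spec(name) is not None for name in names)
        for feature, names in modules.items()
    }


if __name__ == "__main__":
    for feature, ok in available_features().items():
        status = "available" if ok else "missing"
        print(f"{feature}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages.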
Tesseract OCR
For OCR functionality, you need to install Tesseract OCR on your system:
- Windows: Download installer from GitHub
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Fedora/RHEL: sudo dnf install tesseract
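To confirm that the Tesseract binary is reachable after installation, a quick check from Python can look like the sketch below. It assumes the executable is named `tesseract` and is on `PATH`; the `tesseract_path` helper is a hypothetical name, not a WebScraper-Plus function.

```python
import shutil


def tesseract_path(binary="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = tesseract_path()
    if path:
        print(f"Tesseract found at {path}")
    else:
        print("Tesseract not found; install it before enabling OCR")
```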
Command-line Usage
# Basic usage (extracts all content types)
webscraper-plus --url https://example.com --output ./scraped_data
# Extract only text and links
webscraper-plus --url https://example.com --text --links
# Extract documents and images with flat structure
webscraper-plus --url https://example.com --docs --images --flat
# Extract all content types with verbose logging
webscraper-plus --url https://example.com --all --verbose
Command-line Options
-h, --help       show this help message and exit
--version        show program's version number and exit
--url URL        URL to scrape
--output OUTPUT  Output directory for scraped content
--text           Extract webpage text
--links          Extract hyperlinks
--docs           Download documents
--images         Download images
--flat           Use flat directory structure
--all            Enable all extraction options
--verbose        Enable verbose logging
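When the CLI is driven from another script, assembling the argument list in one place keeps the flags consistent. The sketch below uses only the flags documented above; the `build_command` helper and its parameter names are assumptions made for illustration, and the snippet only prints the command rather than running it.

```python
import shlex


def build_command(url, output=None, text=False, links=False, docs=False,
                  images=False, flat=False, all_types=False, verbose=False):
    """Build an argument list for the webscraper-plus CLI."""
    cmd = ["webscraper-plus", "--url", url]
    if output is not None:
        cmd += ["--output", output]
    # Boolean switches, in the order they appear in the help text.
    switches = {
        "--text": text,
        "--links": links,
        "--docs": docs,
        "--images": images,
        "--flat": flat,
        "--all": all_types,
        "--verbose": verbose,
    }
    cmd += [flag for flag, enabled in switches.items() if enabled]
    return cmd


if __name__ == "__main__":
    cmd = build_command("https://example.com", output="./scraped_data",
                        text=True, links=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.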
Python Usage
Basic Usage
from webscraper import WebScraper

# Create scraper (extracts everything)
scraper = WebScraper(
    url="https://example.com",
    base_output_dir="./scraped_data",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Run scraper and get results
results = scraper.scrape()
print(results)
Selective Extraction
from webscraper import WebScraper

# Create scraper with selective extraction
scraper = WebScraper(
    url="https://example.com",
    extract_text=True,        # Only extract text
    extract_links=False,
    extract_documents=False,
    extract_images=False,
    flat_structure=True       # Use flat directory structure
)

# Run scraper
results = scraper.scrape()
Output Structure
By default, WebScraper-Plus uses a nested directory structure:
output_directory/
├── domain_timestamp/
│   ├── webpage_text.txt
│   ├── links/
│   │   └── links.txt
│   ├── documents/
│   │   ├── document1.pdf
│   │   ├── document1.txt (extracted text)
│   │   └── ...
│   └── images/
│       ├── image1.jpg
│       ├── image1.txt (OCR text)
│       └── ...
With flat_structure=True, all files are saved in a single directory with prefixes:
output_directory/
├── domain_timestamp/
│   ├── webpage_text_main.txt
│   ├── links_extracted.txt
│   ├── doc_document1.pdf
│   ├── doc_document1.txt
│   ├── img_image1.jpg
│   ├── img_ocr_image1.txt
│   └── ...
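For post-processing scripts that need to locate files in either layout, the naming rules shown in the two trees can be captured in a small helper. This is a sketch inferred only from the trees above, not from the library's source: the `artifact_path` function and its `kind` values are hypothetical, and the exact format of the `domain_timestamp` directory name is not specified here, so the run directory is passed in by the caller.

```python
from pathlib import PurePosixPath


def artifact_path(run_dir, kind, filename=None, flat=False):
    """Return where an artifact is expected, following the trees above."""
    run = PurePosixPath(run_dir)
    if kind == "text":
        return run / ("webpage_text_main.txt" if flat else "webpage_text.txt")
    if kind == "links":
        return run / "links_extracted.txt" if flat else run / "links" / "links.txt"
    if filename is None:
        raise ValueError(f"kind {kind!r} requires a filename")
    stem = PurePosixPath(filename).stem  # "image1.jpg" -> "image1"
    if kind == "document":
        return run / f"doc_{filename}" if flat else run / "documents" / filename
    if kind == "document_text":
        return run / f"doc_{stem}.txt" if flat else run / "documents" / f"{stem}.txt"
    if kind == "image":
        return run / f"img_{filename}" if flat else run / "images" / filename
    if kind == "image_ocr":
        return run / f"img_ocr_{stem}.txt" if flat else run / "images" / f"{stem}.txt"
    raise ValueError(f"unknown kind: {kind!r}")


if __name__ == "__main__":
    # "example.com_run" stands in for the real domain_timestamp directory.
    print(artifact_path("out/example.com_run", "image_ocr", "image1.jpg", flat=True))
    print(artifact_path("out/example.com_run", "document_text", "document1.pdf"))
```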
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Examples
The package includes several examples to help you get started:
Basic Usage Example
The examples/basic_usage.py script demonstrates how to use WebScraper with all extraction options enabled:
from webscraper import WebScraper

# Create WebScraper instance with all extraction options enabled
scraper = WebScraper(
    url="https://en.wikipedia.org/wiki/Web_scraping",
    base_output_dir="./scraped_output",
    extract_text=True,
    extract_links=True,
    extract_documents=True,
    extract_images=True
)

# Execute scraping
results = scraper.scrape()
Selective Scraping Example
The examples/selective_scraping.py script shows how to use WebScraper with specific extraction options:
from webscraper import WebScraper

# Example 1: Extract only text and links with flat directory structure
text_links_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=True,
    extract_links=True,
    extract_documents=False,
    extract_images=False,
    flat_structure=True
)

# Example 2: Extract only images with nested directory structure
images_scraper = WebScraper(
    url="https://www.python.org",
    base_output_dir="./python_org_output",
    extract_text=False,
    extract_links=False,
    extract_documents=False,
    extract_images=True,
    flat_structure=False
)
Running Examples
To run the examples, clone the repository and run:
# Basic usage example
python examples/basic_usage.py
# Selective scraping example
python examples/selective_scraping.py