
A versatile web scraping library with configurable multi-strategy fetching and specialized handlers

ScrapeMaster

A powerful and versatile Python library for web scraping, designed to handle everything from simple static pages to complex, JavaScript-heavy websites with advanced anti-bot measures.



🚀 Overview

ScrapeMaster is a comprehensive Python library that simplifies the complexities of web scraping. It intelligently switches between multiple scraping strategies—from simple requests to browser automation with Selenium and undetected-chromedriver—to ensure you get the data you need, when you need it.

Whether you're extracting text, downloading images, converting articles to clean Markdown, crawling entire websites, or even fetching YouTube transcripts, ScrapeMaster provides a unified and powerful API to handle it all.

✨ Key Features

  • Multi-Strategy Scraping: Automatically tries different methods (requests, Selenium, undetected-chromedriver) to bypass anti-bot measures and handle JavaScript-rendered content.
  • Content-to-Markdown: Intelligently extracts the main content from a webpage, removes noise (like headers, footers, ads), and converts it into clean, readable Markdown.
  • Lightweight Document Parsing: Native support for scraping text from PDFs and DOCX files using pypdf and python-docx, with no heavy external dependencies.
  • YouTube Transcripts: Built-in support for fetching video transcripts (manual or auto-generated) via the youtube-transcript-api.
  • Comprehensive Data Extraction: Easily scrape text, images, and other structured data using CSS selectors.
  • Website Crawler: Recursively scrape an entire website by following links up to a specified depth, with domain restrictions to keep the crawl focused.
  • Anti-Bot Circumvention: Utilizes undetected-chromedriver and rotates user agents to appear more like a human user and avoid common blockers.
  • Session & Cookie Management: Persist sessions across requests by saving and loading cookies for both requests and Selenium.
  • Image Downloader: A built-in utility to download all scraped images to a local directory.
  • Robust Error Handling: Gracefully manages failures, providing clear feedback on which strategies failed and why.
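The user-agent rotation mentioned above is a standard evasion technique. As a minimal stdlib-only sketch (the agent strings and header set here are illustrative, not ScrapeMaster's internal list):

```python
import random

# A small pool of common desktop user-agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_headers():
    """Pick a random user agent so successive requests look less uniform."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = build_headers()
```

Pairing a rotated `User-Agent` with realistic companion headers is what makes plain HTTP requests blend in; a lone exotic user agent can itself be a fingerprint.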

📦 Installation

You can install ScrapeMaster directly from PyPI:

pip install ScrapeMaster

The library uses pipmaster to automatically manage and install its dependencies (like requests, selenium, youtube-transcript-api, etc.) upon first use, ensuring a smooth setup process.
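The check-then-install behaviour that pipmaster provides follows a common pattern; a rough stdlib sketch of the idea (this is an assumption about the approach, not pipmaster's actual code):

```python
import importlib.util
import subprocess
import sys

def ensure_installed(package, module_name=None):
    """Install `package` with pip only if its module cannot already be imported."""
    module = module_name or package
    if importlib.util.find_spec(module) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# A stdlib module is already importable, so this is a no-op.
ensure_installed("json")
```

The benefit is that heavy optional dependencies (Selenium, undetected-chromedriver) are only pulled in when the code path that needs them actually runs.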

Usage Examples

1. Simple Text and Image Scraping

Fetch a static page and extract all paragraph texts and image URLs.

from scrapemaster import ScrapeMaster

# Initialize with the target URL
scraper = ScrapeMaster('https://example.com')

# Scrape text from <p> tags and image URLs from <img> tags
results = scraper.scrape_all(
    text_selectors=['p'],
    image_selectors=['img']
)

if results:
    print("--- Texts ---")
    for text in results['texts']:
        print(f"- {text}")
        
    print("\n--- Image URLs ---")
    for url in results['image_urls']:
        print(f"- {url}")

2. Scraping a JavaScript-Rendered Page

ScrapeMaster will automatically switch to a browser-based strategy if requests fails or is blocked.

from scrapemaster import ScrapeMaster

# This URL likely requires JavaScript to load its content
url = "https://quotes.toscrape.com/js/"
scraper = ScrapeMaster(url)

# The 'auto' strategy will try requests, then selenium, then undetected
# to ensure content is loaded.
results = scraper.scrape_all(text_selectors=['.text', '.author'])

if results:
    for text in results['texts']:
        print(text)

print(f"\nSuccessfully used strategy: {scraper.last_strategy_used}")

3. Converting an Article to Clean Markdown

Extract the main content of a blog post or documentation page and save it as Markdown.

from scrapemaster import ScrapeMaster

url = "https://www.scrapethissite.com/pages/simple/"
scraper = ScrapeMaster(url)

# This method focuses on finding the main content and cleaning it
markdown_content = scraper.scrape_markdown()

if markdown_content:
    print(markdown_content)
    # You can save this to a file
    # with open('article.md', 'w', encoding='utf-8') as f:
    #     f.write(markdown_content)
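The noise-removal step that `scrape_markdown()` performs can be sketched with the stdlib HTML parser: skip boilerplate containers and emit Markdown for the tags you keep. This is a simplified illustration of the technique, not the library's implementation:

```python
from html.parser import HTMLParser

NOISE_TAGS = {"header", "footer", "nav", "script", "style", "aside"}

class MarkdownExtractor(HTMLParser):
    """Collect heading and paragraph text, skipping common boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0   # >0 while inside a noise tag
        self._current = None   # tag whose text we are collecting

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in ("h1", "h2", "p"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        prefix = {"h1": "# ", "h2": "## ", "p": ""}[self._current]
        self.parts.append(prefix + text)

page = "<header>menu</header><h1>Title</h1><p>Body text.</p><footer>c</footer>"
parser = MarkdownExtractor()
parser.feed(page)
markdown = "\n\n".join(parser.parts)  # "# Title\n\nBody text."
```

A production converter also needs tables, links, lists, and nested markup, which is why dedicated content-extraction logic is worth having in a library.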

4. Crawling a Website and Downloading Images

Crawl the first two levels of a website, aggregate all text, and download all found images.

from scrapemaster import ScrapeMaster

url = "https://blog.scrapinghub.com/"
scraper = ScrapeMaster(url)

# Crawl up to 1 level deep (start page + links on it)
# and download all images to 'scraped_images' directory.
results = scraper.scrape_all(
    max_depth=1,
    crawl_delay=1,  # 1-second delay between page requests
    download_images_output_dir='scraped_images'
)

if results:
    print(f"Successfully visited {len(results['visited_urls'])} pages.")
    print(f"Found {len(results['texts'])} text fragments.")
    print(f"Found and downloaded {len(results['image_urls'])} unique images.")
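The `max_depth` and domain-restriction semantics can be illustrated with a breadth-first traversal over a mock link graph (the URLs and the `LINKS` table below are made up for the sketch; a real crawler would fetch and parse each page):

```python
from collections import deque

# A mock site: each URL maps to the links found on that page (illustrative).
LINKS = {
    "https://blog.example.com/": ["https://blog.example.com/post1",
                                  "https://other.example.org/"],
    "https://blog.example.com/post1": ["https://blog.example.com/post2"],
    "https://blog.example.com/post2": [],
}

def crawl(start, max_depth, allowed_prefix):
    """Breadth-first crawl: visit pages up to max_depth link-hops from start,
    skipping URLs outside the allowed domain prefix."""
    visited = []
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("https://blog.example.com/", 1, "https://blog.example.com/")
```

With `max_depth=1` only the start page and pages it links to directly are visited; the off-domain link and the depth-2 page are both skipped, which is what keeps a crawl bounded.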

5. Scraping YouTube Transcripts

Retrieve transcripts from YouTube videos. You can list available languages and fetch the transcript text (preferring manually created ones over auto-generated).

from scrapemaster import ScrapeMaster

scraper = ScrapeMaster()
video_url = "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# 1. List available languages
languages = scraper.get_youtube_languages(video_url)
if languages:
    print("Available Languages:")
    for lang in languages:
        print(f"- {lang['code']}: {lang['name']} ({'Generated' if lang['is_generated'] else 'Manual'})")

# 2. Fetch the transcript (Auto-detects best available, or pass language_code='en')
transcript = scraper.scrape_youtube_transcript(video_url)

if transcript:
    print("\n--- Transcript Preview ---")
    print(transcript[:500] + "...") 

Core Concepts

ScrapeMaster's power comes from its layered, fallback-driven approach. When you request data, it follows a strategy order (default is ['requests', 'selenium', 'undetected']):

  1. Requests: The fastest method. It makes a simple HTTP GET request. If it receives a successful HTML response and doesn't detect a blocker, it succeeds.
  2. Selenium: If requests fails (e.g., due to a 403 error or a blocker page), ScrapeMaster launches a standard Selenium-controlled Chrome browser to render the page, executing JavaScript.
  3. Undetected-Chromedriver: If standard Selenium is also blocked, it escalates to undetected-chromedriver, which is patched to be much harder for services like Cloudflare to detect.

This "auto" mode balances speed and reliability: the cheap method runs first, and heavier browser-based strategies are launched only when needed. You can also force a specific strategy if you know what the target site requires.
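The fallback order described above amounts to a try-in-sequence loop. A sketch of the pattern, where the fetchers are simple stand-ins rather than the library's real backends:

```python
# Illustrative fetchers standing in for requests/selenium/undetected backends.
def fetch_requests(url):
    raise RuntimeError("403: blocked")      # simulate an anti-bot block

def fetch_selenium(url):
    return "<html>rendered content</html>"  # simulate a successful render

def fetch_undetected(url):
    return "<html>rendered content</html>"

STRATEGIES = [("requests", fetch_requests),
              ("selenium", fetch_selenium),
              ("undetected", fetch_undetected)]

def fetch_with_fallback(url):
    """Try each strategy in order; return the first success and its name."""
    errors = {}
    for name, fetcher in STRATEGIES:
        try:
            return fetcher(url), name
        except Exception as exc:
            errors[name] = str(exc)  # record why this strategy failed
    raise RuntimeError(f"All strategies failed: {errors}")

content, used = fetch_with_fallback("https://example.com")
```

Recording per-strategy errors before escalating is also what enables the "clear feedback on which strategies failed and why" that the feature list promises.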

🤝 Contributing

Contributions are welcome! If you have ideas for new features, bug fixes, or improvements, please feel free to:

  1. Open an issue to discuss the change.
  2. Fork the repository and create a new branch.
  3. Submit a pull request with a clear description of your changes.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

👤 Author

ScrapeMaster is developed and maintained by ParisNeo.
