# ScrapeMaster
A powerful and versatile Python library for web scraping, designed to handle everything from simple static pages to complex, JavaScript-heavy websites with advanced anti-bot measures.
## 🚀 Overview
ScrapeMaster is a comprehensive Python library that simplifies the complexities of web scraping. It intelligently switches between multiple scraping strategies—from simple requests to browser automation with Selenium and undetected-chromedriver—to ensure you get the data you need, when you need it.
Whether you're extracting text, downloading images, converting articles to clean Markdown, crawling entire websites, or even fetching YouTube transcripts, ScrapeMaster provides a unified and powerful API to handle it all.
## ✨ Key Features

- **Multi-Strategy Scraping**: Automatically tries different methods (`requests`, Selenium, `undetected-chromedriver`) to bypass anti-bot measures and handle JavaScript-rendered content.
- **Content-to-Markdown**: Intelligently extracts the main content from a webpage, removes noise (headers, footers, ads), and converts it into clean, readable Markdown.
- **Lightweight Document Parsing**: Native support for scraping text from PDF and DOCX files using `pypdf` and `python-docx`, with no heavy external dependencies.
- **YouTube Transcripts**: Built-in support for fetching video transcripts (manual or auto-generated) via the `youtube-transcript-api`.
- **Comprehensive Data Extraction**: Easily scrape text, images, and other structured data using CSS selectors.
- **Website Crawler**: Recursively scrape an entire website by following links up to a specified depth, with domain restrictions to keep the crawl focused.
- **Anti-Bot Circumvention**: Uses `undetected-chromedriver` and rotates user agents to appear more like a human user and avoid common blockers.
- **Session & Cookie Management**: Persist sessions across requests by saving and loading cookies for both `requests` and Selenium.
- **Image Downloader**: A built-in utility to download all scraped images to a local directory.
- **Robust Error Handling**: Gracefully manages failures, providing clear feedback on which strategies failed and why.
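The user-agent rotation mentioned above boils down to varying the `User-Agent` header per request. Here is a minimal stand-alone sketch of the idea (the agent strings and helper name are illustrative, not ScrapeMaster's internals):

```python
import random

# A small pool of desktop user-agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def rotated_headers():
    """Build request headers with a randomly chosen user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = rotated_headers()
print(headers["User-Agent"])
```

Picking a fresh agent per request makes simple fingerprinting by a fixed `User-Agent` string less effective, though it is only one layer of the anti-bot toolkit.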
## 📦 Installation

You can install ScrapeMaster directly from PyPI:

```bash
pip install ScrapeMaster
```

The library uses `pipmaster` to automatically manage and install its dependencies (such as `requests`, `selenium`, and `youtube-transcript-api`) on first use, ensuring a smooth setup process.
## Usage Examples
### 1. Simple Text and Image Scraping

Fetch a static page and extract all paragraph texts and image URLs.

```python
from scrapemaster import ScrapeMaster

# Initialize with the target URL
scraper = ScrapeMaster('https://example.com')

# Scrape text from <p> tags and image URLs from <img> tags
results = scraper.scrape_all(
    text_selectors=['p'],
    image_selectors=['img']
)

if results:
    print("--- Texts ---")
    for text in results['texts']:
        print(f"- {text}")

    print("\n--- Image URLs ---")
    for url in results['image_urls']:
        print(f"- {url}")
```
### 2. Scraping a JavaScript-Rendered Page

ScrapeMaster will automatically switch to a browser-based strategy if `requests` fails or is blocked.

```python
from scrapemaster import ScrapeMaster

# This URL likely requires JavaScript to load its content
url = "https://quotes.toscrape.com/js/"
scraper = ScrapeMaster(url)

# The 'auto' strategy will try requests, then selenium, then undetected
# to ensure content is loaded.
results = scraper.scrape_all(text_selectors=['.text', '.author'])

if results:
    for text in results['texts']:
        print(text)
    print(f"\nSuccessfully used strategy: {scraper.last_strategy_used}")
```
### 3. Converting an Article to Clean Markdown

Extract the main content of a blog post or documentation page and save it as Markdown.

```python
from scrapemaster import ScrapeMaster

url = "https://www.scrapethissite.com/pages/simple/"
scraper = ScrapeMaster(url)

# This method focuses on finding the main content and cleaning it
markdown_content = scraper.scrape_markdown()

if markdown_content:
    print(markdown_content)
    # You can save this to a file:
    # with open('article.md', 'w', encoding='utf-8') as f:
    #     f.write(markdown_content)
```
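The noise-removal step behind content-to-Markdown conversion can be sketched with nothing but the standard library: skip everything inside common "noise" containers and keep the rest. This is a toy illustration of the idea, not ScrapeMaster's actual extraction logic:

```python
from html.parser import HTMLParser

# Containers whose contents are usually navigation or chrome, not article text
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect text chunks while skipping noise containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a noise container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <header>Site menu</header>
  <article><h1>Title</h1><p>Main article text.</p></article>
  <footer>Copyright</footer>
</body></html>
"""
parser = MainTextExtractor()
parser.feed(html)
print(parser.chunks)  # text from <header> and <footer> is gone
```

A real extractor also scores candidate containers (text density, link ratio) before converting the winner to Markdown, but the skip-the-noise principle is the same.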
### 4. Crawling a Website and Downloading Images

Crawl the first two levels of a website, aggregate all text, and download all found images.

```python
from scrapemaster import ScrapeMaster

url = "https://blog.scrapinghub.com/"
scraper = ScrapeMaster(url)

# Crawl up to 1 level deep (start page + links on it)
# and download all images to the 'scraped_images' directory.
results = scraper.scrape_all(
    max_depth=1,
    crawl_delay=1,  # 1-second delay between page requests
    download_images_output_dir='scraped_images'
)

if results:
    print(f"Successfully visited {len(results['visited_urls'])} pages.")
    print(f"Found {len(results['texts'])} text fragments.")
    print(f"Found and downloaded {len(results['image_urls'])} unique images.")
```
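The depth-limited, domain-restricted crawl used above can be sketched in plain Python. A toy link graph stands in for real pages, and the function names are illustrative, not ScrapeMaster's internals:

```python
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=1):
    """Breadth-first crawl limited to max_depth and the start domain.

    fetch_links(url) -> list of absolute URLs found on that page.
    """
    domain = urlparse(start_url).netloc
    visited = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or urlparse(url).netloc != domain:
            continue  # skip duplicates and off-domain links
        visited.add(url)
        if depth < max_depth:
            for link in fetch_links(url):
                frontier.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages
graph = {
    "https://a.com/": ["https://a.com/x", "https://b.com/"],
    "https://a.com/x": ["https://a.com/y"],
    "https://a.com/y": [],
}
pages = crawl("https://a.com/", lambda u: graph.get(u, []), max_depth=1)
print(sorted(pages))  # off-domain b.com and depth-2 /y are excluded
```

With `max_depth=1` the crawl visits the start page and its direct links only, and the domain check keeps it from wandering off-site, which is exactly what keeps a real crawl focused.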
### 5. Scraping YouTube Transcripts

Retrieve transcripts from YouTube videos. You can list the available languages and fetch the transcript text (preferring manually created transcripts over auto-generated ones).

```python
from scrapemaster import ScrapeMaster

scraper = ScrapeMaster()
video_url = "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# 1. List available languages
languages = scraper.get_youtube_languages(video_url)
if languages:
    print("Available Languages:")
    for lang in languages:
        print(f"- {lang['code']}: {lang['name']} ({'Generated' if lang['is_generated'] else 'Manual'})")

# 2. Fetch the transcript (auto-detects the best available, or pass language_code='en')
transcript = scraper.scrape_youtube_transcript(video_url)
if transcript:
    print("\n--- Transcript Preview ---")
    print(transcript[:500] + "...")
```
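The "manual over auto-generated" preference can be expressed as a simple selection rule. A minimal sketch, assuming transcript entries shaped like the language listing above (the helper name is hypothetical):

```python
def pick_transcript(available, preferred_code=None):
    """Pick a transcript entry, preferring manual over auto-generated.

    available: list of dicts like {'code': 'en', 'is_generated': False}.
    """
    candidates = available
    if preferred_code is not None:
        candidates = [t for t in available if t['code'] == preferred_code]
    # False sorts before True, so manual tracks come first
    candidates = sorted(candidates, key=lambda t: t['is_generated'])
    return candidates[0] if candidates else None

tracks = [
    {'code': 'en', 'is_generated': True},
    {'code': 'en', 'is_generated': False},
    {'code': 'fr', 'is_generated': True},
]
print(pick_transcript(tracks))        # the manual English track wins
print(pick_transcript(tracks, 'fr'))  # only a generated French track exists
```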
## Core Concepts

ScrapeMaster's power comes from its layered, fallback-driven approach. When you request data, it follows a strategy order (the default is `['requests', 'selenium', 'undetected']`):

- **Requests**: The fastest method. It makes a simple HTTP GET request. If it receives a successful HTML response and doesn't detect a blocker, it succeeds.
- **Selenium**: If `requests` fails (e.g., due to a 403 error or a blocker page), ScrapeMaster launches a standard Selenium-controlled Chrome browser to render the page, executing JavaScript.
- **Undetected-Chromedriver**: If standard Selenium is also blocked, it escalates to `undetected-chromedriver`, which is patched to be much harder for services like Cloudflare to detect.

This "auto" mode ensures the highest chance of success with optimal performance. You can also force a specific strategy if you know what the target site requires.
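The fallback order described above amounts to a try-in-sequence loop that records why each strategy failed. Here is a minimal stand-alone sketch (stub functions stand in for the real `requests`/Selenium/undetected-chromedriver fetchers; names are illustrative, not ScrapeMaster's internals):

```python
def fetch_with_fallback(url, strategies):
    """Try each (name, fetch) pair in order; return the first success."""
    errors = {}
    for name, fetch in strategies:
        try:
            html = fetch(url)
            if html:  # a real implementation would also detect blocker pages here
                return name, html
            errors[name] = "empty or blocked response"
        except Exception as exc:
            errors[name] = str(exc)  # record why this strategy failed
    raise RuntimeError(f"All strategies failed: {errors}")

# Stub strategies standing in for the real fetchers
def plain_requests(url):
    raise ConnectionError("403 Forbidden")  # pretend the site blocks plain HTTP

def selenium_browser(url):
    return "<html>rendered content</html>"  # pretend the browser succeeds

strategy, html = fetch_with_fallback(
    "https://example.com",
    [("requests", plain_requests), ("selenium", selenium_browser)],
)
print(strategy)  # the first strategy that worked
```

Keeping the per-strategy error messages is what makes the "clear feedback on which strategies failed and why" behavior possible when every strategy is exhausted.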
## 🤝 Contributing
Contributions are welcome! If you have ideas for new features, bug fixes, or improvements, please feel free to:
- Open an issue to discuss the change.
- Fork the repository and create a new branch.
- Submit a pull request with a clear description of your changes.
## 📜 License
This project is licensed under the MIT License. See the LICENSE file for details.
## 👤 Author
ScrapeMaster is developed and maintained by ParisNeo.
- GitHub: @ParisNeo