# ScrapeMaster
A powerful and versatile Python library for web scraping, designed to handle everything from simple static pages to complex, JavaScript-heavy websites with advanced anti-bot measures.
## 🚀 Overview
ScrapeMaster is a comprehensive Python library that simplifies the complexities of web scraping. It intelligently switches between multiple scraping strategies—from simple requests to browser automation with Selenium and undetected-chromedriver—to ensure you get the data you need, when you need it.
Whether you're extracting text, downloading images, converting articles to clean Markdown, crawling entire websites, or even fetching YouTube transcripts, ScrapeMaster provides a unified and powerful API to handle it all.
## ✨ Key Features

- **Multi-Strategy Scraping**: Automatically tries different methods (`requests`, Selenium, `undetected-chromedriver`) to bypass anti-bot measures and handle JavaScript-rendered content.
- **Content-to-Markdown**: Intelligently extracts the main content from a webpage, removes noise (headers, footers, ads), and converts it into clean, readable Markdown.
- **Lightweight Document Parsing**: Native support for scraping text from PDF and DOCX files using `pypdf` and `python-docx`, with no heavy external dependencies.
- **YouTube Transcripts**: Built-in support for fetching video transcripts (manual or auto-generated) via the `youtube-transcript-api`.
- **Comprehensive Data Extraction**: Easily scrape text, images, and other structured data using CSS selectors.
- **Website Crawler**: Recursively scrape an entire website by following links up to a specified depth, with domain restrictions to keep the crawl focused.
- **Anti-Bot Circumvention**: Uses `undetected-chromedriver` and rotates user agents to appear more like a human user and avoid common blockers.
- **Session & Cookie Management**: Persist sessions across requests by saving and loading cookies for both `requests` and Selenium.
- **Image Downloader**: A built-in utility to download all scraped images to a local directory.
- **Robust Error Handling**: Gracefully manages failures, providing clear feedback on which strategies failed and why.
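The user-agent rotation mentioned above boils down to varying the `User-Agent` header per request. Here is a minimal stand-alone sketch of the idea (the agent strings and helper name are illustrative, not ScrapeMaster's internals):

```python
import random

# A small pool of desktop user-agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def rotated_headers():
    """Build request headers with a randomly chosen user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = rotated_headers()
print(headers["User-Agent"])
```

Picking a fresh agent per request makes simple fingerprinting by a fixed `User-Agent` string less effective, though it is only one layer of the anti-bot toolkit.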
## 📦 Installation

You can install ScrapeMaster directly from PyPI:

```bash
pip install ScrapeMaster
```

The library uses `pipmaster` to automatically manage and install its dependencies (such as `requests`, `selenium`, and `youtube-transcript-api`) on first use, ensuring a smooth setup process.
## Usage Examples
### 1. Simple Text and Image Scraping

Fetch a static page and extract all paragraph texts and image URLs.

```python
from scrapemaster import ScrapeMaster

# Initialize with the target URL
scraper = ScrapeMaster('https://example.com')

# Scrape text from <p> tags and image URLs from <img> tags
results = scraper.scrape_all(
    text_selectors=['p'],
    image_selectors=['img']
)

if results:
    print("--- Texts ---")
    for text in results['texts']:
        print(f"- {text}")

    print("\n--- Image URLs ---")
    for url in results['image_urls']:
        print(f"- {url}")
```
### 2. Scraping a JavaScript-Rendered Page

ScrapeMaster will automatically switch to a browser-based strategy if `requests` fails or is blocked.

```python
from scrapemaster import ScrapeMaster

# This URL likely requires JavaScript to load its content
url = "https://quotes.toscrape.com/js/"
scraper = ScrapeMaster(url)

# The 'auto' strategy will try requests, then selenium, then undetected
# to ensure content is loaded.
results = scraper.scrape_all(text_selectors=['.text', '.author'])

if results:
    for text in results['texts']:
        print(text)
    print(f"\nSuccessfully used strategy: {scraper.last_strategy_used}")
```
### 3. Converting an Article to Clean Markdown

Extract the main content of a blog post or documentation page and save it as Markdown.

```python
from scrapemaster import ScrapeMaster

url = "https://www.scrapethissite.com/pages/simple/"
scraper = ScrapeMaster(url)

# This method focuses on finding the main content and cleaning it
markdown_content = scraper.scrape_markdown()

if markdown_content:
    print(markdown_content)
    # You can save this to a file:
    # with open('article.md', 'w', encoding='utf-8') as f:
    #     f.write(markdown_content)
```
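The noise-removal step behind content-to-Markdown conversion can be sketched with nothing but the standard library: skip everything inside common "noise" containers and keep the rest. This is a toy illustration of the idea, not ScrapeMaster's actual extraction logic:

```python
from html.parser import HTMLParser

# Containers whose contents are usually navigation or chrome, not article text
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect text chunks while skipping noise containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a noise container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <header>Site menu</header>
  <article><h1>Title</h1><p>Main article text.</p></article>
  <footer>Copyright</footer>
</body></html>
"""
parser = MainTextExtractor()
parser.feed(html)
print(parser.chunks)  # text from <header> and <footer> is gone
```

A real extractor also scores candidate containers (text density, link ratio) before converting the winner to Markdown, but the skip-the-noise principle is the same.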
### 4. Crawling a Website and Downloading Images

Crawl the first two levels of a website, aggregate all text, and download all found images.

```python
from scrapemaster import ScrapeMaster

url = "https://blog.scrapinghub.com/"
scraper = ScrapeMaster(url)

# Crawl up to 1 level deep (start page + links on it)
# and download all images to the 'scraped_images' directory.
results = scraper.scrape_all(
    max_depth=1,
    crawl_delay=1,  # 1-second delay between page requests
    download_images_output_dir='scraped_images'
)

if results:
    print(f"Successfully visited {len(results['visited_urls'])} pages.")
    print(f"Found {len(results['texts'])} text fragments.")
    print(f"Found and downloaded {len(results['image_urls'])} unique images.")
```
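The depth-limited, domain-restricted crawl used above can be sketched in plain Python. A toy link graph stands in for real pages, and the function names are illustrative, not ScrapeMaster's internals:

```python
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=1):
    """Breadth-first crawl limited to max_depth and the start domain.

    fetch_links(url) -> list of absolute URLs found on that page.
    """
    domain = urlparse(start_url).netloc
    visited = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or urlparse(url).netloc != domain:
            continue  # skip duplicates and off-domain links
        visited.add(url)
        if depth < max_depth:
            for link in fetch_links(url):
                frontier.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages
graph = {
    "https://a.com/": ["https://a.com/x", "https://b.com/"],
    "https://a.com/x": ["https://a.com/y"],
    "https://a.com/y": [],
}
pages = crawl("https://a.com/", lambda u: graph.get(u, []), max_depth=1)
print(sorted(pages))  # off-domain b.com and depth-2 /y are excluded
```

With `max_depth=1` the crawl visits the start page and its direct links only, and the domain check keeps it from wandering off-site, which is exactly what keeps a real crawl focused.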
### 5. Scraping YouTube Transcripts

Retrieve transcripts from YouTube videos. You can list the available languages and fetch the transcript text (preferring manually created transcripts over auto-generated ones).

```python
from scrapemaster import ScrapeMaster

scraper = ScrapeMaster()
video_url = "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# 1. List available languages
languages = scraper.get_youtube_languages(video_url)
if languages:
    print("Available Languages:")
    for lang in languages:
        print(f"- {lang['code']}: {lang['name']} ({'Generated' if lang['is_generated'] else 'Manual'})")

# 2. Fetch the transcript (auto-detects the best available, or pass language_code='en')
transcript = scraper.scrape_youtube_transcript(video_url)
if transcript:
    print("\n--- Transcript Preview ---")
    print(transcript[:500] + "...")
```
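The "manual over auto-generated" preference can be expressed as a simple selection rule. A minimal sketch, assuming transcript entries shaped like the language listing above (the helper name is hypothetical):

```python
def pick_transcript(available, preferred_code=None):
    """Pick a transcript entry, preferring manual over auto-generated.

    available: list of dicts like {'code': 'en', 'is_generated': False}.
    """
    candidates = available
    if preferred_code is not None:
        candidates = [t for t in available if t['code'] == preferred_code]
    # False sorts before True, so manual tracks come first
    candidates = sorted(candidates, key=lambda t: t['is_generated'])
    return candidates[0] if candidates else None

tracks = [
    {'code': 'en', 'is_generated': True},
    {'code': 'en', 'is_generated': False},
    {'code': 'fr', 'is_generated': True},
]
print(pick_transcript(tracks))        # the manual English track wins
print(pick_transcript(tracks, 'fr'))  # only a generated French track exists
```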
## Core Concepts

ScrapeMaster's power comes from its layered, fallback-driven approach. When you request data, it follows a strategy order (the default is `['requests', 'selenium', 'undetected']`):

- **Requests**: The fastest method. It makes a simple HTTP GET request. If it receives a successful HTML response and doesn't detect a blocker, it succeeds.
- **Selenium**: If `requests` fails (e.g., due to a 403 error or a blocker page), ScrapeMaster launches a standard Selenium-controlled Chrome browser to render the page, executing JavaScript.
- **Undetected-Chromedriver**: If standard Selenium is also blocked, it escalates to `undetected-chromedriver`, which is patched to be much harder for services like Cloudflare to detect.

This "auto" mode ensures the highest chance of success with optimal performance. You can also force a specific strategy if you know what the target site requires.
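The fallback order described above amounts to a try-in-sequence loop that records why each strategy failed. Here is a minimal stand-alone sketch (stub functions stand in for the real `requests`/Selenium/undetected-chromedriver fetchers; names are illustrative, not ScrapeMaster's internals):

```python
def fetch_with_fallback(url, strategies):
    """Try each (name, fetch) pair in order; return the first success."""
    errors = {}
    for name, fetch in strategies:
        try:
            html = fetch(url)
            if html:  # a real implementation would also detect blocker pages here
                return name, html
            errors[name] = "empty or blocked response"
        except Exception as exc:
            errors[name] = str(exc)  # record why this strategy failed
    raise RuntimeError(f"All strategies failed: {errors}")

# Stub strategies standing in for the real fetchers
def plain_requests(url):
    raise ConnectionError("403 Forbidden")  # pretend the site blocks plain HTTP

def selenium_browser(url):
    return "<html>rendered content</html>"  # pretend the browser succeeds

strategy, html = fetch_with_fallback(
    "https://example.com",
    [("requests", plain_requests), ("selenium", selenium_browser)],
)
print(strategy)  # the first strategy that worked
```

Keeping the per-strategy error messages is what makes the "clear feedback on which strategies failed and why" behavior possible when every strategy is exhausted.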
## 🤝 Contributing
Contributions are welcome! If you have ideas for new features, bug fixes, or improvements, please feel free to:
- Open an issue to discuss the change.
- Fork the repository and create a new branch.
- Submit a pull request with a clear description of your changes.
## 📜 License
This project is licensed under the MIT License. See the LICENSE file for details.
## 👤 Author
ScrapeMaster is developed and maintained by ParisNeo.
- GitHub: @ParisNeo