ScrapeSome
A powerful web scraper with dynamic rendering support.
ScrapeSome is a lightweight, flexible web scraping library with both synchronous and asynchronous support. It includes intelligent fallbacks, JavaScript page rendering, response formatting (HTML → Text/JSON/Markdown), and retry mechanisms. Ideal for developers who need robust scraping utilities with minimal setup.
Table of Contents
- 💡 Why Use ScrapeSome?
- 🚀 Features
- ✅ Comparison with Alternatives
- 📦 Installation
- Playwright Setup
- ⚡ Quick Start
- 🖥️ CLI Usage
- 🧰 Advanced Usage
- 🧪 Testing
- ⚙️ Environment Configuration
- 📄 Output Formats
- 📁 Project Structure
- 📜 License
- 🤝 Contributions
💡 Why Use ScrapeSome?
- Handles both static and JS-heavy pages out of the box
- Supports both sync and async scraping
- Converts raw HTML into clean text, JSON, or Markdown
- Works with minimal configuration (`pip install scrapesome`)
- Handles timeouts, retries, redirects, user agents
🚀 Features
- 🔄 Sync + Async scraping support
- 🔁 Automatic retries and intelligent fallbacks
- 🧪 Playwright rendering fallback for JS-heavy pages
- 📄 Format responses as raw HTML, plain text, Markdown, or structured JSON
- ⚙️ Configurable: timeouts, redirects, user agents, and logging
- 🧪 Test coverage with `pytest` and `pytest-asyncio`
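To make the "automatic retries" feature concrete, here is a minimal retry-with-exponential-backoff sketch of the kind a scraper applies around each fetch. `retry` and `flaky` are illustrative names, not part of ScrapeSome's API:

```python
import time

def retry(operation, retries=3, backoff=0.01):
    """Call operation(), retrying with exponential backoff on failure."""
    last_err = None
    for attempt in range(retries):
        try:
            return operation()
        except Exception as err:
            last_err = err
            time.sleep(backoff * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...
    raise last_err

# Demo with a deliberately flaky operation that succeeds on the third call:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "page content"

result = retry(flaky)  # succeeds after two retries
```

In a real scraper the `operation` would be the HTTP fetch, and the exception filter would be narrowed to transient network errors.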
✅ Comparison with Alternatives
| Feature | ScrapeSome ✅ | Playwright (Python) | Selenium + UC | Requests-HTML | Scrapy + Playwright |
|---|---|---|---|---|---|
| 🧠 JS Rendering Support | ✅ Auto fallback on 403/JS content | ✅ Always (manual control) | ✅ Always (manual control) | ✅ Partial (via Pyppeteer) | ✅ Requires setup |
| 🔁 Automatic Fallback (403/Blank) | ✅ Yes (seamless) | ❌ Manual logic needed | ❌ Manual logic needed | ❌ No | ❌ Needs per-request config |
| 🌐 Uses Browser Engine | ✅ Only when needed (Playwright) | ✅ Always | ✅ Always | ✅ (Unstable, slow) | ✅ Always (if enabled) |
| ✅ Sync + Async Support | ✅ Built-in | ❌ Async only | ❌ Manual (via threading) | ❌ Sync only | ❌ Async only (via plugin) |
| 📄 JSON/Markdown/HTML Output | ✅ Built-in formats | ❌ Manual parsing | ❌ Manual parsing | ✅ Basic only | ❌ Custom pipeline needed |
| ⚡ Minimal Setup | ✅ Near zero | ❌ Code + browser install | ❌ Driver + setup | ✅ Simple pip install | ❌ Complex + plugin setup |
| 🔁 Retries, Timeouts, Agents | ✅ Smart defaults built-in | ❌ Manual handling | ❌ Manual handling | ❌ Limited | ⚠️ Partial via settings |
| 🧪 Pytest-Ready Out-of-the-box | ✅ Fully testable | ⚠️ Requires mocks | ❌ Hard to test | ❌ Minimal | ⚠️ Needs testing harness |
| ⚙️ Config via .env / Inline | ✅ Flexible and optional | ❌ Code/config only | ❌ Manual via code | ❌ Hardcoded mostly | ⚠️ Project settings |
| 📦 Install & Run in <1 Min | ✅ Yes | ❌ Setup required | ❌ Driver + config needed | ✅ Yes | ❌ Needs project + plugin |
📦 Installation
```shell
pip install scrapesome
```
Playwright Setup
ScrapeSome uses Playwright for JavaScript rendering fallback. To enable this, you need to install Playwright and its dependencies.
1. Install the Playwright Python package (if not already installed):
```shell
pip install playwright
```
2. Install the Playwright browsers:
```shell
playwright install
```
3. Install system dependencies
Playwright requires some system libraries to run browsers, which vary by operating system.
For Windows: Playwright installs everything you need automatically with `playwright install`, so no additional setup is usually required.
For Linux (Ubuntu/Debian): run the following command to install the required system libraries:
```shell
playwright install-deps
```
If the `playwright` CLI is not available, you can install the dependencies manually:
```shell
sudo apt-get update
sudo apt-get install -y libwoff1 libopus0 libwebp6 libharfbuzz-icu0 libwebpmux3 \
    libenchant-2-2 libhyphen0 libegl1 libglx0 libgudev-1.0-0 \
    libevdev2 libgles2 libx264-160
```
Note: package names may vary depending on your distribution and version.
For macOS: install the required libraries using Homebrew:
```shell
brew install harfbuzz enchant
```
After this setup, you should be able to use ScrapeSome with full Playwright rendering support!
⚡ Quick Start
Synchronous Example
```python
from scrapesome import sync_scraper

html = sync_scraper("https://example.com")
print(html)
```
Asynchronous Example
```python
import asyncio
from scrapesome import async_scraper

html = asyncio.run(async_scraper("https://example.com"))
print(html)
```
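Since the async API is awaitable, multiple URLs can be scraped concurrently with `asyncio.gather`. The sketch below uses a stand-in coroutine (`fake_scrape`) so it runs without network access; in real use you would await `async_scraper` instead:

```python
import asyncio

async def fake_scrape(url: str) -> str:
    """Stand-in for async_scraper(url); returns a dummy page."""
    await asyncio.sleep(0)  # yield control, as a real fetch would
    return f"<html>{url}</html>"

async def scrape_all(urls):
    # Launch all scrapes concurrently; results come back in input order
    return await asyncio.gather(*(fake_scrape(u) for u in urls))

results = asyncio.run(scrape_all(["https://a.example", "https://b.example"]))
```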
🖥️ CLI Usage
ScrapeSome also includes a powerful CLI for quick and easy scraping from the command line.
📦 Installation with CLI Support
To use the CLI, install with the optional `cli` extras:
```shell
pip install scrapesome[cli]
```
🔧 Basic Usage
```shell
scrapesome scrape --url https://example.com
```
This performs a synchronous scrape and outputs plain text by default.
⚙️ Available Options
| Option | Description | Default |
|---|---|---|
| `--async-mode` | Use asynchronous scraping | `False` |
| `--force-playwright` | Force JavaScript rendering using Playwright | `False` |
| `--output-format` | Choose `text`, `json`, `markdown`, or `html` | `html` |
Examples
Basic scrape:
```shell
scrapesome scrape --url https://example.com
```
Force Playwright rendering:
```shell
scrapesome scrape --url https://example.com --force-playwright
```
Get JSON output:
```shell
scrapesome scrape --url https://example.com --output-format json
```
Async scrape with Markdown output:
```shell
scrapesome scrape --url https://example.com --async-mode --output-format markdown
```
📄 File Saving
ScrapeSome lets you format and save scraped content with zero hassle, both via the CLI and in Python code.
💻 Save Output to File
Use these flags to save output directly from the command line:
- `--save-to-file` or `-s`: enable saving to a file
- `--file-name` or `-n`: desired filename (extension added automatically)
- `--output-format` or `-f`: one of `html`, `text`, `markdown`, or `json`
⚠️ Note: when saving to a file, only one URL can be scraped at a time.
📦 Example:
```shell
scrapesome scrape "https://example.com" \
    --output-format markdown \
    --save-to-file \
    --file-name output
```
📄 This saves the result as `output.md`.
Save Output in Code
The `sync_scraper` function supports saving to a file via two optional arguments:
- `save_to_file=True`: enables saving
- `file_name="your_file_name"`: sets the base filename (extension inferred from the format)
The output is returned as a dictionary:
```python
{
    "data": "<formatted content>",
    "file": "your_file_name.<ext>"  # present when saving is enabled
}
```
🐍 Example:
```python
result = sync_scraper(
    url="https://example.com",
    output_format_type="json",
    save_to_file=True,
    file_name="example_output",
)

print(f"Saved output to {result['file']}")
```
Now you're set to save clean, readable data in your preferred formatโprogrammatically or from the CLI.
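Under the hood, saving implies mapping each output format to a file extension. The helper below is a hypothetical sketch of that behavior; the `EXTENSIONS` mapping and `save_output` are assumptions for illustration, not ScrapeSome internals:

```python
import json
import tempfile
from pathlib import Path

# Assumed format -> extension mapping, inferred from the docs above
EXTENSIONS = {"html": "html", "text": "txt", "markdown": "md", "json": "json"}

def save_output(data, file_name, output_format):
    """Write formatted data to file_name.<ext> and return the result dict shape."""
    path = Path(f"{file_name}.{EXTENSIONS[output_format]}")
    text = json.dumps(data) if output_format == "json" else str(data)
    path.write_text(text, encoding="utf-8")
    return {"data": data, "file": str(path)}

# Demo in a throwaway directory:
base = Path(tempfile.mkdtemp()) / "example_output"
result = save_output({"title": "Example Domain"}, base, "json")
```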
🧰 Advanced Usage
Force Rendering (Playwright)
```python
from scrapesome import sync_scraper

content = sync_scraper("https://example.com", force_playwright=True)
print(content)
```
Custom User Agents
```python
from scrapesome import sync_scraper

content = sync_scraper("https://example.com", user_agents=["MyCustomAgent/1.0"])
print(content)
```
Control Redirects
```python
from scrapesome import sync_scraper

content = sync_scraper("https://example.com", allow_redirects=False)
print(content)
```
Similarly, `async_scraper` accepts the same options.
🧪 Testing
Run the tests with:
```shell
pytest --cov=scrapesome tests/
```
Target coverage: 75–100%
⚙️ Environment Configuration
ScrapeSome reads settings from environment variables if a `.env` file is present.
Example `.env`:
```shell
LOG_LEVEL=INFO
OUTPUT_FORMAT=text
FETCH_PLAYWRIGHT_TIMEOUT=10
FETCH_PAGE_TIMEOUT=10
USER_AGENTS=["Mozilla/5.0 (Windows NT 10.0; Win64; x64)......."]
```
| Key | Description |
|---|---|
| FETCH_PLAYWRIGHT_TIMEOUT | Timeout for Playwright-rendered pages (in seconds) |
| FETCH_PAGE_TIMEOUT | Timeout for standard page fetch (in seconds) |
| LOG_LEVEL | Logging verbosity (DEBUG, INFO, WARNING, etc.) |
| OUTPUT_FORMAT | Default output format (text, markdown, json, html) |
| USER_AGENTS | Default user agents ("Mozilla/5.0 (Windows NT 10.0; Win64; x64).......") |
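To illustrate how such variables can be picked up, here is a minimal `.env` loader sketch. ScrapeSome may use a different mechanism internally; `load_dotenv` and the `DEMO_`-prefixed keys are illustrative only:

```python
import os
import tempfile

def load_dotenv(path=".env"):
    """Parse KEY=VALUE lines into os.environ without overriding existing values."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo: write a throwaway .env and load it
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("DEMO_LOG_LEVEL=INFO\nDEMO_FETCH_PAGE_TIMEOUT=10\n")
    env_path = fh.name

load_dotenv(env_path)
level = os.environ["DEMO_LOG_LEVEL"]
```

Using `setdefault` means inline configuration and real environment variables take precedence over the file, a common convention for `.env` handling.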
📄 Output Formats
JSON Example
Get the JSON version:
```python
from scrapesome import sync_scraper

content = sync_scraper("https://example.com", output_format_type="json")
print(content)
```
Output:
```json
{
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "url": "https://example.com"
}
```
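As an illustration of how a title/description record like this can be produced from raw HTML, here is a stdlib-only sketch using `html.parser`. This is not ScrapeSome's actual formatter; `MetaExtractor` and `html_to_json` are hypothetical names:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name="description"> content."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def html_to_json(url, html):
    parser = MetaExtractor()
    parser.feed(html)
    return {"title": parser.title, "description": parser.description, "url": url}

doc = ('<html><head><title>Example Domain</title>'
       '<meta name="description" content="An example page."></head></html>')
record = html_to_json("https://example.com", doc)
```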
Markdown
Convert HTML to Markdown with:
```python
from scrapesome import sync_scraper

content = sync_scraper("https://adenuniversity.us", output_format_type="markdown")
print(content)
```
Output:
```markdown
# Online Global Masters that boost your global career
**ADEN University** offers students access to professionals who operate in the world of business and administration, who share their knowledge and acumen collaboratively with their students in all **academic programs** offered at ADEN.
[About Us](about-aden-university)
Watch testimonial video
##### Watch testimonial video
×
[
](https://res.cloudinary.com/cruminott/video/upload/vc_auto,w_auto,q_auto,f_auto/adenu/aden-university-3.mp4)
## ADEN University offers the following academic programs
[](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")
##### [EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")
The ADEN University Executive MBA is designed to strengthen business leaders to manage...
* **37** credits
* **15** months
* **Spanish Only**
[Visit EMBA Course](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")
[](https://adenuniversity.us/academics/global-mba/ "GLOBAL MBA. Master of Business Administration")
##### [GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/academics/global-mba/ "GLOBAL MBA. Master of Business Administration")
The Global MBA is designed to prepare business leaders to manage companies in an...
* **36** credits
* **14** months
* **Spanish and English**
```
Similarly, `async_scraper` can be used with the same options.
📁 Project Structure
```
scrapesome/
├── .gitignore
├── pytest.ini
├── mkdocs.yml
├── .github/
│   ├── workflows/
│   │   └── deploy.yml
│   ├── ISSUE_TEMPLATE/
│   │   └── index.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── CODE_OF_CONDUCT.md
│   └── SECURITY.md
├── __init__.py
├── cli.py
├── config.py
├── exceptions.py
├── formatter/
│   ├── __init__.py
│   └── output_formatter.py
├── logging.py
├── scraper/
│   ├── __init__.py
│   ├── async_scraper.py
│   ├── sync_scraper.py
│   └── rendering.py
├── utils/
│   ├── __init__.py
│   └── file_writer.py
├── docs/
│   ├── index.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── config.md
│   ├── examples.md
│   ├── cli.md
│   ├── about.md
│   ├── licence.md
│   ├── file-saving.md
│   ├── contribution.md
│   ├── output-formats.md
│   └── assets/
│       └── images/
│           └── favicon.png
├── tests/
│   ├── __init__.py
│   ├── test_sync_scraper.py
│   ├── test_async_scraper.py
│   ├── test_config.py
│   ├── test_logging.py
│   ├── test_rendering.py
│   ├── test_file_writer.py
│   ├── test_output_formatter.py
│   └── test_cli.py
├── setup.py
├── requirements.txt
├── pyproject.toml
├── LICENSE
└── README.md
```
📜 License
MIT License © 2025
🤝 Contributions
Contributions are welcome! Whether it's bug reports, feature suggestions, or pull requests, your help is appreciated.
To get started:
```shell
git clone https://github.com/scrapesome/scrapesome.git
cd scrapesome
```