A Powerful Web Scraper with dynamic rendering support.

ScrapeSome

ScrapeSome is a lightweight, flexible web scraping library with both synchronous and asynchronous support. It includes intelligent fallbacks, JavaScript page rendering, response formatting (HTML → Text/JSON/Markdown), and retry mechanisms. Ideal for developers who need robust scraping utilities with minimal setup.


💡 Why Use ScrapeSome?

  • Handles both static and JS-heavy pages out of the box
  • Supports both sync and async scraping
  • Converts raw HTML into clean text, JSON, or Markdown
  • Works with minimal configuration (pip install scrapesome)
  • Handles timeouts, retries, redirects, and user agents

🚀 Features

  • 🔁 Sync + Async scraping support
  • 🔄 Automatic retries and intelligent fallbacks
  • 🧪 Playwright rendering fallback for JS-heavy pages
  • 📝 Format responses as raw HTML, plain text, Markdown, or structured JSON
  • ⚙️ Configurable: timeouts, redirects, user agents, and logging
  • 🧪 Test coverage with pytest and pytest-asyncio
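
The automatic retries listed above can be pictured with a small sketch. This is not ScrapeSome's internal code: the helper name fetch_with_retries, its parameters, and the exponential backoff schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on any exception with exponential backoff.

    Hypothetical helper: names and defaults are illustrative only.
    """
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < retries:
                # Wait base_delay, then 2x, then 4x, ... before the next attempt.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"giving up on {url} after {retries + 1} attempts"
    ) from last_error
```

Any callable that takes a URL and returns content can be passed as fetch, for example sync_scraper. The sleep parameter exists only so the delays can be replaced in tests.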

⚖ Comparison with Alternatives

| Feature | ScrapeSome ✅ | Scrapy | Selenium/UC | Playwright (Raw) |
|---|---|---|---|---|
| ✅ Sync + Async Scraping | ✅ Built-in | ❌ Async only* | ❌ Manual | ❌ Manual |
| 🧠 JS Rendering (Fallback) | ✅ Seamless | ❌ Plugin setup | ✅ Full | ✅ Full |
| 📝 Output as JSON/Markdown/HTML | ✅ Built-in | ❌ Requires custom | ❌ Manual parsing | ❌ Manual parsing |
| 🔁 Retry & Timeout Handling | ✅ Built-in | ⚠️ Requires config | ❌ Manual | ❌ Manual |
| ⚡ Minimal Setup (Boilerplate) | ✅ Near zero | ❌ Needs project | ❌ Driver setup | ❌ Browser install |
| 🧪 Testable out-of-the-box | ✅ Pytest-ready | ⚠️ Complex | ❌ | ❌ |
| 🛠️ Config via .env or inline | ✅ Simple | ⚠️ Complex | ❌ | ❌ |
| 📦 Install & Run in <1 Min | ✅ Yes | ❌ | ❌ | ❌ |

📦 Installation

pip install scrapesome

Playwright Setup

ScrapeSome uses Playwright for JavaScript rendering fallback. To enable this, you need to install Playwright and its dependencies.

1. Install the Playwright Python package, if it is not already installed

pip install playwright

2. Install Playwright browsers

playwright install

3. Install system dependencies

Playwright requires some system libraries to run browsers, which vary by operating system.

For Windows: Playwright installs everything you need automatically with playwright install, so no additional setup is usually required.

For Linux (Ubuntu/Debian): Run the following command to install the required system libraries:

playwright install-deps

If the Playwright CLI is not available, you can install the dependencies manually:

sudo apt-get update
sudo apt-get install -y libwoff1 libopus0 libwebp6 libharfbuzz-icu0 libwebpmux3 \
                        libenchant-2-2 libhyphen0 libegl1 libglx0 libgudev-1.0-0 \
                        libevdev2 libgles2 libx264-160

Note: Package names may vary depending on your distribution and version.

For macOS: You can install the required libraries using Homebrew:

brew install harfbuzz enchant

After this setup, you should be able to use ScrapeSome with full Playwright rendering support!
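
To confirm the first step of the setup from Python, a quick check like the following may help. It is a sketch that uses only the standard library; the helper name playwright_status is hypothetical, and the check does not verify that the browser binaries from playwright install were downloaded.

```python
import importlib.util
import shutil


def playwright_status() -> dict:
    """Report whether the Playwright package and CLI can be found."""
    return {
        # True if `import playwright` would succeed in this environment.
        "package_installed": importlib.util.find_spec("playwright") is not None,
        # Path to the `playwright` executable, or None if it is not on PATH.
        "cli_path": shutil.which("playwright"),
    }


if __name__ == "__main__":
    print(playwright_status())
```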

⚡ Quick Start

Synchronous Example

from scrapesome import sync_scraper

# Fetch the page synchronously and print the result.
html = sync_scraper("https://example.com")
print(html)

Asynchronous Example

import asyncio
from scrapesome import async_scraper

# async_scraper is a coroutine, so run it inside an event loop.
html = asyncio.run(async_scraper("https://example.com"))
print(html)
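
The asynchronous scraper is most useful when several pages are fetched at once. The sketch below shows one way to do that with asyncio.gather and a semaphore. The helper scrape_many and the stand-in fake_scraper are hypothetical names, not part of ScrapeSome; the stand-in is there so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def scrape_many(
    urls: Iterable[str],
    scraper: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Dict[str, str]:
    """Scrape several URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> str:
        async with semaphore:  # limit the number of in-flight requests
            return await scraper(url)

    url_list = list(urls)
    pages = await asyncio.gather(*(scrape_one(u) for u in url_list))
    return dict(zip(url_list, pages))


async def fake_scraper(url: str) -> str:
    """Stand-in for async_scraper so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    print(asyncio.run(scrape_many(["https://example.com"], fake_scraper)))
```

To scrape real pages, pass async_scraper from scrapesome in place of fake_scraper.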

๐Ÿ–ฅ๏ธ CLI Usage

ScrapeSome also includes a powerful CLI for quick and easy scraping from the command line.

📦 Installation with CLI Support

To use the CLI, install with the optional cli extras:

pip install "scrapesome[cli]"

🔧 Basic Usage

scrapesome scrape --url https://example.com

This performs a synchronous scrape and prints the result in the default output format (listed under Available Options).

โš™๏ธ Available Options

Option Description Default
--async-mode Use asynchronous scraping False
--force-playwright Force JavaScript rendering using Playwright False
--output-format Choose text, json, markdown, or html html

Examples

Basic scrape

scrapesome scrape --url https://example.com

Force Playwright rendering

scrapesome scrape --url https://example.com --force-playwright

Get JSON output

scrapesome scrape --url https://example.com --output-format json

Async scrape with markdown output

scrapesome scrape --url https://example.com --async-mode --output-format markdown
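
When the CLI is driven from a Python script, the commands above can be assembled programmatically. The sketch below uses only the flags documented in this section. The helper names build_scrape_command and run_scrape are hypothetical, and run_scrape assumes that the CLI writes its result to standard output.

```python
import subprocess
from typing import List, Optional


def build_scrape_command(
    url: str,
    async_mode: bool = False,
    force_playwright: bool = False,
    output_format: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `scrapesome scrape` from the documented options."""
    command = ["scrapesome", "scrape", "--url", url]
    if async_mode:
        command.append("--async-mode")
    if force_playwright:
        command.append("--force-playwright")
    if output_format is not None:
        if output_format not in {"text", "json", "markdown", "html"}:
            raise ValueError(f"unsupported output format: {output_format}")
        command += ["--output-format", output_format]
    return command


def run_scrape(url: str, **options) -> str:
    """Run the CLI and return its standard output (requires scrapesome[cli])."""
    completed = subprocess.run(
        build_scrape_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```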

🧰 Advanced Usage

Force Rendering (Playwright)

from scrapesome import sync_scraper

# Force JavaScript rendering with Playwright.
content = sync_scraper("https://example.com", force_playwright=True)
print(content)
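
Deciding when to force rendering is a judgement call. One rough heuristic is to check how much visible text the static HTML contains: a page that is mostly an empty container plus scripts probably needs a browser. The sketch below is an assumption for illustration, not ScrapeSome's own fallback logic; the name looks_js_rendered and the 200-character threshold are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Return True if the static HTML has too little visible text to be useful."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A caller could fetch a page once without rendering, run this check on the result, and repeat the request with force_playwright=True only when it returns True.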

Custom User Agents

from scrapesome import sync_scraper

# Send the request with a custom User-Agent.
content = sync_scraper("https://example.com", user_agents=["MyCustomAgent/1.0"])
print(content)
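
Because user_agents takes a list, it is natural to supply several agents. How ScrapeSome chooses among them is not stated here; the sketch below assumes a random pick per request and shows how such a choice could be turned into request headers. The names build_headers and FALLBACK_USER_AGENT are hypothetical.

```python
import random
from typing import Dict, Optional, Sequence

# Placeholder default, used only when no custom agents are supplied.
FALLBACK_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_headers(
    user_agents: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Pick one user agent at random and return it as request headers."""
    chooser = rng or random.Random()
    if user_agents:
        agent = chooser.choice(list(user_agents))
    else:
        agent = FALLBACK_USER_AGENT
    return {"User-Agent": agent}
```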

Control Redirects

from scrapesome import sync_scraper

# Do not follow HTTP redirects.
content = sync_scraper("https://example.com", allow_redirects=False)
print(content)

Similarly, async_scraper can be used in the same way.

🧪 Testing

Run tests with:

pytest --cov=scrapesome tests/

Target coverage: 75–100%

โš™๏ธ Environment Configuration

ScrapeSome reads its configuration from environment variables, which can be defined in a .env file if one is present.

Example .env

LOG_LEVEL=INFO
OUTPUT_FORMAT=text
FETCH_PLAYWRIGHT_TIMEOUT=10
FETCH_PAGE_TIMEOUT=10
USER_AGENTS=["Mozilla/5.0 (Windows NT 10.0; Win64; x64)......."]

| Key | Description |
|---|---|
| FETCH_PLAYWRIGHT_TIMEOUT | Timeout for Playwright-rendered pages (in seconds) |
| FETCH_PAGE_TIMEOUT | Timeout for standard page fetch (in seconds) |
| LOG_LEVEL | Logging verbosity (DEBUG, INFO, WARNING, etc.) |
| OUTPUT_FORMAT | Default output format (text, markdown, json, html) |
| USER_AGENTS | Default user agents ("Mozilla/5.0 (Windows NT 10.0; Win64; x64).......") |
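
The keys above could be read from the environment roughly as follows. This is a sketch, not ScrapeSome's config module: the function name load_settings is hypothetical, the fallback values are copied from the example .env and are not confirmed library defaults, and USER_AGENTS is assumed to be a JSON list as in that example.

```python
import json
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented keys, falling back to illustrative defaults."""
    env = os.environ if env is None else env
    return {
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "output_format": env.get("OUTPUT_FORMAT", "text"),
        # Timeouts are given in seconds.
        "playwright_timeout": int(env.get("FETCH_PLAYWRIGHT_TIMEOUT", "10")),
        "page_timeout": int(env.get("FETCH_PAGE_TIMEOUT", "10")),
        # USER_AGENTS is written as a JSON list in the example .env.
        "user_agents": json.loads(env.get("USER_AGENTS", "[]")),
    }


if __name__ == "__main__":
    print(load_settings())
```

The sketch reads variables that are already set in the process environment; it does not parse the .env file itself.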

📄 Output Formats

JSON Example

Get the JSON version of a page:

from scrapesome import sync_scraper

# Request structured JSON output.
content = sync_scraper("https://example.com", output_format_type="json")
print(content)

Output

{
  "title": "Example Domain",
  "description": "This domain is for use in illustrative examples.",
  "url": "https://example.com"
}
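
To give a feel for how HTML can be reduced to a structure like the one above, here is a minimal sketch built on the standard library's html.parser. It is not ScrapeSome's formatter: the names _MetaExtractor and html_to_summary are hypothetical, and taking the description from a meta tag is an assumption.

```python
import json
from html.parser import HTMLParser


class _MetaExtractor(HTMLParser):
    """Pick out the <title> text and the description <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def html_to_summary(html: str, url: str) -> dict:
    """Return a dict with title, description, and url keys (illustrative only)."""
    parser = _MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description, "url": url}


if __name__ == "__main__":
    page = (
        "<html><head><title>Example Domain</title>"
        '<meta name="description" content="An example page."></head></html>'
    )
    print(json.dumps(html_to_summary(page, "https://example.com"), indent=2))
```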

Markdown

Convert HTML to Markdown with:

from scrapesome import sync_scraper

# Request Markdown output.
content = sync_scraper("https://adenuniversity.us", output_format_type="markdown")
print(content)

Output

# Online Global Masters that boost your global career

**ADEN University** offers students access to professionals who operate in the world of business and administration, who share their knowledge and acumen collaboratively with their students in all **academic programs** offered at ADEN.

[About Us](about-aden-university)


Watch testimonial video 


##### Watch testimonial video

×

[

](https://res.cloudinary.com/cruminott/video/upload/vc_auto,w_auto,q_auto,f_auto/adenu/aden-university-3.mp4)



## ADEN University offers the following academic programs

[![EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_Emba_900-1-820x400.jpg "EXECUTIVE MBA. Master of Business Administration")](https://adenuniversity.us/academics/executive-mba/  "EXECUTIVE MBA. Master of Business Administration")

##### [EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")

The ADEN University Executive MBA is designed to strengthen business leaders to manage...

* **37** credits
* **15** months
* **Spanish Only**

[Visit EMBA Course](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")

[![GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_MBAgl1_900-820x400.jpg "GLOBAL MBA. Master of Business Administration")](https://adenuniversity.us/academics/global-mba/  "GLOBAL MBA. Master of Business Administration")

##### [GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/academics/global-mba/ "GLOBAL MBA. Master of Business Administration")

The Global MBA is designed to prepare business leaders to manage companies in an...

* **36** credits
* **14** months
* **Spanish and English**

Similarly, async_scraper can be used in the same way.

📁 Project Structure

scrapesome/
├── .gitignore
├── pytest.ini
├── .github/
│   ├── workflows/
│   │   └── deploy.yml
│   ├── ISSUE_TEMPLATE/
│   │   └── index.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── CODE_OF_CONDUCT.md
│   └── SECURITY.md
├── __init__.py
├── cli.py
├── config.py
├── exceptions.py
├── formatter/
│   ├── __init__.py
│   └── output_formatter.py
├── logging.py
├── scraper/
│   ├── __init__.py
│   ├── async_scraper.py
│   ├── sync_scraper.py
│   └── rendering.py
├── docs/
│   ├── index.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── config.md
│   ├── examples.md
│   ├── cli.md
│   ├── about.md
│   └── licence.md
├── tests/
│   ├── __init__.py
│   ├── test_sync_scraper.py
│   ├── test_async_scraper.py
│   └── test_config.py
├── setup.py
├── requirements.txt
├── pyproject.toml
├── LICENSE
└── README.md

🔒 License

MIT License © 2025

🤝 Contributions

Contributions are welcome! Whether it's bug reports, feature suggestions, or pull requests, your help is appreciated.

To get started:

git clone https://github.com/scrapesome/scrapesome.git
cd scrapesome
