
A powerful web scraper with dynamic rendering support.


ScrapeSome


ScrapeSome is a lightweight, flexible web scraping library with both synchronous and asynchronous support. It includes intelligent fallbacks, JavaScript page rendering, response formatting (HTML → Text/JSON/Markdown), and retry mechanisms. Ideal for developers who need robust scraping utilities with minimal setup.
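The HTML → text step of that formatting can be sketched with the standard library alone. This only illustrates the idea; ScrapeSome's own formatter is not reproduced here:

```python
from html.parser import HTMLParser

# Minimal HTML -> plain-text extractor: collects text nodes, skipping
# <script> and <style> contents.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

text = html_to_text("<html><body><h1>Example Domain</h1><p>Hello</p></body></html>")
```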



💡 Why Use ScrapeSome?

  • Handles both static and JS-heavy pages out of the box
  • Supports both sync and async scraping
  • Converts raw HTML into clean text, JSON, or Markdown
  • Works with minimal configuration (pip install scrapesome)
  • Handles timeouts, retries, redirects, user agents
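The retry handling mentioned in the last bullet follows a common retry-with-backoff pattern, sketched below in plain Python. This shows the pattern only, not scrapesome's actual internals:

```python
import time

# Retry a fetch callable with exponential backoff; re-raise the last
# error once all attempts are exhausted.
def with_retries(fetch, attempts=3, base_delay=0.01):
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error

# A deliberately flaky fetch that succeeds on the third call.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

result = with_retries(flaky_fetch)
```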

🚀 Features

  • 🔁 Sync + Async scraping support
  • 🔄 Automatic retries and intelligent fallbacks
  • 🧪 Playwright rendering fallback for JS-heavy pages
  • 📝 Format responses as raw HTML, plain text, Markdown, or structured JSON
  • ⚙️ Configurable: timeouts, redirects, user agents, and logging
  • 🧪 Test coverage with pytest and pytest-asyncio

⚖ Comparison with Alternatives

| Feature | ScrapeSome ✅ | Playwright (Python) | Selenium + UC | Requests-HTML | Scrapy + Playwright |
|---|---|---|---|---|---|
| 🧠 JS Rendering Support | ✅ Auto fallback on 403/JS content | ✅ Always (manual control) | ✅ Always (manual control) | ✅ Partial (via Pyppeteer) | ✅ Requires setup |
| 🔄 Automatic Fallback (403/Blank) | ✅ Yes (seamless) | ❌ Manual logic needed | ❌ Manual logic needed | ❌ No | ❌ Needs per-request config |
| 🔁 Uses Browser Engine | ✅ Only when needed (Playwright) | ✅ Always | ✅ Always | ✅ (Unstable, slow) | ✅ Always (if enabled) |
| ✅ Sync + Async Support | ✅ Built-in | ❌ Async only | ❌ Manual (via threading) | ❌ Sync only | ❌ Async only (via plugin) |
| 📝 JSON/Markdown/HTML Output | ✅ Built-in formats | ❌ Manual parsing | ❌ Manual parsing | ❌ Basic only | ❌ Custom pipeline needed |
| ⚡ Minimal Setup | ✅ Near zero | ❌ Code + browser install | ❌ Driver + setup | ✅ Simple pip install | ❌ Complex + plugin setup |
| 🔁 Retries, Timeouts, Agents | ✅ Smart defaults built-in | ❌ Manual handling | ❌ Manual handling | ❌ Limited | ⚠️ Partial via settings |
| 🧪 Pytest-Ready Out-of-the-box | ✅ Fully testable | ⚠️ Requires mocks | ❌ Hard to test | ❌ Minimal | ⚠️ Needs testing harness |
| ⚙️ Config via .env / Inline | ✅ Flexible and optional | ❌ Code/config only | ❌ Manual via code | ❌ Hardcoded mostly | ⚠️ Project settings |
| 📦 Install & Run in <1 Min | ✅ Yes | ❌ Setup required | ❌ Driver + config needed | ✅ Yes | ❌ Needs project + plugin |

📦 Installation

pip install scrapesome

Playwright Setup

ScrapeSome uses Playwright for JavaScript rendering fallback. To enable this, you need to install Playwright and its dependencies.

1. Install the Playwright Python package (if not already installed)

pip install playwright

2. Install Playwright browsers

playwright install

3. Install system dependencies

Playwright requires some system libraries to run browsers, which vary by operating system.

For Windows: Playwright installs everything you need automatically with playwright install, so no additional setup is usually required.

For Linux (Ubuntu/Debian): run the following command to install the required system libraries:

playwright install-deps

If you don't have playwright CLI available, you can install dependencies manually:

sudo apt-get update
sudo apt-get install -y libwoff1 libopus0 libwebp6 libharfbuzz-icu0 libwebpmux3 \
                        libenchant-2-2 libhyphen0 libegl1 libglx0 libgudev-1.0-0 \
                        libevdev2 libgles2 libx264-160

Note: Package names may vary depending on your distribution and version.

For macOS: install the required libraries using Homebrew:

brew install harfbuzz enchant

After this setup, you should be able to use ScrapeSome with full Playwright rendering support!

⚡ Quick Start

Synchronous Example

from scrapesome import sync_scraper

html = sync_scraper("https://example.com")
print(html)

Asynchronous Example

import asyncio
from scrapesome import async_scraper

html = asyncio.run(async_scraper("https://example.com"))
print(html)

๐Ÿ–ฅ๏ธ CLI Usage

ScrapeSome also includes a powerful CLI for quick and easy scraping from the command line.

📦 Installation with CLI Support

To use the CLI, install with the optional cli extras:

pip install "scrapesome[cli]"

🔧 Basic Usage

scrapesome scrape --url https://example.com

This performs a synchronous scrape and prints the result (HTML by default).

โš™๏ธ Available Options

Option Description Default
--async-mode Use asynchronous scraping False
--force-playwright Force JavaScript rendering using Playwright False
--output-format Choose text, json, markdown, or html html
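For illustration, the documented flags map onto a standard argparse interface like this. This is only a sketch of the interface shape; the actual CLI may be built with a different framework:

```python
import argparse

# Model the documented "scrapesome scrape" subcommand and its flags.
parser = argparse.ArgumentParser(prog="scrapesome")
sub = parser.add_subparsers(dest="command")
scrape = sub.add_parser("scrape")
scrape.add_argument("--url", required=True)
scrape.add_argument("--async-mode", action="store_true")
scrape.add_argument("--force-playwright", action="store_true")
scrape.add_argument("--output-format",
                    choices=["text", "json", "markdown", "html"],
                    default="html")

# Parse a sample invocation (equivalent to the JSON-output example below).
args = parser.parse_args(
    ["scrape", "--url", "https://example.com", "--output-format", "json"]
)
```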

Examples

Basic scrape

scrapesome scrape --url https://example.com

Force Playwright rendering

scrapesome scrape --url https://example.com --force-playwright

Get JSON output

scrapesome scrape --url https://example.com --output-format json

Async scrape with markdown output

scrapesome scrape --url https://example.com --async-mode --output-format markdown

📄 File Saving

ScrapeSome allows you to format and save your scraped content with zero hassle, both via the CLI and in Python code.


💻 Save Output to File

Use these flags to save your output directly from the command line:

  • --save-to-file or -s: Enable saving to a file
  • --file-name or -n: Desired filename (extension added automatically)
  • --output-format or -f: One of html, text, markdown, or json

โš ๏ธ Note: When saving to a file, only one URL can be scraped at a time.

📦 Example:

scrapesome scrape --url "https://example.com" --output-format markdown --save-to-file --file-name output

👉 This saves the result as output.md.


Save Output in Code

The sync_scraper function supports saving to file using two optional flags:

  • save_to_file=True: Enables saving
  • file_name="your_file_name": Sets the base filename (extension inferred from format)

The output will be returned as a dictionary:

{
    "data": "<formatted content>",
    "file": "your_file_name.<ext>"  # if saving is enabled
}

📌 Example:

result = sync_scraper(url="https://example.com", output_format_type="json", save_to_file=True, file_name="example_output")
print(f"Saved output to {result.get('file')}")
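Conceptually, saving boils down to mapping the format to a file extension and writing the data. A minimal sketch with a hypothetical save_output helper (not the library's actual implementation):

```python
import os
import tempfile

# Hypothetical format -> extension mapping, mirroring the documented
# behaviour (markdown saves as .md, text as .txt, and so on).
EXTENSIONS = {"html": "html", "text": "txt", "markdown": "md", "json": "json"}

def save_output(data, file_name, output_format):
    # Infer the extension from the output format and write the file.
    path = f"{file_name}.{EXTENSIONS[output_format]}"
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(data)
    # Return the same dictionary shape described above.
    return {"data": data, "file": path}

workdir = tempfile.mkdtemp()
result = save_output("# Example", os.path.join(workdir, "output"), "markdown")
```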

Now you're set to save clean, readable data in your preferred format, programmatically or from the CLI.

🧰 Advanced Usage

Force Rendering (Playwright)

from scrapesome import sync_scraper

content = sync_scraper("https://example.com", force_playwright=True)
print(content)

Custom User Agents

from scrapesome import sync_scraper

content = sync_scraper("https://example.com", user_agents=["MyCustomAgent/1.0"])
print(content)

Control Redirects

from scrapesome import sync_scraper

content = sync_scraper("https://example.com", allow_redirects=False)
print(content)

Similarly, async_scraper can be used with the same options.
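Because async_scraper is awaitable, several pages can be scraped concurrently with asyncio.gather. The stub below stands in for the real import so the pattern is self-contained; swap in `from scrapesome import async_scraper` in practice:

```python
import asyncio

# Stand-in for scrapesome's async_scraper, used only to keep this
# concurrency sketch runnable without network access.
async def async_scraper(url, **kwargs):
    await asyncio.sleep(0)  # placeholder for network I/O
    return f"<html>{url}</html>"

async def scrape_all(urls):
    # gather() runs all scrapes concurrently on one event loop and
    # returns results in the same order as the input URLs.
    return await asyncio.gather(*(async_scraper(u) for u in urls))

results = asyncio.run(scrape_all(["https://example.com", "https://example.org"]))
```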

🧪 Testing

Run tests with:

pytest --cov=scrapesome tests/

Target coverage: 75–100%

โš™๏ธ Environment Configuration

ScrapeSome reads its configuration from environment variables, including those loaded from a .env file if one is present.

Example .env

LOG_LEVEL=INFO
OUTPUT_FORMAT=text
FETCH_PLAYWRIGHT_TIMEOUT=10
FETCH_PAGE_TIMEOUT=10
USER_AGENTS=["Mozilla/5.0 (Windows NT 10.0; Win64; x64)......."]
| Key | Description |
|---|---|
| FETCH_PLAYWRIGHT_TIMEOUT | Timeout for Playwright-rendered pages (in seconds) |
| FETCH_PAGE_TIMEOUT | Timeout for standard page fetches (in seconds) |
| LOG_LEVEL | Logging verbosity (DEBUG, INFO, WARNING, etc.) |
| OUTPUT_FORMAT | Default output format (text, markdown, json, html) |
| USER_AGENTS | Default user agents ("Mozilla/5.0 (Windows NT 10.0; Win64; x64).......") |
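Reading these settings back is plain environment-variable access. The sketch below simulates values a .env loader would have exported; scrapesome's own config loader is not reproduced here:

```python
import os

# Simulate the values a .env loader would have placed in the environment.
os.environ["LOG_LEVEL"] = "INFO"
os.environ["OUTPUT_FORMAT"] = "text"
os.environ["FETCH_PAGE_TIMEOUT"] = "10"

# Read each setting back, falling back to a sensible default when a key
# is absent. Timeouts arrive as strings and must be converted.
page_timeout = int(os.environ.get("FETCH_PAGE_TIMEOUT", "10"))
output_format = os.environ.get("OUTPUT_FORMAT", "html")
log_level = os.environ.get("LOG_LEVEL", "INFO")
```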

📄 Output Formats

JSON Example

Get the JSON version of a page:

from scrapesome import sync_scraper

content = sync_scraper("https://example.com", output_format_type="json")
print(content)

Output

{
  "title": "Example Domain",
  "description": "This domain is for use in illustrative examples.",
  "url": "https://example.com"
}
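The title and description fields in this JSON can be pulled from a page's `<title>` and meta tags. A standard-library sketch of that idea (not scrapesome's actual formatter):

```python
from html.parser import HTMLParser

# Collect the <title> text and the content of <meta name="description">.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def page_to_json(html, url):
    parser = MetaExtractor()
    parser.feed(html)
    return {"title": parser.title, "description": parser.description, "url": url}

doc = ('<html><head><title>Example Domain</title>'
       '<meta name="description" content="Illustrative examples."></head>'
       '<body></body></html>')
record = page_to_json(doc, "https://example.com")
```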

Markdown

Convert HTML to Markdown with:

from scrapesome import sync_scraper

content = sync_scraper("https://adenuniversity.us", output_format_type="markdown")
print(content)

Output

# Online Global Masters that boost your global career

**ADEN University** offers students access to professionals who operate in the world of business and administration, who share their knowledge and acumen collaboratively with their students in all **academic programs** offered at ADEN.

[About Us](about-aden-university)


Watch testimonial video 


##### Watch testimonial video

×

[

](https://res.cloudinary.com/cruminott/video/upload/vc_auto,w_auto,q_auto,f_auto/adenu/aden-university-3.mp4)



## ADEN University offers the following academic programs

[![EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_Emba_900-1-820x400.jpg "EXECUTIVE MBA. Master of Business Administration")](https://adenuniversity.us/academics/executive-mba/  "EXECUTIVE MBA. Master of Business Administration")

##### [EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")

The ADEN University Executive MBA is designed to strengthen business leaders to manage...

* **37** credits
* **15** months
* **Spanish Only**

[Visit EMBA Course](https://adenuniversity.us/academics/executive-mba/ "EXECUTIVE MBA. Master of Business Administration")

[![GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_MBAgl1_900-820x400.jpg "GLOBAL MBA. Master of Business Administration")](https://adenuniversity.us/academics/global-mba/  "GLOBAL MBA. Master of Business Administration")

##### [GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/academics/global-mba/ "GLOBAL MBA. Master of Business Administration")

The Global MBA is designed to prepare business leaders to manage companies in an...

* **36** credits
* **14** months
* **Spanish and English**

Similarly, async_scraper can be used with the same options.

๐Ÿ“ Project Structure

scrapesome/
├── .gitignore
├── pytest.ini
├── mkdocs.yml
├── .github/
│   ├── workflows/
│   │   └── deploy.yml
│   ├── ISSUE_TEMPLATE/
│   │   └── index.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── CODE_OF_CONDUCT.md
│   └── SECURITY.md
├── __init__.py
├── cli.py
├── config.py
├── exceptions.py
├── formatter/
│   ├── __init__.py
│   └── output_formatter.py
├── logging.py
├── scraper/
│   ├── __init__.py
│   ├── async_scraper.py
│   ├── sync_scraper.py
│   └── rendering.py
├── utils/
│   ├── __init__.py
│   └── file_writer.py
├── docs/
│   ├── index.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── config.md
│   ├── examples.md
│   ├── cli.md
│   ├── about.md
│   ├── licence.md
│   ├── file-saving.md
│   ├── contribution.md
│   ├── output-formats.md
│   └── assets/
│       └── images/
│           └── favicon.png
├── tests/
│   ├── __init__.py
│   ├── test_sync_scraper.py
│   ├── test_async_scraper.py
│   ├── test_config.py
│   ├── test_logging.py
│   ├── test_rendering.py
│   ├── test_file_writer.py
│   ├── test_output_formatter.py
│   └── test_cli.py
├── setup.py
├── requirements.txt
├── pyproject.toml
├── LICENSE
└── README.md

🔒 License

MIT License © 2025

๐Ÿค Contributions

Contributions are welcome! Whether it's bug reports, feature suggestions, or pull requests โ€” your help is appreciated.

To get started:

git clone https://github.com/scrapesome/scrapesome.git
cd scrapesome
