Scrape websites and convert them to clean markdown format
scrape2md 🕷️ → 📝
Scrape entire websites and convert to clean markdown — perfect for LLM training data, RAG systems, and AI applications. Handles iframes, JavaScript navigation, and complex site structures.
Why Markdown?
Markdown is the ideal format for working with LLMs:
- ✅ Clean, structured text for training language models
- ✅ Perfect for RAG (Retrieval-Augmented Generation) pipelines
- ✅ Easy to process, chunk, and embed for vector databases
- ✅ Human-readable and Git-friendly for documentation
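To illustrate the "process, chunk, and embed" point, here is a minimal sketch of heading-based chunking for scraped markdown. The function name `chunk_markdown` and the `max_chars` parameter are assumptions made for this example and are not part of scrape2md. The sketch does not treat `#` lines inside code blocks specially.

```python
import re
from typing import List


def chunk_markdown(text: str, max_chars: int = 1000) -> List[str]:
    """Split markdown at ATX headings, then cap the size of each chunk.

    Hypothetical helper for illustration; not part of scrape2md.
    """
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fallback for oversized sections: cut them into fixed-size slices.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each returned chunk starts at a heading where one exists, which tends to keep a section's title together with its body when the chunks are embedded.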
Features
- 🕷️ Full site crawling with automatic link discovery
- 🖼️ Iframe support for embedded content
- 🧹 Smart cleanup removes navigation, boilerplate, and duplicates
- 📝 Clean markdown output with readable filenames
- 🚀 Headless browser powered by Playwright (handles JavaScript)
Installation
pip install scrape2md
playwright install chromium # One-time browser setup
Quick Start
CLI:
scrape2md https://example.com
scrape2md https://site.com -o docs -m 50 -d 2.0
Python:
from scrape2md import WebScraper
scraper = WebScraper("https://example.com", "output", max_pages=50)
scraper.scrape_site()
Options
scrape2md <url> [options]
  -o, --output DIR       Output directory (default: scraped_sites)
  -m, --max-pages N      Max pages to scrape (default: 100)
  -d, --delay SECONDS    Delay between requests (default: 1.0)
  --download-images      Download images (off by default)
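If you want to drive the CLI from a script, the options above can be assembled into an argument list. This is a sketch only: `build_command` and `run_scrape` are hypothetical helper names, and the flags are the ones listed above. It assumes the `scrape2md` executable is on your `PATH`.

```python
import subprocess
from typing import List, Optional


def build_command(
    url: str,
    output: Optional[str] = None,
    max_pages: Optional[int] = None,
    delay: Optional[float] = None,
    download_images: bool = False,
) -> List[str]:
    """Build the argv list for the scrape2md CLI (hypothetical helper)."""
    cmd = ["scrape2md", url]
    # Only pass a flag when the caller set it, so the CLI defaults still apply.
    if output is not None:
        cmd += ["--output", output]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if download_images:
        cmd.append("--download-images")
    return cmd


def run_scrape(url: str, **options) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(url, **options), check=True)
```

For example, `run_scrape("https://site.com", output="docs", max_pages=50, delay=2.0)` corresponds to the second Quick Start command.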
How It Works
- Discovers site structure from navigation menus
- Crawls pages breadth-first with Playwright (handles JavaScript)
- Extracts content from iframes and dynamic elements
- Strips boilerplate (nav, footer, ads, login forms)
- Converts to clean markdown with smart filenames
- Detects and skips duplicate content
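The crawl and duplicate-detection steps above can be sketched on an in-memory site. This is not the package's implementation: the `Site` type, the `crawl` function, and the choice of a SHA-256 hash over whitespace-normalized text are all assumptions made for illustration. In this sketch the page limit counts kept pages.

```python
import hashlib
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
Site = Dict[str, Tuple[str, List[str]]]


def crawl(site: Site, start: str, max_pages: int = 100) -> List[str]:
    """Return the URLs kept, in breadth-first order, skipping duplicate content."""
    queue = deque([start])
    seen_urls: Set[str] = {start}
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    while queue and len(kept) < max_pages:
        url = queue.popleft()
        if url not in site:
            continue  # Link points to a page that does not exist.
        text, links = site[url]
        # Collapse whitespace before hashing so formatting-only differences are ignored.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(url)
        # Enqueue unseen links; FIFO order gives breadth-first traversal.
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return kept
```

A real crawler would fetch each page with a browser and wait between requests; the queue and the two "seen" sets are the parts this sketch is meant to show.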
Output
scraped_sites/
└── example_com/
    ├── Home.md
    ├── About Us.md
    ├── Documentation.md
    └── ...
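Readable filenames such as `About Us.md` could be derived from page titles roughly as follows. This is a sketch under assumptions: `title_to_filename` is a hypothetical helper, and the actual naming rules used by scrape2md may differ.

```python
import re


def title_to_filename(title: str, fallback: str = "untitled") -> str:
    """Turn a page title into a filesystem-safe markdown filename (hypothetical helper)."""
    # Replace characters that are invalid on common filesystems with spaces.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", title)
    # Collapse runs of whitespace, then trim leading and trailing dots and spaces.
    cleaned = " ".join(cleaned.split()).strip(". ")
    return f"{cleaned or fallback}.md"
```

Keeping spaces in the name preserves readability, at the cost of needing quotes when the path is used in a shell.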
Limitations
- Requires Chromium browser (installed via Playwright)
- Doesn't handle login-protected content
- Google Docs embeds are linked but not downloaded
- Default limit: 100 pages per site (configurable)
Development
git clone https://github.com/taralika/scrape2md.git
cd scrape2md
pip install -e ".[dev]" # Install with dev dependencies (quotes keep zsh from expanding the brackets)
playwright install chromium # One-time browser setup
pytest # Run tests
black . # Format code
ruff check . # Lint code
mypy src/ # Type checking
Contributing
Pull requests welcome! Please open an issue first to discuss major changes.
License
MIT License - see LICENSE file for details
Author
Anand Taralika - GitHub
Changelog
0.1.0 (2025-11-19) — Initial release
File details
Details for the file scrape2md-0.1.0.tar.gz.
File metadata
- Download URL: scrape2md-0.1.0.tar.gz
- Upload date:
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 984d960db31ec7dfcace3b238338be8b136f62d43d83bd2ccee6d3f1e1fdc9be |
| MD5 | 2568f9c69681e818d255ed2b39422761 |
| BLAKE2b-256 | e3c6865bb486c7273afee48d8b59da18a75f0776cce0f80bf2410771fb16546e |
File details
Details for the file scrape2md-0.1.0-py3-none-any.whl.
File metadata
- Download URL: scrape2md-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b4f82c81b76a76c522ed2d3f98667e37edca79b2a1886b3f627ee9ef1729fdab |
| MD5 | d8b2ff41ab51ac2ad3084547dc4ca49a |
| BLAKE2b-256 | 0c492cc9042f9bd63f710a9ce4b1d24764709aec3706239dde89d9aceca18c2c |
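To check a downloaded file against a published SHA256 digest, a sketch along these lines should work. The helper names `sha256_of_stream` and `verify_file` are assumptions for this example; only the standard-library `hashlib` module is used.

```python
import hashlib
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops when read() returns empty bytes at end of stream.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest with an expected hex string."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected_hex.strip().lower()
```

For example, `verify_file("scrape2md-0.1.0.tar.gz", "<SHA256 from the table>")` should return `True` for an intact download.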