A lightning-fast tool for crawling websites and compiling PDFs of their pages
InSite
by Bloom Research
InSite is a Python module for crawling websites and compiling PDFs of their pages. It is primarily intended for crawling code-documentation sites and downloading their pages as PDFs, for offline reference and for retrieval-augmented generation (RAG) pipelines in LLM applications.
Features
- Efficient parallel web crawling with Playwright
- Smart link discovery even for non-standard link formats
- Content filtering with positive and negative filter patterns
- Proper media rendering before PDF conversion
- Hierarchical PDF organization based on URL structure
- PDF merging capability for creating comprehensive documentation
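The hierarchical organization means each page's PDF lands in a directory tree mirroring its URL path. A minimal sketch of that idea (the `url_to_pdf_path` helper below is illustrative, not part of the insite API):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def url_to_pdf_path(url: str, output_dir: str = "docs") -> str:
    """Map a page URL to a nested PDF path mirroring the URL structure."""
    parsed = urlparse(url)
    path = PurePosixPath(parsed.path.strip("/") or "index")
    # Keep the directory hierarchy, swap the page's extension for .pdf
    return str(PurePosixPath(output_dir) / path.parent / (path.stem + ".pdf"))

print(url_to_pdf_path("https://docs.python.org/3/library/asyncio.html"))
# docs/3/library/asyncio.pdf
```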
Installation
```shell
pip install insite
```
Requirements
- Python 3.7+
- Playwright
- pypdf
After installation, you'll need to install the Playwright browsers:
```shell
playwright install
```
Usage Examples
Basic Web Crawling
```python
import asyncio
from insite import InsiteScraper

async def main():
    # Create a scraper for a documentation site
    scraper = InsiteScraper("https://docs.python.org/3/")

    # Get all links on the site
    links = await scraper()
    print(f"Found {len(links)} links")

asyncio.run(main())
```
Converting Pages to PDFs
```python
import asyncio
from insite import InsiteScraper, InsiteProcessor

async def main():
    # First, get all links from a site
    scraper = InsiteScraper("https://docs.python.org/3/library/")
    links = await scraper()

    # Then convert them to PDFs
    processor = InsiteProcessor(output_dir="python_docs")
    successes, failures = await processor.process_links(links)
    print(f"Successfully created {successes} PDFs")

    # Optionally create a single merged PDF
    master_file = processor.merge_to_masterfile("python_library_docs.pdf")
    print(f"Created master file: {master_file}")

asyncio.run(main())
```
Filtering Content
```python
import asyncio
import re
from insite import InsiteScraper, InsiteProcessor

async def main():
    # Only include specific sections and exclude others
    positive_filters = ['/library/']  # Only library documentation
    negative_filters = [
        '/archives/',                 # Skip archived content
        re.compile(r'\.(jpg|gif)$'),  # Skip images
    ]

    scraper = InsiteScraper(
        "https://docs.python.org/3/",
        max_concurrent=10,  # Use 10 concurrent workers
        debug=True,         # Enable debug output
        positive_filters=positive_filters,
        negative_filters=negative_filters,
    )
    links = await scraper()

    # Process the filtered links
    processor = InsiteProcessor(output_dir="python_library_docs")
    await processor.process_links(links)

asyncio.run(main())
```
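As the example above shows, filters can mix plain strings and compiled regular expressions. One plausible reading of the semantics (the `passes_filters` helper below is illustrative, not insite's actual implementation) is that strings match by substring, compiled patterns match via `search`, positive filters act as an allow-list, and negative filters act as a deny-list:

```python
import re

def passes_filters(url, positive=None, negative=None):
    """Keep a URL only if it matches some positive filter (when given)
    and matches no negative filter."""
    def matches(pattern, url):
        if isinstance(pattern, re.Pattern):
            return pattern.search(url) is not None
        return pattern in url  # plain strings match by substring

    if positive and not any(matches(p, url) for p in positive):
        return False
    if negative and any(matches(p, url) for p in negative):
        return False
    return True

print(passes_filters(
    "https://docs.python.org/3/library/os.html",
    positive=["/library/"],
    negative=["/archives/", re.compile(r"\.(jpg|gif)$")],
))
# True
```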
Command-Line Usage
The module includes a command-line tool for quick documentation scraping:
```shell
python -m insite.cli --url https://docs.python.org/3/ --output python_docs --max-pages 100 --create-master
```
License
GNU General Public License - GPLv3
File details
Details for the file insite-0.1.1.tar.gz.
File metadata
- Download URL: insite-0.1.1.tar.gz
- Size: 25.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6fa23aabe35e94e860412f45c8eb8fd9aaf8d056da122b241a02d415bfce7a61 |
| MD5 | 316d6f737b665e2a0f5b09263f049944 |
| BLAKE2b-256 | 756dafe505c2bfdd24ce6da1c68bc29922a18b95f2f80783b2b6f2affc8527ac |
File details
Details for the file insite-0.1.1-py3-none-any.whl.
File metadata
- Download URL: insite-0.1.1-py3-none-any.whl
- Size: 25.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ceee48c3c2b52b93189e45ac83a85c4dc43b2869e21739c428cf9f0195e07956 |
| MD5 | 7f388af29b64699ad4888ac9324f49e7 |
| BLAKE2b-256 | bc656c0eacfc9f5ed01a1ef47d8e49b1e5df3a5ed591d712af679d3d8359a3d4 |