A lightning-fast tool for crawling websites and compiling PDFs of their pages

InSite

by Bloom Research

InSite is a Python module for crawling websites and compiling PDFs of their pages. It is primarily intended for crawling code-documentation sites and downloading their pages as PDFs, either for offline reference or as source material for retrieval-augmented generation (RAG) pipelines built on LLMs.

Features

  • Efficient parallel web crawling with Playwright
  • Smart link discovery even for non-standard link formats
  • Content filtering with positive and negative filter patterns
  • Proper media rendering before PDF conversion
  • Hierarchical PDF organization based on URL structure
  • PDF merging capability for creating comprehensive documentation
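The "hierarchical PDF organization" point is easiest to picture with a small sketch. The helper below is not part of InSite's API; it is a hypothetical illustration of one plausible mapping from a page URL to a nested PDF path, assuming the output tree mirrors the URL's path segments.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def url_to_pdf_path(url: str, output_dir: str = "docs") -> str:
    """Map a page URL to a nested PDF path mirroring the URL structure.

    Hypothetical helper for illustration only; not InSite's actual API.
    """
    parsed = urlparse(url)
    # Use the URL's path segments as directories; name the file after the
    # last segment (or "index" for a bare directory URL).
    segments = [s for s in PurePosixPath(parsed.path).parts if s not in ("/", "")]
    name = segments[-1] if segments else "index"
    name = name.rsplit(".", 1)[0] or "index"  # drop .html and similar suffixes
    return str(PurePosixPath(output_dir, parsed.netloc, *segments[:-1], name + ".pdf"))

print(url_to_pdf_path("https://docs.python.org/3/library/asyncio.html"))
# docs/docs.python.org/3/library/asyncio.pdf
```

With a layout like this, each site's pages land under a directory tree that matches how the documentation itself is organized, which keeps merged output in a sensible order.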

Installation

pip install insite

Requirements

  • Python 3.7+
  • Playwright
  • pypdf

After installation, you'll need to install the Playwright browsers:

playwright install

Usage Examples

Basic Web Crawling

import asyncio
from insite import InsiteScraper

async def main():
    # Create a scraper for a documentation site
    scraper = InsiteScraper("https://docs.python.org/3/")
    
    # Get all links on the site
    links = await scraper()
    
    print(f"Found {len(links)} links")
    
asyncio.run(main())

Converting Pages to PDFs

import asyncio
from insite import InsiteScraper, InsiteProcessor

async def main():
    # First, get all links from a site
    scraper = InsiteScraper("https://docs.python.org/3/library/")
    links = await scraper()
    
    # Then convert them to PDFs
    processor = InsiteProcessor(output_dir="python_docs")
    successes, failures = await processor.process_links(links)
    
    print(f"Successfully created {successes} PDFs")
    
    # Optionally create a single merged PDF
    master_file = processor.merge_to_masterfile("python_library_docs.pdf")
    print(f"Created master file: {master_file}")
    
asyncio.run(main())

Filtering Content

import asyncio
import re
from insite import InsiteScraper, InsiteProcessor

async def main():
    # Only include specific sections and exclude others
    positive_filters = ['/library/']  # Only library documentation 
    negative_filters = [
        '/archives/',                 # Skip archived content
        re.compile(r'\.(jpg|gif)$'),  # Skip images
    ]
    
    scraper = InsiteScraper(
        "https://docs.python.org/3/",
        max_concurrent=10,  # Use 10 concurrent workers
        debug=True,         # Enable debug output
        positive_filters=positive_filters,
        negative_filters=negative_filters
    )
    
    links = await scraper()
    
    # Process the filtered links
    processor = InsiteProcessor(output_dir="python_library_docs")
    await processor.process_links(links)
    
asyncio.run(main())
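The filter semantics suggested by the example above can be sketched as follows. This is an assumption about the behaviour, not InSite's actual implementation: plain strings are treated as substring matches, compiled patterns as regex searches, and a link must match at least one positive filter (when any are given) and no negative filter.

```python
import re

def _matches(url: str, flt) -> bool:
    # Compiled regexes use .search(); plain strings use substring containment.
    if isinstance(flt, re.Pattern):
        return flt.search(url) is not None
    return flt in url

def link_passes(url: str, positive=None, negative=None) -> bool:
    """Hypothetical filter check mirroring the positive/negative lists above."""
    if positive and not any(_matches(url, f) for f in positive):
        return False
    return not any(_matches(url, f) for f in (negative or []))

# With the filters from the snippet above:
positive = ["/library/"]
negative = ["/archives/", re.compile(r"\.(jpg|gif)$")]
print(link_passes("https://docs.python.org/3/library/asyncio.html", positive, negative))  # True
print(link_passes("https://docs.python.org/3/library/photo.jpg", positive, negative))     # False
```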

Command-Line Usage

The module includes a command-line tool for quick documentation scraping:

python -m insite.cli --url https://docs.python.org/3/ --output python_docs --max-pages 100 --create-master

License

GNU General Public License v3 (GPLv3)
