A fast tool for crawling websites and compiling PDFs of their pages

Project description

InSite

by Bloom Research

InSite is a Python module for crawling websites and compiling PDFs of their pages. It's primarily intended for crawling code documentation websites to download PDFs for offline knowledge supplementation and RAG implementations in LLMs.

Features

  • Efficient parallel web crawling with Playwright
  • Smart link discovery even for non-standard link formats
  • Content filtering with positive and negative filter patterns
  • Proper media rendering before PDF conversion
  • Hierarchical PDF organization based on URL structure
  • PDF merging capability for creating comprehensive documentation

Installation

pip install insite

Requirements

  • Python 3.7+
  • Playwright
  • pypdf

After installation, you'll need to install the Playwright browsers:

playwright install

Usage Examples

Basic Web Crawling

import asyncio
from insite import InsiteScraper

async def main():
    # Create a scraper for a documentation site
    scraper = InsiteScraper("https://docs.python.org/3/")
    
    # Get all links on the site
    links = await scraper()
    
    print(f"Found {len(links)} links")
    
asyncio.run(main())

Converting Pages to PDFs

import asyncio
from insite import InsiteScraper, InsiteProcessor

async def main():
    # First, get all links from a site
    scraper = InsiteScraper("https://docs.python.org/3/library/")
    links = await scraper()
    
    # Then convert them to PDFs
    processor = InsiteProcessor(output_dir="python_docs")
    successes, failures = await processor.process_links(links)
    
    print(f"Successfully created {successes} PDFs")
    
    # Optionally create a single merged PDF
    master_file = processor.merge_to_masterfile("python_library_docs.pdf")
    print(f"Created master file: {master_file}")
    
asyncio.run(main())

Filtering Content

import asyncio
import re
from insite import InsiteScraper, InsiteProcessor

async def main():
    # Only include specific sections and exclude others
    positive_filters = ['/library/']  # Only library documentation 
    negative_filters = [
        '/archives/',                 # Skip archived content
        re.compile(r'\.(jpg|gif)$'),  # Skip images
    ]
    
    scraper = InsiteScraper(
        "https://docs.python.org/3/",
        max_concurrent=10,  # Use 10 concurrent workers
        debug=True,         # Enable debug output
        positive_filters=positive_filters,
        negative_filters=negative_filters
    )
    
    links = await scraper()
    
    # Process the filtered links
    processor = InsiteProcessor(output_dir="python_library_docs")
    await processor.process_links(links)
    
asyncio.run(main())

Command-Line Usage

The module includes a command-line tool for quick documentation scraping:

python -m insite.cli --url https://docs.python.org/3/ --output python_docs --max-pages 100 --create-master

License

GNU General Public License - GPLv3


Download files

Download the file for your platform.

Source Distribution

insite-0.1.0.tar.gz (25.0 kB)

Built Distribution

insite-0.1.0-py3-none-any.whl (25.6 kB)

File details

Details for the file insite-0.1.0.tar.gz.

File metadata

  • Download URL: insite-0.1.0.tar.gz
  • Upload date:
  • Size: 25.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for insite-0.1.0.tar.gz

  • SHA256: 7a8f0b4c5e028239a335ff5a845e1df2a18be6c86042e2e64b0f9d96cf240f96
  • MD5: 520d032b51e351a278c6a04de2847be4
  • BLAKE2b-256: f8d40bd914873a54468787091ba5b82a2f6d15d28e48e0c1d2121054dee009d7
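The digests above can be used to check a downloaded archive before installing it. A minimal sketch using Python's standard hashlib (the archive file name and its location in the current directory are assumptions; MD5 or BLAKE2b verification works the same way with the matching hashlib constructor):

```python
import hashlib

# Published SHA256 digest for insite-0.1.0.tar.gz (from the table above)
EXPECTED = "7a8f0b4c5e028239a335ff5a845e1df2a18be6c86042e2e64b0f9d96cf240f96"

def sha256_of(path, chunk_size=8192):
    """Return the SHA256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (assumes the archive was downloaded to the current directory):
#   sha256_of("insite-0.1.0.tar.gz") == EXPECTED
```

Reading in chunks keeps memory use constant regardless of archive size.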

File details

Details for the file insite-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: insite-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 25.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for insite-0.1.0-py3-none-any.whl

  • SHA256: 63fc323b97dae66c13816214aa710a8f43b2351f7cb2200c013d844d07c3e17e
  • MD5: 80243cdad036885b9f19b82b50be45dc
  • BLAKE2b-256: 3995d7257f0e728b1e3024a551f1097b57aa33caf2605fecff92c6cf0d5a308c
