A fast tool for crawling websites and compiling PDFs of their pages

Project description

InSite

by Bloom Research

InSite is a Python module for crawling websites and compiling PDFs of their pages. It's primarily intended for crawling code documentation websites to download PDFs for offline knowledge supplementation and RAG implementations in LLMs.

Features

  • Efficient parallel web crawling with Playwright
  • Smart link discovery even for non-standard link formats
  • Content filtering with positive and negative filter patterns
  • Proper media rendering before PDF conversion
  • Hierarchical PDF organization based on URL structure
  • PDF merging capability for creating comprehensive documentation

Installation

pip install insite

Requirements

  • Python 3.7+
  • Playwright
  • pypdf

After installation, you'll need to install the Playwright browsers:

playwright install

Usage Examples

Basic Web Crawling

import asyncio
from insite import InsiteScraper

async def main():
    # Create a scraper for a documentation site
    scraper = InsiteScraper("https://docs.python.org/3/")
    
    # Get all links on the site
    links = await scraper()
    
    print(f"Found {len(links)} links")
    
asyncio.run(main())

Converting Pages to PDFs

import asyncio
from insite import InsiteScraper, InsiteProcessor

async def main():
    # First, get all links from a site
    scraper = InsiteScraper("https://docs.python.org/3/library/")
    links = await scraper()
    
    # Then convert them to PDFs
    processor = InsiteProcessor(output_dir="python_docs")
    successes, failures = await processor.process_links(links)
    
    print(f"Successfully created {successes} PDFs")
    
    # Optionally create a single merged PDF
    master_file = processor.merge_to_masterfile("python_library_docs.pdf")
    print(f"Created master file: {master_file}")
    
asyncio.run(main())

Filtering Content

import asyncio
import re
from insite import InsiteScraper, InsiteProcessor

async def main():
    # Only include specific sections and exclude others
    positive_filters = ['/library/']  # Only library documentation 
    negative_filters = [
        '/archives/',                 # Skip archived content
        re.compile(r'\.(jpg|gif)$'),  # Skip images
    ]
    
    scraper = InsiteScraper(
        "https://docs.python.org/3/",
        max_concurrent=10,  # Use 10 concurrent workers
        debug=True,         # Enable debug output
        positive_filters=positive_filters,
        negative_filters=negative_filters
    )
    
    links = await scraper()
    
    # Process the filtered links
    processor = InsiteProcessor(output_dir="python_library_docs")
    await processor.process_links(links)
    
asyncio.run(main())

Command-Line Usage

The module includes a command-line tool for quick documentation scraping:

python -m insite.cli --url https://docs.python.org/3/ --output python_docs --max-pages 100 --create-master

License

GNU General Public License - GPLv3


Download files

Download the file for your platform.

Source Distribution

insite-0.1.0.tar.gz (25.0 kB)

Built Distribution

insite-0.1.0-py3-none-any.whl (25.6 kB)

File details

Details for the file insite-0.1.0.tar.gz.

File metadata

  • Download URL: insite-0.1.0.tar.gz
  • Upload date:
  • Size: 25.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for insite-0.1.0.tar.gz

  • SHA256: 7a8f0b4c5e028239a335ff5a845e1df2a18be6c86042e2e64b0f9d96cf240f96
  • MD5: 520d032b51e351a278c6a04de2847be4
  • BLAKE2b-256: f8d40bd914873a54468787091ba5b82a2f6d15d28e48e0c1d2121054dee009d7
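The digests above can be used to check a downloaded archive before installing it. A minimal sketch using Python's standard hashlib (the archive file name and its location in the current directory are assumptions; MD5 or BLAKE2b verification works the same way with the matching hashlib constructor):

```python
import hashlib

# Published SHA256 digest for insite-0.1.0.tar.gz (from the table above)
EXPECTED = "7a8f0b4c5e028239a335ff5a845e1df2a18be6c86042e2e64b0f9d96cf240f96"

def sha256_of(path, chunk_size=8192):
    """Return the SHA256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (assumes the archive was downloaded to the current directory):
#   sha256_of("insite-0.1.0.tar.gz") == EXPECTED
```

Reading in chunks keeps memory use constant regardless of archive size.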

File details

Details for the file insite-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: insite-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 25.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for insite-0.1.0-py3-none-any.whl

  • SHA256: 63fc323b97dae66c13816214aa710a8f43b2351f7cb2200c013d844d07c3e17e
  • MD5: 80243cdad036885b9f19b82b50be45dc
  • BLAKE2b-256: 3995d7257f0e728b1e3024a551f1097b57aa33caf2605fecff92c6cf0d5a308c
