Summarize web pages and PDFs with Google Gemini

AI Knowledge Summarizer

AutoWebPdfSummarizer packages the core logic of the original notebook as a reusable library published on PyPI. It classifies each incoming URL as either a standard web page or a PDF document, extracts text and imagery, and sends that material to Google Gemini for a structured summary.

Installation

The project uses Playwright for browser automation. Install the Python package and the Chromium browser binaries:

pip install AutoWebPdfSummarizer
playwright install chromium

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the package metadata.
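
To confirm that the Python-level dependencies resolved after installation, a small check like the following may help. It is only a sketch: missing_modules is a hypothetical helper, not part of this package, and the import names playwright and fitz (the classic import name for PyMuPDF) are assumptions about your environment. It does not verify that the Chromium binaries were downloaded.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the import path."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]


def report_dependencies() -> None:
    """Print a one-line status for the modules this README mentions (assumed names)."""
    missing = missing_modules(["playwright", "fitz"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```

Call report_dependencies() from a Python shell in the same environment where you ran pip.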

Usage

import logging
from AutoWebPdfSummarizer import summarize_url

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logger,
)

print(result.summary)

Key features:

  • Automatic detection of PDF vs. HTML content.
  • Smart truncation of large text blocks and screenshot size management for web pages.
  • PDF rendering and text extraction powered by PyMuPDF.
  • Customizable logging: pass any logging.Logger instance or rely on the built-in no-op logger.
  • Configurable Gemini prompt, model selection, and request limits.
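
The first two features can be illustrated with a minimal sketch. This is not the library's actual implementation: looks_like_pdf and truncate_text are hypothetical helpers, and the order of the checks (magic bytes, then Content-Type, then file extension) and the word-boundary truncation are assumptions about one reasonable approach.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_pdf(url: str, content_type: Optional[str] = None, head: bytes = b"") -> bool:
    """Heuristically decide whether a URL points at a PDF (hypothetical helper)."""
    # 1. Strongest signal: PDF files begin with the magic bytes "%PDF-".
    if head.lstrip().startswith(b"%PDF-"):
        return True
    # 2. Next: the server-declared media type, ignoring parameters such as charset.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
        if media_type in ("text/html", "application/xhtml+xml"):
            return False
    # 3. Fallback: the file extension in the URL path (the query string is ignored).
    return urlparse(url).path.lower().endswith(".pdf")


def truncate_text(text: str, max_chars: int = 6000) -> str:
    """Cut text to at most max_chars, preferring to stop at a space (hypothetical helper)."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so the final word is not split in half.
    boundary = cut.rfind(" ")
    if boundary > 0:
        cut = cut[:boundary]
    return cut.rstrip()
```

In this sketch, head would be the first few bytes of the response body and content_type the value of the Content-Type response header, both fetched by the caller.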

Configuration Options

summarize_url accepts several optional keyword arguments:

  • prompt: supply a custom Gemini prompt string. The default prompt produces an English analyst-style summary.
  • max_chars: maximum number of characters retained from the extracted text (default 6000).
  • max_image_mb: per-image size ceiling in megabytes for web page screenshots (default 4.0).
  • max_pdf_pages: number of PDF pages to process (default 5).
  • request_timeout: timeout in seconds applied to HTTP requests and Playwright navigation (default 20).
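
As an illustration, the options above could be collected in a dictionary and passed as keyword arguments. The keyword names come from the list above; the values, the example URL, and the helper names build_summary_options and run_demo are illustrative assumptions, and the call is assumed to pick up the API key from the environment.

```python
def build_summary_options() -> dict:
    """Collect the optional keyword arguments documented above (values are examples)."""
    return {
        "prompt": "Summarize the document in five bullet points for a technical reader.",
        "max_chars": 4000,      # keep less text than the default 6000
        "max_image_mb": 2.0,    # tighter screenshot ceiling than the default 4.0
        "max_pdf_pages": 3,     # process fewer pages than the default 5
        "request_timeout": 30,  # seconds; more generous than the default 20
    }


def run_demo(url: str = "https://example.com/report.pdf") -> None:
    """Call summarize_url with the options above; needs the package and a valid API key."""
    # Imported here so the options helper can be used without the package installed.
    from AutoWebPdfSummarizer import summarize_url

    result = summarize_url(url, **build_summary_options())
    print(result.summary)
```

Calling run_demo() performs a real network request and a Gemini API call, so run it only with a valid key configured.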

The Google API key can be provided explicitly or via the GOOGLE_API_KEY environment variable.
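
A minimal sketch of how such a lookup might work is shown below. resolve_api_key is a hypothetical helper, and the precedence (explicit argument first, then the environment variable) is an assumption rather than documented behavior of the library.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return the explicit key if given, else GOOGLE_API_KEY from the environment."""
    # Assumed precedence: an explicitly passed key wins over the environment.
    key = (explicit or os.environ.get("GOOGLE_API_KEY", "")).strip()
    if not key:
        raise ValueError("No Google API key: pass one explicitly or set GOOGLE_API_KEY.")
    return key
```

Failing early with a clear error, as in this sketch, avoids launching a browser session only to have the Gemini request rejected later.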

Development

Install the package locally in editable mode, then install the Playwright Chromium binaries:

pip install -e .
playwright install chromium

Then run a basic syntax check by byte-compiling the sources:

python -m compileall src
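
The same check can be run from Python, for example inside a test or a pre-commit script. The sketch below uses the standard-library compileall module; sources_compile is a hypothetical wrapper, and src is assumed to be the source directory as in the command above.

```python
import compileall


def sources_compile(path: str = "src") -> bool:
    """Byte-compile every .py file under path and report whether all succeeded."""
    # quiet=1 prints only errors; compile_dir returns a false value if any file fails.
    return bool(compileall.compile_dir(path, quiet=1))
```

Note that this only catches syntax errors; it does not run type checks or linters.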

License

MIT
