Summarize web pages and PDFs with Google Gemini

AI Knowledge Summarizer

AutoWebPdfSummarizer packages the core logic of the original notebook as a reusable library that can be installed from PyPI. It classifies each incoming URL as either a standard web page or a PDF document, extracts the text and images, and sends them to Google Gemini for a structured summary.

Installation

The project uses Playwright for browser automation. Install the Python package and the Chromium browser binaries:

pip install AutoWebPdfSummarizer
playwright install chromium

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the package metadata.

Usage

import logging
from AutoWebPdfSummarizer import summarize_url

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logger,
)

print(result.summary)
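
The logger argument is optional; when it is omitted the library is described as falling back to a built-in no-op logger. The package's internal implementation is not shown here, so the following is only a minimal sketch of how such a fallback could be built with the standard library. The function name make_noop_logger and the logger name are hypothetical.

```python
import logging


def make_noop_logger(name: str = "autowebpdfsummarizer.noop") -> logging.Logger:
    """Hypothetical helper: return a logger that silently discards records."""
    logger = logging.getLogger(name)
    # Attach a NullHandler once so that records are swallowed.
    if not logger.handlers:
        logger.addHandler(logging.NullHandler())
    # Stop records from bubbling up to the root logger's handlers.
    logger.propagate = False
    return logger


quiet = make_noop_logger()
quiet.info("this message goes nowhere")
```

Passing a real logging.Logger, as in the example above, replaces this silent behavior with whatever handlers the caller has configured.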

Key features:

  • Automatic detection of PDF vs. HTML content.
  • Smart truncation of large text blocks and screenshot size management for web pages.
  • PDF rendering and text extraction powered by PyMuPDF.
  • Customizable logging: pass any logging.Logger instance or rely on the built-in no-op logger.
  • Configurable Gemini prompt, model selection, and request limits.
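
The README does not spell out how PDF detection works. As a rough illustration only, one common approach is to trust a Content-Type header when one is available and otherwise fall back to the URL path extension. The helper looks_like_pdf below is hypothetical and is not part of the package's documented API.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_pdf(url: str, content_type: Optional[str] = None) -> bool:
    """Hypothetical helper: guess whether a URL points at a PDF document."""
    # A Content-Type header, when available, is more reliable than the URL.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
        if media_type in ("text/html", "application/xhtml+xml"):
            return False
    # Fall back to the path extension, ignoring query string and fragment.
    return urlparse(url).path.lower().endswith(".pdf")


print(looks_like_pdf("https://example.com/report.pdf?download=1"))  # True
print(looks_like_pdf("https://example.com/article"))                # False
```

A header-first check avoids misclassifying HTML viewer pages whose URLs happen to end in .pdf.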

Configuration Options

summarize_url accepts several optional keyword arguments:

  • prompt: supply a custom Gemini prompt string. The default prompt produces an English analyst-style summary.
  • max_chars: maximum number of characters retained from the extracted text (default 6000).
  • max_image_mb: per-image size ceiling in megabytes for web page screenshots (default 4.0).
  • max_pdf_pages: number of PDF pages to process (default 5).
  • request_timeout: timeout in seconds used for HTTP and Playwright navigation (default 300).
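
To make the max_chars and max_image_mb limits concrete, here is a minimal sketch of what such limits could look like. The helpers truncate_text and image_within_limit are hypothetical, the word-boundary cut is an assumption, and treating one megabyte as 1024 * 1024 bytes is also an assumption; the package may apply its limits differently.

```python
def truncate_text(text: str, max_chars: int = 6000) -> str:
    """Hypothetical helper: keep at most max_chars characters of text."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Prefer to cut at the last space so a word is not split in half.
    cut = clipped.rfind(" ")
    return clipped[:cut] if cut > 0 else clipped


def image_within_limit(image_bytes: bytes, max_image_mb: float = 4.0) -> bool:
    """Hypothetical helper: check an encoded image against a size ceiling."""
    # Assumes 1 MB = 1024 * 1024 bytes.
    return len(image_bytes) <= max_image_mb * 1024 * 1024


print(truncate_text("hello world again", max_chars=13))  # hello world
```

Lower values for these limits reduce the amount of material sent to Gemini, at the cost of summarizing less of the source.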

The Google API key can be provided explicitly or via the GOOGLE_API_KEY environment variable.
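
The lookup order is not documented beyond the sentence above. A plausible sketch, assuming that an explicitly passed key takes precedence over the environment variable, is shown below. The function resolve_api_key is hypothetical.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Hypothetical helper: pick the API key from an argument or the environment."""
    # Assumption: an explicit argument wins over GOOGLE_API_KEY.
    key = explicit_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise ValueError(
            "No API key: pass google_api_key or set GOOGLE_API_KEY"
        )
    return key
```

Reading the key from the environment keeps it out of source code and shell history, for example by running export GOOGLE_API_KEY=... before starting Python.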

Development

Install the local package in editable mode and the Playwright browser binary:

pip install -e .
playwright install chromium

Then byte-compile the sources as a basic syntax check:

python -m compileall src

License

MIT
