Summarize web pages and PDFs with Google Gemini

AI Knowledge Summarizer

AutoWebPdfSummarizer packages the core logic from the original notebook into a reusable library distributed on PyPI. It classifies each incoming URL as either a standard web page or a PDF document, extracts text and images, and sends them to Google Gemini for a structured summary.

Installation

The project uses Playwright for browser automation. Install the Python package and the Chromium browser binaries:

pip install AutoWebPdfSummarizer
playwright install chromium

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the package metadata.

Usage

import logging
from AutoWebPdfSummarizer import summarize_url

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
)

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logging.getLogger("demo"),
    min_gemini_interval=2.0,
)

print(result.summary)

Key features:

  • Automatic detection of PDF vs. HTML content.
  • Smart truncation of large text blocks and screenshot size management for web pages.
  • PDF rendering and text extraction powered by PyMuPDF.
  • Customizable logging: pass any logging.Logger instance or rely on the built-in no-op logger.
  • Configurable Gemini prompt, model selection, and request limits.
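
The PDF vs. HTML detection listed above can be pictured roughly as follows. This is a minimal sketch, not the library's actual implementation: the helper name classify_url and its content_type parameter are hypothetical, and the real package may use different signals.

```python
from typing import Optional
from urllib.parse import urlparse


def classify_url(url: str, content_type: Optional[str] = None) -> str:
    """Return "pdf" or "html" for a URL (hypothetical helper, not the library's API)."""
    # A Content-Type header, when available, is the stronger signal.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return "pdf"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    # Otherwise fall back to the path suffix, ignoring query string and fragment.
    path = urlparse(url).path.lower()
    return "pdf" if path.endswith(".pdf") else "html"
```

Checking the header first matters for download links whose path carries no .pdf suffix; the suffix check only acts as a fallback.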

Enabling detailed Gemini call tracing

Set the logger level to DEBUG to surface the additional timings that wrap every model.generate_content call:

logging.getLogger("demo").setLevel(logging.DEBUG)

At DEBUG level, the summarizer reports how long each Gemini request takes and logs any exception before it propagates, making it easier to diagnose stalls around the "Sending %d parts to Gemini" message.
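
To illustrate the kind of tracing described here, the sketch below wraps an arbitrary call, logs its duration at DEBUG, and logs any exception before re-raising it. The helper name timed_call is hypothetical and the library's internal wrapper may look different.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed_call(logger: logging.Logger, label: str, func: Callable[[], T]) -> T:
    """Run func(), logging its duration at DEBUG and any exception before re-raising."""
    start = time.perf_counter()
    try:
        result = func()
    except Exception:
        elapsed = time.perf_counter() - start
        # logger.exception records the traceback at ERROR level.
        logger.exception("%s failed after %.2fs", label, elapsed)
        raise
    elapsed = time.perf_counter() - start
    logger.debug("%s finished in %.2fs", label, elapsed)
    return result
```

In this sketch a Gemini request would be passed in as a zero-argument callable, for example a lambda around model.generate_content.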

Running inside Google Colab with a debugger

Colab notebooks can forward debug sessions using debugpy. Install it and start a listener before invoking the summarizer:

!pip install debugpy

import debugpy

debugpy.listen(("0.0.0.0", 5678))
print("debugpy is listening on port 5678")
debugpy.wait_for_client()  # optional: pause until your IDE attaches

With the listener active, attach your local IDE (VS Code, PyCharm, etc.) to the running Colab kernel using the public URL and port 5678. Once connected, you can set breakpoints in the notebook and inspect the Gemini calls while the enhanced logging streams to the notebook output.

Configuration Options

summarize_url accepts several optional keyword arguments:

  • prompt: supply a custom Gemini prompt string. The default prompt produces an English analyst-style summary.
  • max_chars: maximum number of characters retained from the extracted text (default 6000).
  • max_image_mb: per-image size ceiling in megabytes for web page screenshots (default 4.0).
  • max_pdf_pages: number of PDF pages to process (default 5).
  • request_timeout: timeout in seconds used for HTTP and Playwright navigation (default 300).
  • min_gemini_interval: minimum delay in seconds enforced between Gemini requests (default 2.0).
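
As an illustration of what min_gemini_interval implies, here is a minimal throttle sketch that enforces a minimum gap between consecutive requests. The class name MinIntervalThrottle and its injectable clock and sleep parameters are assumptions made for this example, not part of the package.

```python
import time
from typing import Callable


class MinIntervalThrottle:
    """Block until at least min_interval seconds have passed since the previous call."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # no request has been made yet

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record the time at which the request is allowed to proceed.
        self._last_call = self._clock()
        return delay
```

Calling wait() immediately before each request keeps requests at least min_interval seconds apart; the injectable clock and sleep exist only so the behavior can be tested without real delays.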

For streaming chunk-by-chunk updates when summarizing PDFs, use summarize_url_stream, which yields ChunkSummary objects for each processed chunk before emitting the final SummarizationResult.
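
The consumption pattern for such a stream might look like the sketch below. It deliberately does not call the real summarize_url_stream: the two dataclasses and fake_stream are stand-ins defined only for this example, and the fields of the real ChunkSummary and SummarizationResult may differ.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class ChunkSummary:  # stand-in; the real class's fields may differ
    index: int
    text: str


@dataclass
class SummarizationResult:  # stand-in; only `summary` appears in the usage example
    summary: str


def fake_stream(chunks: List[str]) -> Iterator[Union[ChunkSummary, SummarizationResult]]:
    """Hypothetical stand-in for summarize_url_stream: chunk updates, then a final result."""
    partials = []
    for i, chunk in enumerate(chunks):
        partial = chunk[:20]  # placeholder for a per-chunk model call
        partials.append(partial)
        yield ChunkSummary(index=i, text=partial)
    yield SummarizationResult(summary=" ".join(partials))


def consume(stream):
    """Print progress for each chunk and return the final result."""
    final = None
    for item in stream:
        if isinstance(item, ChunkSummary):
            print(f"chunk {item.index}: {item.text}")
        else:
            final = item
    return final
```

Dispatching on the item type lets a caller show progress as chunks arrive while still ending up with a single final result object.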

The Google API key can be provided explicitly or via the GOOGLE_API_KEY environment variable.
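
One plausible way to resolve the key is sketched below. The helper name resolve_api_key is hypothetical, and the precedence shown (an explicit argument wins over the environment variable) is an assumption rather than documented behavior.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else GOOGLE_API_KEY from the environment."""
    # Assumption: an explicitly passed key takes precedence over the environment.
    key = explicit_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise ValueError("No API key: pass one explicitly or set GOOGLE_API_KEY")
    return key
```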

Development

Install the local package in editable mode and the Playwright browser binary:

pip install -e .
playwright install chromium

Then run a basic syntax check by byte-compiling the sources:

python -m compileall src

License

MIT
