
Summarize web pages and PDFs with Google Gemini


AI Knowledge Summarizer

aiknowledge packages the core logic of the original notebook as a reusable library suitable for publishing on PyPI. It classifies each incoming URL as either a standard web page or a PDF document, extracts text and images, and sends them to Google Gemini for a structured summary.

Installation

The project uses Playwright for browser automation. Install the Python package and the Chromium browser binaries:

pip install aiknowledge
playwright install chromium

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the package metadata.

Usage

import logging
from aiknowledge import summarize_url

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logger,
)

print(result.summary)

Key features:

  • Automatic detection of PDF vs. HTML content.
  • Smart truncation of large text blocks and screenshot size management for web pages.
  • PDF rendering and text extraction powered by PyMuPDF.
  • Customizable logging: pass any logging.Logger instance or rely on the built-in no-op logger.
  • Configurable Gemini prompt, model selection, and request limits.
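The PDF vs. HTML detection listed above can be illustrated with a small standalone sketch. This is not aiknowledge's actual implementation: the helper name classify_url and the rule of checking the Content-Type header first and the path extension second are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def classify_url(url: str, content_type: Optional[str] = None) -> str:
    """Return "pdf" or "html" for a URL.

    Hypothetical helper for illustration; not part of the aiknowledge API.
    """
    # Prefer the server-reported media type when the caller has one.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return "pdf"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    # Otherwise fall back to the path extension, ignoring query and fragment.
    path = urlparse(url).path.lower()
    return "pdf" if path.endswith(".pdf") else "html"
```

A header-first check handles PDFs served from URLs without a .pdf suffix, while the extension fallback avoids a network round trip when no header is available.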

Configuration Options

summarize_url accepts several optional keyword arguments:

  • prompt: supply a custom Gemini prompt string. The default prompt produces an English analyst-style summary.
  • max_chars: maximum number of characters retained from the extracted text (default 6000).
  • max_image_mb: per-image size ceiling in megabytes for web page screenshots (default 4.0).
  • max_pdf_pages: number of PDF pages to process (default 5).
  • request_timeout: timeout in seconds used for HTTP and Playwright navigation (default 20).
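To make the effect of max_chars concrete, here is a minimal sketch of one way to cap extracted text. The function truncate_text and its habit of backing up to a nearby word boundary are assumptions for illustration; the library's own truncation strategy may differ.

```python
def truncate_text(text: str, max_chars: int = 6000) -> str:
    """Cap text at max_chars characters (hypothetical helper, not the aiknowledge API)."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so a word is not split, but only when that
    # space lies in the final 20% of the cut; otherwise keep the hard cut.
    last_space = cut.rfind(" ")
    if last_space > max_chars * 0.8:
        cut = cut[:last_space]
    return cut.rstrip()
```

The result is never longer than max_chars, so a caller can rely on the limit when budgeting the size of the request sent to the model.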

The Google API key can be passed explicitly through the google_api_key argument or read from the GOOGLE_API_KEY environment variable.
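The key lookup can be sketched as follows. The helper resolve_api_key is hypothetical, and the precedence shown (an explicit argument wins over the environment variable) is an assumption rather than documented behavior.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Pick the API key to use (hypothetical helper, not the aiknowledge API)."""
    # Assumed precedence: explicit argument first, then GOOGLE_API_KEY.
    key = explicit_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise ValueError(
            "No API key: pass google_api_key or set GOOGLE_API_KEY"
        )
    return key
```

Failing early with a clear message avoids launching a browser or downloading a PDF only to have the final Gemini request rejected.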

Development

Install the local package in editable mode and the Playwright browser binary:

pip install -e .
playwright install chromium

Then byte-compile the sources as a basic syntax check:

python -m compileall src
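The same check can be driven from Python, for example inside a test suite. The wrapper name sources_compile is hypothetical; compileall.compile_dir is the standard-library function behind the command above.

```python
import compileall


def sources_compile(source_dir: str = "src") -> bool:
    """Return True when every .py file under source_dir byte-compiles."""
    # quiet=1 prints only errors; force=True recompiles even when cached
    # bytecode looks up to date, so stale caches cannot hide a failure.
    return bool(compileall.compile_dir(source_dir, quiet=1, force=True))
```

Note that this only catches syntax errors; it does not type-check or lint the code.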

License

MIT
