A simple Python package to extract text content from a webpage.

Project description

webpage2content

A simple Python package that takes a web page (by URL) and extracts its main human-readable content. It uses an LLM, accessed through the OpenAI API, to strip the boilerplate that isn't part of the page's main content: headers, footers, copyright and accessibility notices, advertisements, login and search controls, and so on.

Installation

pip install webpage2content

Usage

Python

import openai
from webpage2content import webpage2content

text = webpage2content("http://mysite.com", openai.OpenAI(api_key="your_openai_api_key"))
print(text)

CLI

You can invoke webpage2content from the command line.

webpage2content https://slashdot.org/

If you don't have your OPENAI_API_KEY environment variable set, you can pass it to the CLI invocation as a second argument.

webpage2content https://slashdot.org/ sk-ABCD1234

You can also specify the OpenAI organization ID if needed.

webpage2content https://slashdot.org/ sk-ABCD1234 org-5678

Additional CLI Options

  • Logging Level: You can set the logging level using the -l or --log-level option.

    webpage2content -l DEBUG https://slashdot.org/
    
  • Version: Display the version number of the package.

    webpage2content -v
    
  • Help: Display help information.

    webpage2content -h
    

Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key.
  • OPENAI_ORGANIZATION_ID: Your OpenAI organization ID (optional).
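
The same two variables can also drive the Python API. The following is a minimal sketch, not part of the package: the helper names `openai_client_kwargs` and `extract_content` are hypothetical, and it assumes the `openai.OpenAI` client accepts the organization ID through its `organization` keyword.

```python
import os


def openai_client_kwargs(env=None):
    """Build keyword arguments for openai.OpenAI from environment variables.

    Hypothetical helper for illustration; not provided by webpage2content.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    kwargs = {"api_key": api_key}
    # The organization ID is optional, so only pass it when present.
    org_id = env.get("OPENAI_ORGANIZATION_ID")
    if org_id:
        kwargs["organization"] = org_id
    return kwargs


def extract_content(url, env=None):
    """Fetch a page's main content using credentials taken from the environment."""
    # Imported lazily so the helper above can be used without these packages.
    import openai
    from webpage2content import webpage2content

    client = openai.OpenAI(**openai_client_kwargs(env))
    return webpage2content(url, client)
```

Calling `extract_content("https://example.com/")` would then behave like the Usage example above, with the key read from the environment instead of being written into the source.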

.env File Support

The CLI will honor .env files for setting environment variables. Create a .env file in the same directory with the following content:

OPENAI_API_KEY=your_openai_api_key
OPENAI_ORGANIZATION_ID=your_openai_organization_id
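
To make the file format concrete, here is a simplified stand-in for a .env reader. It is only a sketch: how the CLI actually loads the file is not described here, and the function name `parse_dotenv` is hypothetical. It handles plain `KEY=value` lines, blank lines, and `#` comments, nothing more.

```python
def parse_dotenv(text):
    """Parse simple KEY=value lines into a dict (illustrative only)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = "OPENAI_API_KEY=your_openai_api_key\nOPENAI_ORGANIZATION_ID=your_openai_organization_id\n"
settings = parse_dotenv(sample)
```

Here `settings` holds the two variable names from the example file as dictionary keys.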

Example

webpage2content -l INFO https://example.com/ sk-ABCD1234 org-5678

This command will extract the main content from https://example.com/ using the provided OpenAI API key and organization ID, with logging set to INFO level.
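
If you want to drive the CLI from a script, the argument order shown above can be assembled programmatically. This is a sketch under assumptions: `build_command` and `run_cli` are hypothetical helpers, and `run_cli` assumes the CLI writes the extracted text to standard output, which this page does not state.

```python
import subprocess


def build_command(url, api_key=None, org_id=None, log_level=None):
    """Assemble the argument list in the order the CLI examples use."""
    # The organization ID is the third positional argument, so it cannot
    # be supplied without the API key in front of it.
    if org_id and not api_key:
        raise ValueError("org_id is positional and requires api_key before it")
    cmd = ["webpage2content"]
    if log_level:
        cmd += ["-l", log_level]
    cmd.append(url)
    if api_key:
        cmd.append(api_key)
    if org_id:
        cmd.append(org_id)
    return cmd


def run_cli(url, **options):
    """Run the CLI and return its standard output (assumed to be the content)."""
    result = subprocess.run(
        build_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Arguments passed on the command line can be visible to other users in the process list, so setting OPENAI_API_KEY in the environment or a .env file is usually the safer choice for scripted use.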


Download files

Source Distribution

webpage2content-1.3.4.tar.gz (16.9 kB)

Uploaded Source

Built Distribution

webpage2content-1.3.4-py3-none-any.whl (16.6 kB)

Uploaded Python 3

File details

Details for the file webpage2content-1.3.4.tar.gz.

File metadata

  • Download URL: webpage2content-1.3.4.tar.gz
  • Size: 16.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for webpage2content-1.3.4.tar.gz
Algorithm Hash digest
SHA256 d962dcec71cbcd05cebbd82566bddd6d9a12346bf0fa6b78e8a566320cfc9d05
MD5 e7ccaabfd1985c2a9c06f49329071176
BLAKE2b-256 32f61b92f5244e82ccaca7a696b32ab55df0e2e2577f39c55e74d998fe677313

File details

Details for the file webpage2content-1.3.4-py3-none-any.whl.

File hashes

Hashes for webpage2content-1.3.4-py3-none-any.whl
Algorithm Hash digest
SHA256 fc4df43ea8e43758f37a3203b090272f71461879c11782806050df01fa22647a
MD5 5629ce2f2acc7633ee7a1a9345a4da73
BLAKE2b-256 6362d723bf5eeaa97b96252952c34b491c5b4b9e17ee9e8b58edd5ccff721830
