A simple Python package to extract text content from a webpage.
webpage2content
A simple Python package that takes a web page (by URL) and extracts its main human-readable content. It uses LLM technology to remove all of the boilerplate webpage cruft (headers, footers, copyright and accessibility notices, advertisements, login and search controls, etc.) that isn't part of the main content of the page.
Installation
pip install webpage2content
Usage
Python
import openai
from webpage2content import webpage2content
text = webpage2content("http://mysite.com", openai.OpenAI(api_key="your_openai_api_key"))
print(text)
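In practice the API key is usually read from the environment rather than hard-coded. The following is a minimal sketch and not part of the package: `extract_page` and its `extractor` parameter are hypothetical names introduced here for illustration. The only calls taken from the usage above are `openai.OpenAI(api_key=...)` and `webpage2content(url, client)`.

```python
import os


def extract_page(url, api_key=None, extractor=None):
    """Return the main content of `url` (hypothetical helper, not part of the package).

    `extractor` is any callable taking (url, api_key). It exists so the helper
    can be exercised offline; when omitted, the real package is used.
    """
    # Assumption: fall back to OPENAI_API_KEY when no key is passed explicitly.
    api_key = api_key or os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("No OpenAI API key supplied and OPENAI_API_KEY is not set")

    if extractor is None:
        # Imported lazily so this module loads even without the packages installed.
        import openai
        from webpage2content import webpage2content

        def extractor(page_url, key):
            return webpage2content(page_url, openai.OpenAI(api_key=key))

    return extractor(url, api_key)
```

Calling `extract_page("http://mysite.com")` with `OPENAI_API_KEY` set would then behave like the two-line example above, while a missing key fails early with a clear error instead of at request time.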
CLI
You can invoke webpage2content from the command line.
webpage2content https://slashdot.org/
If your OPENAI_API_KEY environment variable is not set, you can pass the key to the CLI as a second argument.
webpage2content https://slashdot.org/ sk-ABCD1234
You can also specify the OpenAI organization ID as a third argument if needed.
webpage2content https://slashdot.org/ sk-ABCD1234 org-5678
Additional CLI Options
- Logging Level: You can set the logging level using the -l or --log-level option.
  webpage2content -l DEBUG https://slashdot.org/
- Version: Display the version number of the package.
  webpage2content -v
- Help: Display help information.
  webpage2content -h
Environment Variables
- OPENAI_API_KEY: Your OpenAI API key.
- OPENAI_ORGANIZATION_ID: Your OpenAI organization ID (optional).
.env File Support
The CLI honors .env files for setting environment variables. Create a .env file in the same directory with the following content:
OPENAI_API_KEY=your_openai_api_key
OPENAI_ORGANIZATION_ID=your_openai_organization_id
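This README does not say how the CLI reads .env files. Purely to illustrate the file format, here is a minimal sketch of a parser for simple `KEY=value` lines. `parse_env_text` is a hypothetical name, and the real CLI may well rely on a dedicated library with richer rules (quoting, `export` prefixes, interpolation).

```python
def parse_env_text(text):
    """Parse simple KEY=value lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Ignore lines that are not KEY=value pairs.
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and any surrounding quote characters from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = (
    "OPENAI_API_KEY=your_openai_api_key\n"
    "OPENAI_ORGANIZATION_ID=your_openai_organization_id\n"
)
settings = parse_env_text(sample)
print(settings["OPENAI_API_KEY"])  # prints: your_openai_api_key
```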
Example
webpage2content -l INFO https://example.com/ sk-ABCD1234 org-5678
This command extracts the main content from https://example.com/ using the provided OpenAI API key and organization ID, with logging set to INFO level.
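To drive the CLI from a script, the argument order shown above can be assembled programmatically. This is a sketch under stated assumptions: `build_cli_args` and `run_cli` are hypothetical helpers, they encode only the positional order (URL, API key, organization ID) and the `-l` option described in this README, and `run_cli` assumes the CLI writes the extracted content to standard output, which the README implies but does not state.

```python
import subprocess


def build_cli_args(url, api_key=None, org_id=None, log_level=None):
    """Build the argv list for the webpage2content CLI (hypothetical helper)."""
    args = ["webpage2content"]
    if log_level:
        args += ["-l", log_level]
    args.append(url)
    if api_key:
        args.append(api_key)
        # The organization ID is the third positional argument, so it can
        # only be passed when the API key is also present.
        if org_id:
            args.append(org_id)
    elif org_id:
        raise ValueError("org_id requires api_key on the command line")
    return args


def run_cli(url, **kwargs):
    """Run the CLI and return its standard output as text.

    Assumption: the extracted content is written to stdout.
    """
    result = subprocess.run(
        build_cli_args(url, **kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For instance, `build_cli_args("https://example.com/", "sk-ABCD1234", "org-5678", "INFO")` produces the same argument list as the example command above.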
File details
Details for the file webpage2content-1.3.4.tar.gz.
File metadata
- Download URL: webpage2content-1.3.4.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | d962dcec71cbcd05cebbd82566bddd6d9a12346bf0fa6b78e8a566320cfc9d05
MD5 | e7ccaabfd1985c2a9c06f49329071176
BLAKE2b-256 | 32f61b92f5244e82ccaca7a696b32ab55df0e2e2577f39c55e74d998fe677313
File details
Details for the file webpage2content-1.3.4-py3-none-any.whl.
File metadata
- Download URL: webpage2content-1.3.4-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | fc4df43ea8e43758f37a3203b090272f71461879c11782806050df01fa22647a
MD5 | 5629ce2f2acc7633ee7a1a9345a4da73
BLAKE2b-256 | 6362d723bf5eeaa97b96252952c34b491c5b4b9e17ee9e8b58edd5ccff721830