Async webpage type classifier using URL, meta, structural, and content signals
Page Classifier
An asynchronous webpage type classifier that combines URL, metadata, structural, and content signals to determine the type of a given web page (e.g., product page, blog post, or contact-us page).
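To make the idea of a "URL signal" concrete, here is a purely illustrative sketch. It is not the package's actual logic: the `URL_HINTS` table, its keywords, and the `guess_from_url` function are all hypothetical names invented for this example, and a real classifier would weigh this kind of hint against metadata, structural, and content signals.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical keyword hints, invented for illustration only.
# They are NOT taken from page-classifier's rule set.
URL_HINTS = {
    "product": ("product", "item", "shop"),
    "blog_post": ("blog", "post", "article"),
    "contact": ("contact",),
}


def guess_from_url(url: str) -> Optional[str]:
    """Return a page-type guess based on keywords in the URL path, or None."""
    path = urlparse(url).path.lower()
    for page_type, keywords in URL_HINTS.items():
        # The first page type with a matching keyword wins.
        if any(keyword in path for keyword in keywords):
            return page_type
    return None


if __name__ == "__main__":
    print(guess_from_url("https://example.com/product/123"))
```

A URL alone is often ambiguous (many paths match no keyword and yield `None`), which is presumably why the classifier draws on several kinds of signal rather than one.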
Installation
You can install the package directly from PyPI (once published):
pip install page-classifier
For development, clone the repository and install it using the Makefile or pip:
git clone <your-repository-url>
cd page-classifier
make install
# or for dev dependencies
make dev
Usage
Python API
You can use the classifier programmatically in your asynchronous Python applications:
import asyncio

from page_classifier import PageClassifier, ClassifierConfig


async def main():
    # Initialize the classifier
    classifier = PageClassifier(config=ClassifierConfig())
    # Classify a URL
    result = await classifier.classify_url(
        url="https://www.ganpatihandicrafts.com/printed-kurti.html"
    )
    # View the results
    print(result.to_dict())


if __name__ == "__main__":
    asyncio.run(main())
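Because `classify_url` is a coroutine, several pages can be classified concurrently. The following is a minimal sketch of one way to do that with `asyncio.gather` and a semaphore that caps the number of requests in flight. The `classify_many` helper and its `max_concurrency` parameter are hypothetical, and `StubClassifier` is a stand-in so the example runs without the package or network access. In real use you would pass a `PageClassifier` instance instead.

```python
import asyncio


class StubClassifier:
    """Stand-in for PageClassifier so this sketch runs offline (hypothetical)."""

    async def classify_url(self, url: str) -> dict:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {"url": url}


async def classify_many(classifier, urls, max_concurrency=5):
    """Classify all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def classify_one(url):
        async with semaphore:
            return await classifier.classify_url(url=url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(classify_one(url) for url in urls))


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b"]
    for result in asyncio.run(classify_many(StubClassifier(), pages)):
        print(result)
```

The semaphore is a design choice for politeness toward the target sites. Whether the classifier applies its own limits internally is not stated here, so treat the cap as a precaution.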
Command-Line Interface (CLI)
The package provides a built-in CLI to easily classify pages from your terminal:
python -m page_classifier "https://www.ganpatihandicrafts.com/printed-kurti.html"
CLI Options
- url: The URL to fetch and classify.
- --platform NAME: Force a specific platform instead of auto-detecting. Only that platform's routing rules will fire.
- --timeout SECONDS: HTTP fetch timeout (default: 15.0).
- --json: Print the full result as JSON instead of a summary.
- --list-platforms: List all supported platform names and exit.
Example with JSON output and timeout:
python -m page_classifier "https://example.com/product/123" --timeout 10 --json
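If you want to drive the CLI from a script, the documented options can be assembled into a command programmatically. This is a sketch under assumptions: `build_command` and `classify_via_cli` are hypothetical helper names, and `classify_via_cli` assumes that with `--json` the tool writes only a JSON document to standard output, which is not confirmed here.

```python
import json
import subprocess
import sys


def build_command(url, platform=None, timeout=None, as_json=False):
    """Build the argument list for `python -m page_classifier` from documented options."""
    command = [sys.executable, "-m", "page_classifier", url]
    if platform is not None:
        command += ["--platform", platform]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if as_json:
        command.append("--json")
    return command


def classify_via_cli(url, platform=None, timeout=None):
    """Run the CLI with --json and parse stdout.

    Assumption: stdout contains only the JSON result when --json is given.
    """
    command = build_command(url, platform=platform, timeout=timeout, as_json=True)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


if __name__ == "__main__":
    # Only prints the command; calling classify_via_cli() needs the package installed.
    print(build_command("https://example.com/product/123", timeout=10, as_json=True))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with URLs that contain characters such as `&` or `?`.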
Development
A Makefile is included to streamline development tasks:
- make install: Install the project.
- make dev: Install the project with development dependencies.
- make test: Run the pytest suite.
- make build: Build the distribution packages (sdist and wheel).
- make publish: Build and publish the package to PyPI using twine.
- make clean: Clean up build artifacts and cache directories.
Download files
File details
Details for the file page_classifier-0.1.0.tar.gz.
File metadata
- Download URL: page_classifier-0.1.0.tar.gz
- Upload date:
- Size: 28.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 21b28c7d28582f102536eb1fb1d9aa0c5b419240b02d490845d1d04726084ac5 |
| MD5 | ef46b77b96468268a2c3d8a3dcb1c311 |
| BLAKE2b-256 | 70d15a00e4c75a2bcd4e2408aa8b9c615112f536f72bb18b27be48c15af71dd8 |
File details
Details for the file page_classifier-0.1.0-py3-none-any.whl.
File metadata
- Download URL: page_classifier-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0af447c88b7c8217f430a6da5c2d4fde2ab898a153d696e55f30297fa21e1549 |
| MD5 | 3c070cc5a4b5f807ac7599072081ed04 |
| BLAKE2b-256 | 87ee37ce18f66810d5ddf51bf83fe54d0ce2df89d53c968fcbc7e676313e0537 |