Async webpage type classifier using URL, meta, structural, and content signals

Page Classifier

An asynchronous webpage type classifier that combines URL, metadata, structural, and content signals to determine the type of a given web page (e.g., product page, blog post, or contact page).

Installation

You can install the package directly from PyPI:

pip install page-classifier

For development, clone the repository and install it using the Makefile or pip:

git clone <your-repository-url>
cd page-classifier
make install
# or, to include development dependencies
make dev

Usage

Python API

You can use the classifier programmatically in your asynchronous Python applications:

import asyncio
from page_classifier import PageClassifier, ClassifierConfig

async def main():
    # Initialize the classifier
    classifier = PageClassifier(config=ClassifierConfig())
    
    # Classify a URL
    result = await classifier.classify_url(
        url="https://www.ganpatihandicrafts.com/printed-kurti.html"
    )
    
    # View the results
    print(result.to_dict())

if __name__ == "__main__":
    asyncio.run(main())
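
Because `classify_url` is a coroutine, several URLs can be classified concurrently. The following is a minimal sketch, not part of the package: `classify_many` and `StubClassifier` are hypothetical names introduced here, and the stub only imitates the `classify_url(url=...)` call shown above so the example runs without network access. In real use you would pass a `PageClassifier` instance instead of the stub.

```python
import asyncio


class StubClassifier:
    """Hypothetical stand-in exposing the same async classify_url(url=...) call."""

    async def classify_url(self, url: str) -> dict:
        await asyncio.sleep(0)  # yield control, as a real network fetch would
        # Toy rule for illustration only; this is not the library's logic.
        page_type = "product" if "/product/" in url else "unknown"
        return {"url": url, "page_type": page_type}


async def classify_many(classifier, urls, max_concurrency: int = 5):
    """Classify several URLs concurrently, at most max_concurrency at a time.

    Returns a list of (url, outcome) pairs in input order, where outcome is
    either the classifier's result or the exception raised for that URL.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def classify_one(url):
        async with semaphore:
            try:
                return url, await classifier.classify_url(url=url)
            except Exception as exc:  # keep one failure from aborting the batch
                return url, exc

    return await asyncio.gather(*(classify_one(u) for u in urls))


if __name__ == "__main__":
    urls = [
        "https://example.com/product/123",
        "https://example.com/about",
    ]
    for url, outcome in asyncio.run(classify_many(StubClassifier(), urls)):
        print(url, outcome)
```

The semaphore caps the number of in-flight requests so a long URL list does not open too many connections at once; the limit of 5 is an arbitrary default chosen for this sketch.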

Command-Line Interface (CLI)

The package provides a built-in CLI to easily classify pages from your terminal:

python -m page_classifier "https://www.ganpatihandicrafts.com/printed-kurti.html"

CLI Options

  • url: The URL to fetch and classify.
  • --platform NAME: Force a specific platform instead of auto-detecting. Only that platform's routing rules will fire.
  • --timeout SECONDS: HTTP fetch timeout (default: 15.0).
  • --json: Print the full result as JSON instead of a summary.
  • --list-platforms: List all supported platform names and exit.

Example with JSON output and timeout:

python -m page_classifier "https://example.com/product/123" --timeout 10 --json
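
If you need to drive the CLI from a script, the options listed above can be assembled into an argument list. This is a sketch under stated assumptions: `build_command` is a hypothetical helper written for this example, not something the package provides, and it uses only the flags documented above.

```python
import sys


def build_command(url, platform=None, timeout=None, as_json=False):
    """Return the argv list for one `python -m page_classifier` invocation."""
    command = [sys.executable, "-m", "page_classifier", url]
    if platform is not None:
        command += ["--platform", platform]  # force a platform instead of auto-detecting
    if timeout is not None:
        command += ["--timeout", str(timeout)]  # HTTP fetch timeout in seconds
    if as_json:
        command.append("--json")  # full result as JSON instead of a summary
    return command


if __name__ == "__main__":
    print(build_command("https://example.com/product/123", timeout=10, as_json=True))
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)`. The structure of the `--json` output is not documented here, so inspect it with `json.loads` before relying on specific keys.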

Development

A Makefile is included to streamline development tasks:

  • make install: Install the project.
  • make dev: Install the project with development dependencies.
  • make test: Run the pytest suite.
  • make build: Build the distribution packages (sdist and wheel).
  • make publish: Build and publish the package to PyPI using twine.
  • make clean: Clean up build artifacts and cache directories.

Download files

Source Distribution

page_classifier-0.1.0.tar.gz (28.8 kB view details)

Uploaded Source

Built Distribution

page_classifier-0.1.0-py3-none-any.whl (34.5 kB view details)

Uploaded Python 3

File details

Details for the file page_classifier-0.1.0.tar.gz.

File metadata

  • Download URL: page_classifier-0.1.0.tar.gz
  • Upload date:
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.2

File hashes

Hashes for page_classifier-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       21b28c7d28582f102536eb1fb1d9aa0c5b419240b02d490845d1d04726084ac5
MD5          ef46b77b96468268a2c3d8a3dcb1c311
BLAKE2b-256  70d15a00e4c75a2bcd4e2408aa8b9c615112f536f72bb18b27be48c15af71dd8

File details

Details for the file page_classifier-0.1.0-py3-none-any.whl.

File hashes

Hashes for page_classifier-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       0af447c88b7c8217f430a6da5c2d4fde2ab898a153d696e55f30297fa21e1549
MD5          3c070cc5a4b5f807ac7599072081ed04
BLAKE2b-256  87ee37ce18f66810d5ddf51bf83fe54d0ce2df89d53c968fcbc7e676313e0537
