A library for detecting login fields in HTML using DistilBERT.

HTML Login Field Detector

html-login-field-detector is a Python library that identifies and processes login fields in HTML documents. It combines a DistilBERT model with web scraping tooling (Playwright) to automate login form detection in web applications.

Features

  • Detects login forms in HTML documents.
  • Uses Hugging Face's DistilBERT model for text processing.
  • Integrates seamlessly with Python web scraping workflows.
  • Supports GPU acceleration for faster processing.

Installation

Using pip

To install the library along with the CPU-compatible dependencies:

pip install "html-login-field-detector[cpu]"

For GPU support (pulls PyTorch wheels built for CUDA 11.8 from the PyTorch package index):

pip install "html-login-field-detector[gpu]" --extra-index-url https://download.pytorch.org/whl/cu118
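After a GPU install it can be worth confirming that PyTorch actually sees a CUDA device before running the detector. The check below is a minimal sketch: `describe_torch_device` is a hypothetical helper name, not part of this library, and whether the detector selects its device automatically is not stated here.

```python
import importlib.util


def describe_torch_device() -> str:
    """Report which device PyTorch would be able to use.

    Hypothetical helper for illustration; not part of html-login-field-detector.
    """
    # Avoid an ImportError when neither the [cpu] nor [gpu] extra is installed.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"

    import torch

    # torch.cuda.is_available() is True only when a CUDA build of PyTorch
    # finds a usable GPU and driver.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this prints `cpu` after installing the `[gpu]` extra, a likely cause is that pip resolved a CPU-only PyTorch wheel or that the installed driver does not match the CUDA 11.8 build.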

Install System Dependencies

Run the following command to install Playwright's system dependencies:

playwright install-deps

Usage

from login_field_detector import LoginFieldDetector

# Initialize the detector
detector = LoginFieldDetector()

# Detect login fields in an HTML document
html_source = "<html>...</html>"  # Your HTML content
result = detector.detect(html_source)

print(result)  # Output details of detected login fields
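When the HTML lives on disk rather than in a string, a small wrapper keeps the call site tidy. The sketch below assumes only what the usage example shows: an object exposing a `detect(html)` method. The helper name `detect_in_file` is hypothetical and not part of the library, and the result is passed through untouched because its structure is not documented here.

```python
from pathlib import Path
from typing import Any


def detect_in_file(detector: Any, path: str, encoding: str = "utf-8") -> Any:
    """Read an HTML file and hand its contents to detector.detect().

    Hypothetical helper; `detector` is any object with a detect(html) method,
    such as the LoginFieldDetector instance from the usage example.
    """
    # errors="replace" keeps badly encoded pages from raising UnicodeDecodeError.
    html_source = Path(path).read_text(encoding=encoding, errors="replace")
    return detector.detect(html_source)
```

With the library installed, `detect_in_file(LoginFieldDetector(), "login.html")` would return whatever `detect` returns for that file's contents.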

Dataset

This project includes a dataset of login page URLs for training and testing purposes, located at dataset/training_urls.json. The dataset can be extended or updated as needed.
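Before extending the dataset, it may help to load and sanity-check the URL list. The sketch below assumes the file is a flat JSON array of URL strings; the actual schema of `dataset/training_urls.json` is not described here, so adjust the parsing if the file is structured differently. `load_training_urls` is a hypothetical helper, not part of the library.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def load_training_urls(path: str = "dataset/training_urls.json") -> list[str]:
    """Load URLs from a JSON file, keeping unique http(s) entries in order.

    ASSUMPTION: the file contains a flat JSON array of URL strings.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of URL strings")

    urls: list[str] = []
    seen: set[str] = set()
    for item in data:
        if not isinstance(item, str):
            continue  # skip entries that are not plain strings
        url = item.strip()
        parsed = urlparse(url)
        # Keep only absolute http/https URLs that have a host, dropping duplicates.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Running this before and after editing the file gives a quick check that new entries are well-formed and not duplicates.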

Development

Clone the repository and install the dependencies locally:

git clone https://github.com/ByVictorrr/LoginFieldDetector.git
cd LoginFieldDetector

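# Install the package in editable mode with the GPU and test extras
pip install -e ".[gpu,test]"
# Download the browser binaries that Playwright drives
playwright install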

Running Tests

Run the tests using pytest:

pytest

License

This project is licensed under the MIT License.

Contributing

We welcome contributions! Please fork the repository, make changes, and submit a pull request.
