A library for detecting login fields in HTML using DistilBERT.
HTML Login Field Detector
html-login-field-detector is a Python library that identifies login fields in HTML documents. It combines a DistilBERT model with Playwright-based web scraping tooling to automate form detection in web applications.
Features
- Detects login forms in HTML documents.
- Uses Hugging Face's DistilBERT model for text processing.
- Integrates with Python web scraping workflows.
- Supports GPU acceleration for faster processing.
Installation
Using pip
To install the library with its CPU-compatible dependencies (the quotes keep shells such as zsh from expanding the square brackets):
pip install "html-login-field-detector[cpu]"
For GPU compatibility:
pip install "html-login-field-detector[gpu]" --extra-index-url https://download.pytorch.org/whl/cu118
Install System Dependencies
Run the following command to install Playwright's system dependencies:
playwright install-deps
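After installing, it can help to confirm what actually landed in the environment. The following is a minimal sketch, not part of the library: `installation_report` is a hypothetical helper name, and it assumes the distribution is named `html-login-field-detector`, the import name is `login_field_detector`, and the GPU extra pulls in PyTorch (suggested by the PyTorch wheel index in the GPU install command).

```python
from importlib import metadata, util


def installation_report(dist_name="html-login-field-detector",
                        module_name="login_field_detector"):
    """Summarize whether the package is installed and whether CUDA is visible.

    Hypothetical helper for illustration; not provided by the library.
    """
    report = {"version": None, "importable": False, "cuda": None}

    try:
        report["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass  # the distribution is not installed in this environment

    # find_spec returns None for a top-level module that cannot be found
    report["importable"] = util.find_spec(module_name) is not None

    try:
        import torch  # assumption: the GPU extra depends on PyTorch
        report["cuda"] = torch.cuda.is_available()
    except ImportError:
        pass  # PyTorch is missing, so "cuda" stays None

    return report
```

Calling `installation_report()` returns a dictionary; a `cuda` value of `False` after installing the GPU extra would suggest that the CUDA build of PyTorch or the driver is not being picked up.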
Usage
from login_field_detector import LoginFieldDetector
# Initialize the detector
detector = LoginFieldDetector()
# Detect login fields in an HTML document
html_source = "<html>...</html>"  # Your HTML content
result = detector.detect(html_source)
print(result)  # Output details of detected login fields
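In practice the HTML usually comes from saved pages rather than a string literal. The sketch below wraps the `detect` call shown above for files on disk. `detect_in_file` and `detect_in_files` are hypothetical helper names, not library functions, and the code makes no assumption about what `detect` returns; it only passes the result through.

```python
from pathlib import Path


def detect_in_file(detector, path, encoding="utf-8"):
    """Read one HTML file and pass its text to detector.detect()."""
    # errors="replace" keeps badly encoded pages from raising on read
    html_source = Path(path).read_text(encoding=encoding, errors="replace")
    return detector.detect(html_source)


def detect_in_files(detector, paths):
    """Run detection over several files, keyed by the path as a string."""
    return {str(p): detect_in_file(detector, p) for p in paths}
```

With the library installed, this would be called as `detect_in_file(LoginFieldDetector(), "page.html")`. Creating the detector once and reusing it across files avoids repeating whatever setup its constructor performs.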
Dataset
This project includes a dataset of login page URLs for training and testing purposes, located at dataset/training_urls.json. The dataset can be extended or updated as needed.
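Extending the dataset can be scripted. The following is a sketch under one loud assumption: that `dataset/training_urls.json` holds a flat JSON array of URL strings. The actual schema is not documented here, so check the file and adjust before relying on this. `load_urls` and `extend_urls` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_urls(path="dataset/training_urls.json"):
    """Load the URL list. Assumes a flat JSON array of strings."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array; adjust for the real schema")
    return data


def extend_urls(path, new_urls):
    """Append URLs not already present, keep order, and write the file back."""
    urls = load_urls(path)
    seen = set(urls)
    for url in new_urls:
        if url not in seen:
            urls.append(url)
            seen.add(url)
    Path(path).write_text(json.dumps(urls, indent=2) + "\n", encoding="utf-8")
    return urls
```

Deduplicating on insert keeps repeated runs from inflating the training set with the same login page.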
Development
Clone the repository and install the dependencies locally:
git clone https://github.com/ByVictorrr/LoginFieldDetector.git
cd LoginFieldDetector
# Install dependencies
pip install -e ".[gpu,test]"
playwright install
Running Tests
Run the tests using pytest:
pytest
License
This project is licensed under the MIT License.
Contributing
We welcome contributions! Please fork the repository, make changes, and submit a pull request.
Links
- Homepage: ByVictorrr on GitHub
- Repository: LoginFieldDetector
- Documentation: Docs
- Dataset: dataset/training_urls.json
File details
Details for the file html_login_field_detector-0.1.6.tar.gz.
File metadata
- Download URL: html_login_field_detector-0.1.6.tar.gz
- Upload date:
- Size: 22.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ad34c179990b80820256de1790e34b70bc85c0c6da97db32a911eb4816e64e1 |
| MD5 | cf948222f1e0820534d7fb9579c17ff4 |
| BLAKE2b-256 | fd19738bdd72aab60f4103e3ebb91b19a0dcc1c11945b83448e3e9e75cc86e58 |
File details
Details for the file html_login_field_detector-0.1.6-py3-none-any.whl.
File metadata
- Download URL: html_login_field_detector-0.1.6-py3-none-any.whl
- Upload date:
- Size: 22.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df267ba5cd9a00f188b00f3663a6907b2ffc1441f8598576dd15f53d523b5bd6 |
| MD5 | f8b55b1f7f7960d3a31993b7df5ac2f9 |
| BLAKE2b-256 | dd3ec51585622bedc01288f6b5d51eef2ce58de48dbfec276a0ab12df9c0ba19 |
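The published digests can be used to check a downloaded file before installing it. This is a minimal sketch using only the standard library; `sha256_of` and `matches_published_digest` are hypothetical helper names, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_published_digest("html_login_field_detector-0.1.6.tar.gz", "1ad34c179990b80820256de1790e34b70bc85c0c6da97db32a911eb4816e64e1")`. A `False` result means the file differs from the one that was uploaded.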