Webspot
Webspot is an intelligent web service to automatically detect web content and extract information from it.
Screenshots
Detected Results
Extracted Fields
Extracted Data
Get Started
Docker
Make sure you have installed Docker and Docker Compose.
# clone git repo and enter it
git clone https://github.com/crawlab-team/webspot
cd webspot
# start docker containers
docker-compose up -d
Then you can access the web UI at http://localhost:9999.
API Reference
Once you have started Webspot, you can go to http://localhost:9999/redoc to view the API reference.
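As a quick sanity check, the snippet below builds the API reference URL and tests whether it responds. It is a minimal sketch using only the Python standard library; the helper names `redoc_url` and `is_reachable` are illustrative and are not part of Webspot.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Default address from the Docker setup described above.
BASE_URL = "http://localhost:9999"


def redoc_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the API reference page (hypothetical helper)."""
    return urljoin(base_url.rstrip("/") + "/", "redoc")


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_reachable(redoc_url())` after `docker-compose up -d` should return `True` once the containers have finished starting.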
Architecture
The overall process of how Webspot detects meaningful elements from HTML or web pages is shown in the following figure.
graph LR
hr[HtmlRequester]
gl[GraphLoader]
d[Detector]
r[Results]
hr --"html + json"--> gl --"graph"--> d --"output"--> r
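To make the data flow in the figure concrete, here is a toy sketch of the same four stages. It is not Webspot's implementation: the class names are borrowed from the diagram, but their methods, the `Node` type, and the detection rule (a parent with at least three children sharing a tag) are assumptions made for illustration. The requester reads from a string rather than the network, and only HTML (no JSON) is passed along.

```python
from collections import Counter
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Tags that never have a closing tag, so they must not become the current parent.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


class HtmlRequester:
    """Stage 1 (toy): hands back HTML that was supplied as a string."""

    def __init__(self, html: str):
        self.html = html

    def run(self) -> str:
        return self.html


class GraphLoader(HTMLParser):
    """Stage 2 (toy): builds a tree of Node objects; assumes well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._current.parent is not None:
            self._current = self._current.parent

    def load(self, html: str) -> Node:
        self.feed(html)
        return self.root


class Detector:
    """Stage 3 (toy): reports parents with many children of the same tag."""

    def __init__(self, min_items: int = 3):
        self.min_items = min_items

    def detect(self, root: Node) -> List[Dict[str, object]]:
        results = []
        stack = [root]
        while stack:
            node = stack.pop()
            counts = Counter(child.tag for child in node.children)
            for tag, n in counts.items():
                if n >= self.min_items:
                    results.append({"parent": node.tag, "item": tag, "count": n})
            stack.extend(node.children)
        return results


html = "<html><body><h1>Shop</h1><ul><li>A</li><li>B</li><li>C</li></ul></body></html>"
graph = GraphLoader().load(HtmlRequester(html).run())  # html -> graph
results = Detector().detect(graph)                     # graph -> output
print(results)
```

Each stage consumes only the previous stage's output, which mirrors the left-to-right chain in the diagram and keeps the stages independently testable.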
Development
Follow the steps below to set up a development environment.
Pre-requisites
- Python >=3.8 and <=3.10
- Go 1.16 or higher
- MongoDB 4.2 or higher
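The Python constraint above can be checked programmatically before installing dependencies. This is a small sketch; the function name `python_supported` is hypothetical, and it assumes "<=3.10" means any 3.10.x patch release is acceptable.

```python
import sys

MIN_PYTHON = (3, 8)   # lower bound from the pre-requisites list
MAX_PYTHON = (3, 10)  # upper bound, compared on (major, minor) only


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version lies within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if not python_supported():
    print("Warning: this Python version is outside the range listed above.")
```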
Install dependencies
# dependencies
pip install -r requirements.txt
Configure Environment Variables
Database configuration is located in the .env file. You can copy the example file and modify it.
cp .env.example .env
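For readers unfamiliar with the format, a .env file is a list of `KEY=VALUE` lines. The sketch below parses such text with the standard library only. The key names in the sample are hypothetical placeholders; consult .env.example for the variables Webspot actually reads.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


sample = """
# hypothetical keys, for illustration only
EXAMPLE_DB_HOST=localhost
EXAMPLE_DB_PORT="27017"
"""
config = parse_env(sample)
print(config)
```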
Start web server
# start development server
python main.py web
Code Structure
The core code is located in the webspot directory. The main.py file is the entry point of the web server.
webspot
├── cmd # command line tools
├── crawler # web crawler
├── data # data files (html, json, etc.)
├── db # database
├── detect # web content detection
├── graph # graph module
├── models # models
├── request # request helper
├── test # test cases
├── utils # utilities
└── web # web server
TODOs
Webspot aims to automate web content detection and extraction. It is still far from ready for production use. The following features are planned:
- Table detection
- Nested list detection
- Export to spiders
- Advanced browser request
Disclaimer
Please follow local laws and regulations when using Webspot. The author is not responsible for any legal issues caused by its use. Please read the Disclaimer for details.
Community
If you are interested in Webspot, please add the author's WeChat account "tikazyq1" noting "Webspot" to enter the discussion group.
Download files
File details
Details for the file webspot-0.1.4.tar.gz.
File metadata
- Download URL: webspot-0.1.4.tar.gz
- Upload date:
- Size: 34.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | d1d2a1241d4f4d78f6f954b2b26947f89c72bbfb40392ccd50ef38f2f5bbd76e
MD5 | be64b5997b2fabd615eddffbef74f2f1
BLAKE2b-256 | 63afaf638f6fdfe3b38ef11618cb22ff29f84198f6ae71f51eba7dccf0519e56
File details
Details for the file webspot-0.1.4-py3-none-any.whl.
File metadata
- Download URL: webspot-0.1.4-py3-none-any.whl
- Upload date:
- Size: 47.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 20a206ecc0c9c989b56acb32c722a410b1f83f14224dca3be3950e047fe95d92
MD5 | e6b3d4f9e1f751847608e4c35c7d5cd2
BLAKE2b-256 | 02ac0421ad36913da9dfd2bb4afa7e8fdb3d05a84e5a3fddc8fe2fff84c0dad8
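The SHA256 digests listed above can be used to verify a downloaded file. The following is a minimal sketch with `hashlib`; the helper names `sha256_of_file` and `verify` are illustrative, and the script assumes the file keeps its original name on disk.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "webspot-0.1.4.tar.gz":
        "d1d2a1241d4f4d78f6f954b2b26947f89c72bbfb40392ccd50ef38f2f5bbd76e",
    "webspot-0.1.4-py3-none-any.whl":
        "20a206ecc0c9c989b56acb32c722a410b1f83f14224dca3be3950e047fe95d92",
}


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of_file(path) == expected
```

For example, `verify("webspot-0.1.4.tar.gz")` returns `False` for a corrupted or tampered download, and also for any file whose name is not in the table.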