Webspot

Webspot is an intelligent web service to automatically detect web content and extract information from it.

Demo

Chinese (中文)

Screenshots

Detected Results

Extracted Fields

Extracted Data

Get Started

Docker

Make sure you have installed Docker and Docker Compose.

# clone the git repo and enter it
git clone https://github.com/crawlab-team/webspot
cd webspot

# start the docker containers in the background
docker-compose up -d

Then you can access the web UI at http://localhost:9999.
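
The containers may need a few seconds before the web UI answers. As a minimal sketch (the helper name `wait_for_http` and its timeout values are illustrative assumptions, not part of Webspot), the following polls a URL with the Python standard library until it responds:

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or reset: the service is probably still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:9999")` after `docker-compose up -d` returns `True` once the port answers and `False` if nothing responds within the timeout.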

API Reference

Once you have started Webspot, you can go to http://localhost:9999/redoc to view the API reference.
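
ReDoc pages are rendered from an OpenAPI document, so the available endpoints can also be listed from a script. The sketch below assumes the schema is served at http://localhost:9999/openapi.json, which is a common default but is not confirmed by this README; check the ReDoc page for the actual schema location. The function names are illustrative.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def list_operations(schema: dict) -> list:
    """Return (METHOD, path) pairs declared in an OpenAPI schema, sorted by path."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations, key=lambda op: (op[1], op[0]))


def fetch_schema(url: str = "http://localhost:9999/openapi.json") -> dict:
    """Download the OpenAPI document; the default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

With Webspot running, `list_operations(fetch_schema())` gives a quick overview of the routes without opening the browser.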

Architecture

The overall process of how Webspot detects meaningful elements from HTML or web pages is shown in the following figure.

graph LR
    hr[HtmlRequester]
    gl[GraphLoader]
    d[Detector]
    r[Results]

    hr --"html + json"--> gl --"graph"--> d --"output"--> r
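
To make the stages in the diagram concrete, here is a toy Python sketch of the same flow: HTML goes in, a tree (graph) is built, and a detector reports repeated sibling elements. It is not Webspot's actual algorithm or API. The `Node` class, `load_graph`, `detect_lists`, and the "same tag and class repeated at least three times" rule are all assumptions made for illustration, and the request stage is skipped by passing the HTML in as a string.

```python
from collections import Counter
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    tag: str
    css_class: str = ""
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    @property
    def signature(self) -> str:
        return f"{self.tag}.{self.css_class}" if self.css_class else self.tag


class GraphLoader(HTMLParser):
    """Hypothetical stand-in for the GraphLoader stage: HTML in, Node tree out."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def load_graph(html: str) -> Node:
    loader = GraphLoader()
    loader.feed(html)
    loader.close()
    return loader.root


def detect_lists(root: Node, min_items: int = 3) -> List[Tuple[str, str, int]]:
    """Return (parent signature, item signature, count) for repeated siblings."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        counts = Counter(child.signature for child in node.children)
        for sig, n in counts.items():
            if n >= min_items:
                results.append((node.signature, sig, n))
        stack.extend(node.children)
    return results
```

Calling `detect_lists(load_graph(html))` on a page plays the role of the Detector step and returns the candidate list containers as the "output" that the Results stage would present.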

Development

Follow the steps below to set up a development environment.

Pre-requisites

  • Python >=3.8 and <=3.10
  • Go 1.16 or higher
  • MongoDB 4.2 or higher
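
The Python range above can be checked from a script before installing anything. This is a small sketch that reads "<=3.10" as including every 3.10.x patch release, which is an assumption; the function name is illustrative, and the Go and MongoDB requirements are not checked here.

```python
import sys


def python_supported(version=None) -> bool:
    """True when (major, minor) lies between 3.8 and 3.10 inclusive."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) <= (3, 10)
```

Running `python_supported()` with no argument tests the interpreter that executes the script.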

Install dependencies

# dependencies
pip install -r requirements.txt

Configure Environment Variables

Database configuration is located in the .env file. You can copy the example file and modify it.

cp .env.example .env
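
After editing, it can be useful to confirm that the local .env still defines every variable listed in .env.example. The following sketch assumes simple KEY=VALUE lines and makes no assumption about which variables Webspot actually reads; the helper names are illustrative.

```python
from pathlib import Path


def env_keys(text: str) -> set:
    """Collect variable names from KEY=VALUE lines, skipping blanks and comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key:
            keys.add(key)
    return keys


def missing_keys(example_text: str, env_text: str) -> list:
    """Names defined in the example file but absent from the local file."""
    return sorted(env_keys(example_text) - env_keys(env_text))


def check_env_files(example: str = ".env.example", local: str = ".env") -> list:
    """Read both files from disk and report the missing variable names."""
    return missing_keys(Path(example).read_text(), Path(local).read_text())
```

An empty list from `check_env_files()` means the local file covers every name in the example file.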

Start web server

# start development server
python main.py web

Code Structure

The core code is located in the webspot directory. The main.py file is the entry point of the web server.

webspot
├── cmd     # command line tools
├── crawler # web crawler
├── data    # data files (html, json, etc.)
├── db      # database
├── detect  # web content detection
├── graph   # graph module
├── models  # models
├── request # request helper
├── test    # test cases
├── utils   # utilities
└── web     # web server

TODOs

Webspot aims to automate web content detection and extraction. It is not yet ready for production use. The following features are planned:

  • Table detection
  • Nested list detection
  • Export to spiders
  • Advanced browser request

Disclaimer

Please follow local laws and regulations when using Webspot. The author is not responsible for any legal issues caused by its use. Please read the Disclaimer for details.

Community

If you are interested in Webspot, add the author's WeChat account "tikazyq1" with the note "Webspot" to join the discussion group.

