NLWeb Crawler - Web crawling and indexing service
Crawler
Distributed web crawler for schema.org structured data.
Architecture
Master/worker pattern running as separate pods in Kubernetes:
- Master: Flask API + job scheduler
- Worker: Queue processor (embedding + upload to Azure AI Search)
Flow: Parse schema.org sitemaps → queue JSON files → embed → upload
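The first two steps of the flow above (parse a sitemap, queue JSON files) can be sketched as follows. This is a minimal illustration, not the crawler's actual code: it assumes a standard sitemaps.org XML sitemap, and the function names `parse_sitemap` and `enqueue`, the `job-NNNNN.json` file naming, and the `{"url": ...}` job format are all hypothetical. The embed and upload steps are left out because they depend on the Azure services the worker is configured with.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def enqueue(urls: list[str], queue_dir: Path) -> list[Path]:
    """Write one JSON job file per URL (hypothetical job format)."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        path = queue_dir / f"job-{i:05d}.json"
        path.write_text(json.dumps({"url": url}), encoding="utf-8")
        paths.append(path)
    return paths


# Example input: a two-entry sitemap held in memory.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc> https://example.com/b </loc></url>"
    "</urlset>"
)

sample_urls = parse_sitemap(SAMPLE_SITEMAP)
sample_queue = Path(tempfile.mkdtemp()) / "queue"
sample_jobs = enqueue(sample_urls, sample_queue)
```

Writing one file per URL keeps each job independent, so a worker in this sketch could pick up and process files one at a time.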
Endpoints
- GET / - Web UI
- GET /api/status - System status
- POST /api/sites - Add site to crawl
- GET /api/queue/status - Queue statistics
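As a rough illustration of calling these endpoints, the sketch below builds requests with Python's standard `urllib.request`. Several details are assumptions, not taken from the project: the base URL `http://localhost:5000`, the helper name `build_request`, and above all the JSON body `{"url": ...}` for POST /api/sites, whose real schema is not documented here.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the master is reachable locally on port 5000.
BASE_URL = "http://localhost:5000"


def build_request(
    method: str, path: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a request, JSON-encoding the payload if given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


status_req = build_request("GET", "/api/status")
queue_req = build_request("GET", "/api/queue/status")
# Hypothetical request body: the actual field names may differ.
add_site_req = build_request("POST", "/api/sites", {"url": "https://example.com"})

# To send a request against a running master:
#     with urllib.request.urlopen(status_req) as resp:
#         print(resp.read().decode("utf-8"))
```

The requests are built separately from sending so the example can be inspected without a running service.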
Commands
Run make help for the full list. Key targets:
make dev # Run master + worker via Docker Compose
make test # Run pytest
make build # Build image to ACR
make deploy # Deploy to AKS via Helm