NLWeb Crawler - Web crawling and indexing service

Crawler

Distributed web crawler for schema.org structured data.

Architecture

The crawler follows a master/worker pattern, with each role running as a separate pod in Kubernetes:

  • Master: Flask API + job scheduler
  • Worker: Queue processor (embedding + upload to Azure AI Search)

Flow: Parse schema.org sitemaps → queue JSON files → embed → upload
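
The flow above can be sketched end to end with the standard library alone. This is a minimal illustration, not the package's actual code: it assumes a standard sitemaps.org `urlset` document, uses a local directory as the queue, and replaces the embedding model and the Azure AI Search upload with placeholders. All function names (`parse_sitemap`, `enqueue`, `embed`, `process_queue`, `demo`) are hypothetical.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by standard sitemaps.org documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def enqueue(urls, queue_dir):
    """Master side: write one JSON job file per URL into the queue directory."""
    for i, url in enumerate(urls):
        (queue_dir / f"job-{i:04d}.json").write_text(json.dumps({"url": url}))


def embed(text):
    """Placeholder embedding; a real worker would call an embedding model here."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def process_queue(queue_dir, upload):
    """Worker side: read each job file, embed it, upload it, then delete the file."""
    processed = 0
    for path in sorted(queue_dir.glob("*.json")):
        job = json.loads(path.read_text())
        # `upload` stands in for the search-index client (an assumption).
        upload({"url": job["url"], "vector": embed(job["url"])})
        path.unlink()
        processed += 1
    return processed


def demo():
    """Run the whole flow on a two-entry sitemap and collect the uploads in a list."""
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a.json</loc></url>"
        "<url><loc>https://example.com/b.json</loc></url>"
        "</urlset>"
    )
    uploaded = []
    with tempfile.TemporaryDirectory() as tmp:
        queue_dir = Path(tmp)
        enqueue(parse_sitemap(sitemap), queue_dir)
        count = process_queue(queue_dir, uploaded.append)
    return count, uploaded
```

Passing the upload step in as a callable keeps the worker loop independent of the storage backend, which makes it easy to exercise without cloud credentials.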

Endpoints

  • GET / - Web UI
  • GET /api/status - System status
  • POST /api/sites - Add site to crawl
  • GET /api/queue/status - Queue statistics
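
A small client for the endpoints above might look like the sketch below. Only the paths and HTTP methods come from the list; the base URL, the JSON response format, and the request body for `POST /api/sites` are assumptions, and the helper names `build_request` and `call` are hypothetical.

```python
import json
import urllib.request

# Assumption: the master listens on Flask's default development port.
BASE_URL = "http://localhost:5000"


def build_request(method, path, payload=None):
    """Build (but do not send) a request for one of the listed endpoints."""
    data = None
    headers = {}
    if payload is not None:
        # Assumption: the API accepts a JSON request body.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(method, path, payload=None):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_request(method, path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the master running, `call("GET", "/api/status")` or `call("GET", "/api/queue/status")` would fetch the status documents. Adding a site would be something like `call("POST", "/api/sites", {"site_url": "https://example.com"})`, where the `site_url` field name is a guess and should be checked against the actual API.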

Commands

Run make help for the full list. Key targets:

make dev     # Run master + worker via Docker Compose
make test    # Run pytest
make build   # Build image to ACR
make deploy  # Deploy to AKS via Helm

Source Distribution

nlweb_crawler-0.7.1.tar.gz (85.6 kB)

Built Distribution

nlweb_crawler-0.7.1-py3-none-any.whl (94.5 kB)
