NLWeb Crawler - Web crawling and indexing service

Crawler

Distributed web crawler for schema.org structured data.

Architecture

Master/worker pattern running as separate pods in Kubernetes:

  • Master: Flask API + job scheduler
  • Worker: Queue processor (embedding + upload to Azure AI Search)

Flow: Parse schema.org sitemaps → queue JSON files → embed → upload
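The queue → embed → upload part of the flow can be sketched in a few lines. This is only an illustration under stated assumptions, not the package's actual code: the in-memory queue, `enqueue_items`, `drain_queue`, and `fake_embed` are all hypothetical names, and the real worker would call an embedding model and Azure AI Search instead of the placeholders used here.

```python
import json
import queue
from typing import Callable

# Hypothetical in-memory stand-in for the crawler's job queue.
job_queue: "queue.Queue[str]" = queue.Queue()


def enqueue_items(items: list[dict]) -> int:
    """Master side: serialize each schema.org item to JSON and queue it."""
    for item in items:
        job_queue.put(json.dumps(item))
    return len(items)


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding (assumption); a real worker would call a model."""
    return [float(len(text))]


def drain_queue(
    embed: Callable[[str], list[float]],
    upload: Callable[[dict], None],
) -> int:
    """Worker side: pop queued JSON, embed it, and pass the result to an uploader."""
    processed = 0
    while not job_queue.empty():
        raw = job_queue.get()
        doc = json.loads(raw)
        doc["embedding"] = embed(raw)
        upload(doc)
        job_queue.task_done()
        processed += 1
    return processed


# Usage: collect "uploads" in a list instead of sending them to a search index.
uploaded: list[dict] = []
enqueue_items([
    {"@type": "Recipe", "name": "Pancakes"},
    {"@type": "Article", "name": "News"},
])
count = drain_queue(fake_embed, uploaded.append)
print(count, [d["name"] for d in uploaded])
```

Passing the embedder and uploader in as callables keeps the worker loop testable without any cloud credentials.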

Endpoints

  • GET / - Web UI
  • GET /api/status - System status
  • POST /api/sites - Add site to crawl
  • GET /api/queue/status - Queue statistics
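A minimal client sketch for the endpoints above, using only the standard library. The base URL (Flask's default development port) and the JSON body shape for POST /api/sites (a single "url" field) are assumptions; the README does not document either, so check the master's actual API before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: Flask's default dev port


def status_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the system status endpoint."""
    return urllib.request.Request(f"{base_url}/api/status", method="GET")


def add_site_request(site_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request that asks the master to crawl a site."""
    # Assumption: the endpoint accepts a JSON body with a "url" field.
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/sites",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = add_site_request("https://example.com")
print(req.get_method(), req.full_url)

# To send it against a running master:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes it easy to inspect or log what would go over the wire.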

Commands

Run make help for the full list. Key targets:

make dev     # Run master + worker via Docker Compose
make test    # Run pytest
make build   # Build image to ACR
make deploy  # Deploy to AKS via Helm

Download files

Source Distribution

nlweb_crawler-0.6.0.tar.gz (84.3 kB)

Uploaded Source

Built Distribution

nlweb_crawler-0.6.0-py3-none-any.whl (93.3 kB)

Uploaded Python 3

File details

Details for the file nlweb_crawler-0.6.0.tar.gz.

File metadata

  • Download URL: nlweb_crawler-0.6.0.tar.gz
  • Upload date:
  • Size: 84.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.2 (uv publish, CI run on Ubuntu 24.04 noble)

File hashes

Hashes for nlweb_crawler-0.6.0.tar.gz
Algorithm    Hash digest
SHA256       e489d2e4ab62a166d427b9319bcc9049faba89eb8c4bf5bf6c5644265bef6c5c
MD5          6eefcdde7722319b837869f7efab7674
BLAKE2b-256  6aa9907394770378e8ab04e7f269ee77806811311cbcee103c275f2935759acf
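
One way to check a downloaded archive against the SHA256 digest listed above is sketched below with Python's `hashlib`. The local file path is an assumption (wherever you saved the download), and the helper names `sha256_of` and `matches` are illustrative only.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for nlweb_crawler-0.6.0.tar.gz
EXPECTED_SHA256 = "e489d2e4ab62a166d427b9319bcc9049faba89eb8c4bf5bf6c5644265bef6c5c"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("nlweb_crawler-0.6.0.tar.gz")  # assumed local download path
    if archive.exists():
        print("OK" if matches(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

The same functions work for the wheel by passing its path and its own SHA256 digest as `expected`.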

File details

Details for the file nlweb_crawler-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: nlweb_crawler-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 93.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.2 (uv publish, CI run on Ubuntu 24.04 noble)

File hashes

Hashes for nlweb_crawler-0.6.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       9f46e9b9637d80165ef6fe29e9cc583702bc94cc31a83ce352c48073ddd6663a
MD5          a7b05d0d99a94dbe1b88fd2aec3e80fd
BLAKE2b-256  5c739ee3eaf904b9af12f6b4426d6e838cff58aa848deb9c6463caa4d6256886
