NLWeb Crawler - Web crawling and indexing service

Project description

Crawler

Distributed web crawler for schema.org structured data.

Architecture

The crawler uses a master/worker pattern, with each role running as a separate pod in Kubernetes:

  • Master: Flask API + job scheduler
  • Worker: Queue processor (embedding + upload to Azure AI Search)

Flow: Parse schema.org sitemaps → queue JSON files → embed → upload
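
The flow above can be sketched in a few lines of standard-library Python. This is an illustrative outline only, not the package's actual code: the function names (`parse_sitemap`, `enqueue`, `fake_embed`, `process_queue`), the job file layout, and the use of a local directory as the queue are all assumptions, and `fake_embed` is a stand-in for a real embedding model. The upload step is passed in as a callable so that no Azure AI Search client is assumed.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by standard sitemap documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap used for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/recipes/1</loc></url>
  <url><loc>https://example.com/recipes/2</loc></url>
</urlset>"""


def parse_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def enqueue(queue_dir: Path, site: str, urls: list[str]) -> Path:
    """Master side: write one job as a JSON file; the directory acts as the queue."""
    index = len(list(queue_dir.glob("job-*.json")))
    job_path = queue_dir / f"job-{index:06d}.json"
    job_path.write_text(json.dumps({"site": site, "urls": urls}))
    return job_path


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call: a tiny deterministic vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def process_queue(queue_dir: Path, upload) -> int:
    """Worker side: embed each queued job, hand it to `upload`, then delete the job file."""
    processed = 0
    for job_path in sorted(queue_dir.glob("job-*.json")):
        job = json.loads(job_path.read_text())
        docs = [{"url": url, "vector": fake_embed(url)} for url in job["urls"]]
        upload(docs)
        job_path.unlink()
        processed += 1
    return processed
```

Separating `enqueue` from `process_queue` mirrors the master/worker split: the two sides only share the queue location, so each can run in its own process or pod.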

Endpoints

  • GET / - Web UI
  • GET /api/status - System status
  • POST /api/sites - Add site to crawl
  • GET /api/queue/status - Queue statistics

Commands

Run make help for the full list. Key targets:

make dev     # Run master + worker via Docker Compose
make test    # Run pytest
make build   # Build image to ACR
make deploy  # Deploy to AKS via Helm

Download files

Source Distribution

nlweb_crawler-0.7.0.tar.gz (85.1 kB)

Built Distribution

nlweb_crawler-0.7.0-py3-none-any.whl (94.1 kB)

File details

Details for the file nlweb_crawler-0.7.0.tar.gz.

File metadata

  • Download URL: nlweb_crawler-0.7.0.tar.gz
  • Upload date:
  • Size: 85.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.4 (uv publish, run in CI on Ubuntu 24.04)

File hashes

Hashes for nlweb_crawler-0.7.0.tar.gz

Algorithm     Hash digest
SHA256        9caaa38b0cf7622035e1df4523cf49ca64e1a96297fc6e7908d896d92142fc45
MD5           e742c1bd8448abd299ef15644f7bcab2
BLAKE2b-256   9ac084259351b0d4bacf8ee4450664a18db8caf0203dd6049237abcb95c114da

File details

Details for the file nlweb_crawler-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: nlweb_crawler-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 94.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.10.4 (uv publish, run in CI on Ubuntu 24.04)

File hashes

Hashes for nlweb_crawler-0.7.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        ef129661a609a062b3291b61634aac72ef632595131fd77f882763b60b735f1b
MD5           d56ba78c778ba2f9c66d3094670610be
BLAKE2b-256   aa376da6e1f363ec576b3149ae281b1ab6ac1f2c8a8c41752094a164474866d0