
A modular adaptive web crawler framework with incremental, continuous, and scope-limited crawling support.

Project description

Summary — The 9 Foundational Crawler Categories

  • Purpose - Why is it crawling? (example: scraping crawler)
  • Crawl Strategy - How does it traverse pages? (examples: BFS, DFS, focused)
  • Scheduling Behavior - When does it crawl? (examples: incremental, continuous)
  • Architecture - How is it organized? (example: distributed crawler)
  • Scope - How wide does it crawl? (example: site-wide crawler)
  • Data Access Method - Where does it get data from? (example: API crawler)
  • Ethical Behavior - How does it treat site rules? (example: polite crawler)
  • Intelligence Level - How does it make decisions? (example: LLM-guided crawler)
  • Integration Role - How does it fit into a pipeline? (example: modular crawler)

By Purpose (Goal or Intent)

Defines why the crawler exists — what it aims to achieve.

Type Description

  • Discovery Crawler - Finds and collects URLs or metadata; builds an index or frontier.
  • Scraping Crawler - Extracts structured data from pages (e.g., tables, text, entities).
  • Monitoring Crawler - Tracks updates or changes in content over time.
  • Archival Crawler - Saves full page copies for preservation or offline analysis.
  • Testing / Auditing Crawler - Used for SEO, broken-link checking, or site compliance validation.

By Crawl Strategy (Traversal Logic)

Defines how pages are selected or ordered during crawling.

Type Description

  • Breadth-First (BFS) - Crawls all pages at one depth before moving deeper.
  • Depth-First (DFS) - Crawls one path as deep as possible before backtracking.
  • Priority-Based - Assigns numerical priority scores to URLs and crawls the highest-scoring first.
  • Adaptive - Adjusts strategy dynamically based on feedback or results.
  • Context-Aware - Uses HTML structure and semantics to guide crawling decisions.
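The BFS/DFS split comes down to how the URL frontier is consumed: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first. A minimal sketch over a toy in-memory link graph (illustrative only, not the smartspider API):

```python
from collections import deque

# Toy link graph: each URL maps to the links discovered on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl(start, *, depth_first=False):
    """Traverse LINKS from `start`; FIFO pop = BFS, LIFO pop = DFS."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# BFS visits level by level; DFS follows one branch to the bottom first.
assert crawl("/") == ["/", "/a", "/b", "/a1", "/a2", "/b1"]
assert crawl("/", depth_first=True) == ["/", "/b", "/b1", "/a", "/a2", "/a1"]
```

A priority-based strategy replaces the deque with a heap keyed on a score; an adaptive one updates that score function as results come in.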

By Scheduling Behavior (Temporal Logic)

Defines when and how often the crawler operates.

Type Description

  • One-Shot - Runs once and stops after completion.
  • Incremental - Re-crawls known pages periodically to detect updates.
  • Continuous - Never stops; constantly cycles through crawl/revisit loops.
  • Event-Driven - Triggered by signals such as webhooks or detected changes.
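An incremental crawler needs a cheap way to decide whether a known page actually changed since the last visit. One common approach (sketched here with hypothetical names, not smartspider's actual mechanism) is to keep a content hash per URL:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    """Stable content hash used to detect page changes between crawls."""
    return hashlib.sha256(body).hexdigest()

def changed_since_last_visit(url: str, body: bytes, state: dict) -> bool:
    """Return True if the page is new or its content changed; update state."""
    fp = fingerprint(body)
    changed = state.get(url) != fp
    state[url] = fp
    return changed

state = {}
assert changed_since_last_visit("https://example.com", b"v1", state)      # first visit
assert not changed_since_last_visit("https://example.com", b"v1", state)  # unchanged
assert changed_since_last_visit("https://example.com", b"v2", state)      # updated
```

A continuous crawler would run this check inside an endless revisit loop; an event-driven one would run it only when a webhook or feed signals a possible change.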

By Scope (Coverage Target)

Defines how broadly the crawler explores the web.

Type Description

  • Focused - Crawls only relevant pages selected by topic or keyword.
  • Site-Wide - Restricted to one domain or subdomain.
  • Multi-Domain - Crawls a fixed list of domains.
  • Vertical / Domain-Specific - Focused on one industry (e.g., sports, jobs, e-commerce).
  • Web-Scale - Crawls the entire public web; search-engine level.
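Scope is usually enforced as a predicate applied to every candidate URL before it enters the frontier. A minimal sketch (function and parameter names are illustrative): one allowed domain gives a site-wide crawler, a list gives multi-domain, and no restriction at all is the web-scale case.

```python
from urllib.parse import urlparse

def in_scope(url, *, allowed_domains=None, allow_subdomains=True):
    """Decide whether a URL falls inside the crawl scope.

    allowed_domains=None -> web-scale (everything in scope)
    one domain           -> site-wide
    several domains      -> multi-domain
    """
    if allowed_domains is None:
        return True
    host = urlparse(url).netloc.lower()
    for domain in allowed_domains:
        if host == domain or (allow_subdomains and host.endswith("." + domain)):
            return True
    return False

assert in_scope("https://blog.example.com/post", allowed_domains=["example.com"])
assert not in_scope("https://other.org/", allowed_domains=["example.com"])
assert in_scope("https://anything.net/")  # no restriction: web-scale
```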

By Architecture (System Design)

Defines how the crawler is built and organized internally.

Type Description

  • Centralized - Single control node managing all crawl tasks.
  • Distributed - Multiple coordinated nodes sharing workload and URL queues.
  • Peer-to-Peer - Decentralized; nodes share discovered URLs without a central coordinator.
  • Cloud / Serverless - Uses scalable, ephemeral functions to perform crawl tasks.
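The centralized pattern can be sketched in a few lines: one shared frontier queue, several workers draining it. This is a toy single-process stand-in (thread workers, no real fetching) for the idea of a single control node handing out crawl tasks:

```python
import queue
import threading

def worker(frontier: "queue.Queue[str]", results: list) -> None:
    """One crawl worker pulling tasks from the shared (centralized) frontier."""
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return                      # frontier drained; worker exits
        results.append(url.upper())     # stand-in for fetch + parse
        frontier.task_done()

frontier: "queue.Queue[str]" = queue.Queue()
for u in ["/a", "/b", "/c"]:
    frontier.put(u)

results: list = []
threads = [threading.Thread(target=worker, args=(frontier, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(results) == ["/A", "/B", "/C"]
```

In a distributed crawler the in-process queue becomes an external one (e.g. Redis or a message broker) shared across machines; the worker loop stays essentially the same.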

By Data Access Method

Defines where and how data is retrieved.

Type Description

  • HTML / Page Crawler - Fetches and parses HTML pages.
  • API Crawler - Pulls structured data through APIs.
  • Headless / Rendered Crawler - Uses browsers or headless engines to handle JavaScript.
  • Hybrid Crawler - Combines HTML, API, and headless methods adaptively.
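A hybrid crawler typically routes each response to the right handler based on its media type. A minimal sketch of that dispatch (names are illustrative; the headless path is omitted since it needs a browser engine):

```python
import json

def parse_response(content_type: str, body: str):
    """Route a fetched body to the right parser based on its media type."""
    if "application/json" in content_type:   # API crawler path
        return json.loads(body)
    if "text/html" in content_type:          # HTML crawler path
        return {"html": body}                # would hand off to an HTML parser
    raise ValueError(f"unsupported content type: {content_type}")

assert parse_response("application/json; charset=utf-8", '{"id": 1}') == {"id": 1}
assert parse_response("text/html", "<p>hi</p>") == {"html": "<p>hi</p>"}
```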

By Ethical or Policy Behavior

Defines how the crawler interacts with site policies.

Type Description

  • Polite Crawler - Obeys robots.txt, rate limits, and crawl delays.
  • Aggressive Crawler - Ignores some restrictions (not recommended).
  • Authenticated Crawler - Operates within login-required environments.
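Python's standard library already covers the core of polite behavior: `urllib.robotparser` answers "may I fetch this URL?" and exposes any declared crawl delay. A self-contained sketch parsing an inline robots.txt (the user-agent string "smartspider" here is just an example):

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # in production, rp.set_url(...) + rp.read()

assert rp.can_fetch("smartspider", "https://example.com/public/page")
assert not rp.can_fetch("smartspider", "https://example.com/private/data")
assert rp.crawl_delay("smartspider") == 2   # seconds to sleep between requests
```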

By Intelligence Level

Defines how much decision-making or AI the crawler uses.

Type Description

  • Rule-Based - Follows fixed rules or regex filters.
  • Heuristic-Based - Uses handcrafted scoring or relevance functions.
  • ML-Guided - Uses machine learning for link scoring or prioritization.
  • LLM-Guided - Uses large language models to interpret pages and guide navigation.
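A heuristic-based crawler scores each outgoing link with a handcrafted function before queueing it. A minimal sketch (weights and keywords are made up for illustration): keyword hits in anchor text count more than hits in the URL path.

```python
def score_link(url: str, anchor_text: str, keywords=("crawler", "python")) -> float:
    """Handcrafted relevance score: keyword hits weighted by where they appear."""
    text = anchor_text.lower()
    path = url.lower()
    score = 0.0
    for kw in keywords:
        if kw in text:
            score += 2.0   # anchor text is a strong relevance signal
        if kw in path:
            score += 1.0   # the URL path is a weaker one
    return score

links = [
    ("https://example.com/about", "About us"),
    ("https://example.com/python-crawler", "A Python crawler tutorial"),
]
best = max(links, key=lambda link: score_link(*link))
assert best[0] == "https://example.com/python-crawler"
```

An ML-guided crawler replaces this hand-tuned function with a trained model; an LLM-guided one asks a language model to judge page relevance directly.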

By Integration Role (System Position)

Defines how the crawler fits into the larger data pipeline.

Type Description

  • Standalone Crawler - Operates independently, outputs raw pages or URLs.
  • Coupled Crawler - Integrates scraping logic directly inside the crawl loop.
  • Modular Crawler - Works with separate scraper, parser, and storage components.
  • Streaming Crawler - Feeds data continuously into real-time pipelines (Kafka, etc.).
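The modular role means fetch, parse, and store are independent components wired together, so any one can be swapped out. A toy sketch of that separation (the fetcher is stubbed; all names are illustrative, not the smartspider API):

```python
import re

def fetch(url: str) -> str:
    """Crawler component: stubbed network fetch returning fake HTML."""
    return f"<h1>{url}</h1>"

def parse(html: str) -> dict:
    """Scraper/parser component: extract the title from the page."""
    m = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": m.group(1) if m else None}

def store(record: dict, db: list) -> None:
    """Storage component: append to a sink (a list standing in for a database)."""
    db.append(record)

db: list = []
for url in ["https://example.com/a", "https://example.com/b"]:
    store(parse(fetch(url)), db)

assert db[0]["title"] == "https://example.com/a"
assert len(db) == 2
```

A streaming crawler keeps the same shape but replaces `store` with a producer that publishes each record to a pipeline such as Kafka.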


Download files

Download the file for your platform.

Source Distribution

smartspider-0.0.3.tar.gz (11.1 kB)

Uploaded Source

Built Distribution


smartspider-0.0.3-py3-none-any.whl (11.5 kB)

Uploaded Python 3

File details

Details for the file smartspider-0.0.3.tar.gz.

File metadata

  • Download URL: smartspider-0.0.3.tar.gz
  • Upload date:
  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for smartspider-0.0.3.tar.gz
Algorithm Hash digest
SHA256 3bfaa3b50cbe41556007ac3a6f98d669a82a9871d581111e39ffc10327af1022
MD5 b8c8c07866813b75bccd2c0f66526e51
BLAKE2b-256 1c3a7d66ae5a5495f6fd2a5ae0d1f07b37d7237cb5b9968a9a5600f734f8541f


File details

Details for the file smartspider-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: smartspider-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for smartspider-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 7ecd69aa982c76d8131ef4ac524e9546af24ad4b1acf08bd9fd7de498e271ec6
MD5 550db24b540dbf97d270dfca5752d620
BLAKE2b-256 87f519a2c09787b4abc2efa6cd10df4df3bbaa6f1d0e12fcd8979189f81058cf

