
A modular adaptive web crawler framework with incremental, continuous, and scope-limited crawling support.

Project description

Summary — The 9 Foundational Crawler Categories

Category            | Core Question It Answers        | Example Type
--------------------|---------------------------------|------------------------
Purpose             | Why is it crawling?             | Scraping crawler
Crawl Strategy      | How does it traverse pages?     | BFS, DFS, Focused
Scheduling Behavior | When does it crawl?             | Incremental, Continuous
Architecture        | How is it organized?            | Distributed crawler
Scope               | How wide does it crawl?         | Site-wide crawler
Data Access Method  | Where does it get data from?    | API crawler
Ethical Behavior    | How does it treat site rules?   | Polite crawler
Intelligence Level  | How does it make decisions?     | LLM-guided crawler
Integration Role    | How does it fit into a pipeline?| Modular crawler

By Purpose (Goal or Intent)

Defines why the crawler exists — what it aims to achieve.

Type Description

  • Discovery Crawler - Finds and collects URLs or metadata; builds an index or frontier.
  • Scraping Crawler - Extracts structured data from pages (e.g., tables, text, entities).
  • Monitoring Crawler - Tracks updates or changes in content over time.
  • Archival Crawler - Saves full page copies for preservation or offline analysis.
  • Testing / Auditing Crawler - Used for SEO, broken-link checking, or site compliance validation.
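A monitoring crawler's core step, tracking updates over time, can be as simple as fingerprinting each page body and comparing it with the last stored fingerprint. A minimal sketch (the function and variable names are illustrative, not part of smartspider's API):

```python
import hashlib

def page_changed(url, body, fingerprints):
    """Monitoring-crawler step: hash the page body and compare it with
    the stored fingerprint to decide whether the content changed.

    fingerprints is a dict mapping URL -> last seen SHA-256 digest;
    it is updated in place.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    changed = fingerprints.get(url) != digest
    fingerprints[url] = digest
    return changed
```

The same fingerprint store can feed an incremental scheduler: only URLs whose content actually changed need a shorter revisit interval.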

By Crawl Strategy (Traversal Logic)

Defines how pages are selected or ordered during crawling.

Type Description

  • Breadth-First (BFS) - Crawl all pages at one depth before moving deeper.
  • Depth-First (DFS) - Crawl one path as deep as possible before backtracking.
  • Priority-Based - Assign numerical priority scores to URLs.
  • Adaptive - Adjusts strategy dynamically based on feedback or results.
  • Context-Aware - Use HTML structure and semantics to guide crawling decisions.
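The BFS/DFS distinction comes down to which end of the frontier the next URL is taken from: the front for breadth-first, the back for depth-first. A minimal sketch over a toy link graph (the graph and function names are illustrative):

```python
from collections import deque

def crawl_order(links_by_page, start, strategy="bfs"):
    """Return the visit order for a toy link graph under BFS or DFS.

    links_by_page maps each URL to the list of links found on that page.
    """
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS takes from the front of the frontier, DFS from the back.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in links_by_page.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

A priority-based crawler replaces the deque with a heap keyed on a score; the adaptive and context-aware variants differ only in how that score is computed.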

By Scheduling Behavior (Temporal Logic)

Defines when and how often the crawler operates.

Type Description

  • One-Shot - Runs once and stops after completion.
  • Incremental - Re-crawls known pages periodically to detect updates.
  • Continuous - Never stops; constantly cycles through crawl/revisit loops.
  • Event-Driven - Triggered by signals such as webhooks or detected changes.
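The incremental pattern boils down to tracking when each known page was last crawled and re-queueing the stale ones. A minimal sketch (the one-hour interval and names are illustrative assumptions):

```python
import time

REVISIT_INTERVAL = 3600  # assumed revisit policy: one hour between re-crawls

def due_for_recrawl(last_crawled, now=None, interval=REVISIT_INTERVAL):
    """Return the URLs whose last crawl is older than the revisit interval.

    last_crawled maps URL -> UNIX timestamp of the last fetch.
    """
    now = time.time() if now is None else now
    return [url for url, ts in last_crawled.items() if now - ts >= interval]
```

A continuous crawler simply calls a function like this in a loop; an event-driven one replaces the polling loop with a webhook or change-signal trigger.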

By Scope (Coverage Target)

Defines how broadly the crawler explores the web.

Type Description

  • Focused - Crawls only relevant pages based on topic or keyword.
  • Site-Wide - Restricted to one domain or subdomain.
  • Multi-Domain - Crawls a fixed list of domains.
  • Vertical / Domain-Specific - Focused on one industry (e.g., sports, jobs, e-commerce).
  • Web-Scale - Crawls the entire public web; search-engine level.
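Site-wide and multi-domain scoping is usually enforced with a host check applied to every discovered link before it enters the frontier. A minimal sketch (function name and domain handling are illustrative):

```python
from urllib.parse import urlparse

def in_scope(url, allowed_domains):
    """Scope filter: keep a URL only if its host is one of the allowed
    domains or a subdomain of one (site-wide / multi-domain crawling)."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Matching on the parsed host (rather than substring-searching the whole URL) avoids a classic bug where `https://evil.com/example.com` slips through the filter.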

By Architecture (System Design)

Defines how the crawler is built and organized internally.

Type Description

  • Centralized - Single control node managing all crawl tasks.
  • Distributed - Multiple coordinated nodes sharing workload and URL queues.
  • Peer-to-Peer - Decentralized; nodes share discovered URLs without a central coordinator.
  • Cloud / Serverless - Uses scalable, ephemeral functions to perform crawl tasks.
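The centralized pattern in miniature: one shared URL queue drained by several workers. In a truly distributed crawler the queue would live in an external service, but the shape is the same. A toy sketch (names are illustrative):

```python
import queue
import threading

def run_workers(urls, fetch, n_workers=4):
    """One coordinator-owned URL queue, several worker threads draining it.

    fetch is any callable taking a URL and returning its page content.
    """
    q = queue.Queue()
    for url in urls:
        q.put(url)

    results = {}
    lock = threading.Lock()  # protects the shared results dict

    def worker():
        while True:
            try:
                url = q.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            page = fetch(url)
            with lock:
                results[url] = page
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Swapping `queue.Queue` for a networked queue (Redis, SQS, etc.) and the threads for separate processes is essentially the step from centralized to distributed.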

By Data Access Method

Defines where and how data is retrieved.

Type Description

  • HTML / Page Crawler - Fetches and parses HTML pages.
  • API Crawler - Pulls structured data through APIs.
  • Headless / Rendered Crawler - Uses browsers or headless engines to handle JavaScript.
  • Hybrid Crawler - Combines HTML, API, and headless methods adaptively.
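The parsing half of an HTML/page crawler, extracting the links that feed the frontier, needs nothing beyond the standard library. A minimal sketch using `html.parser` (class and function names are illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Pages that build their links with JavaScript yield nothing from this parser; that gap is exactly what the headless/rendered variant exists to close.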

By Ethical or Policy Behavior

Defines how the crawler interacts with site policies.

Type Description

  • Polite Crawler - Obeys robots.txt, rate limits, and crawl delays.
  • Aggressive Crawler - Ignores some restrictions (not recommended).
  • Authenticated Crawler - Operates within login-required environments.
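Obeying robots.txt requires no third-party code: the standard library's `urllib.robotparser` evaluates the rules. A minimal sketch assuming the robots.txt body has already been fetched (the agent name is an illustrative assumption, not smartspider's actual user agent):

```python
import urllib.robotparser

def allowed(robots_txt, url, agent="my-crawler"):
    """Polite-crawler check: is this URL fetchable for the given agent
    under the site's robots.txt rules?

    robots_txt is the already-downloaded file body as a string.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

A polite crawler would also honor any `Crawl-delay` directive (available via `RobotFileParser.crawl_delay(agent)`) by sleeping between requests to the same host.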

By Intelligence Level

Defines how much decision-making or AI the crawler uses.

Type Description

  • Rule-Based - Follows fixed rules or regex filters.
  • Heuristic-Based - Uses handcrafted scoring or relevance functions.
  • ML-Guided - Uses machine learning for link scoring or prioritization.
  • LLM-Guided - Uses large language models to interpret pages and guide navigation.
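A heuristic-based crawler ranks discovered links with a handcrafted relevance function before queueing them. A minimal sketch (the weights are illustrative, not taken from the library):

```python
def score_link(url, anchor_text, keywords):
    """Handcrafted relevance score for a discovered link.

    A keyword hit in the URL counts double one in the anchor text
    (arbitrary weights chosen for illustration).
    """
    url_l, text_l = url.lower(), anchor_text.lower()
    score = 0.0
    for kw in keywords:
        if kw in url_l:
            score += 2.0
        if kw in text_l:
            score += 1.0
    return score
```

The ML- and LLM-guided tiers keep the same interface, a link in, a priority out, but replace the handcrafted function with a learned model or a language-model judgment.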

By Integration Role (System Position)

Defines how the crawler fits into the larger data pipeline.

Type Description

  • Standalone Crawler - Operates independently, outputs raw pages or URLs.
  • Coupled Crawler - Integrates scraping logic directly inside the crawl loop.
  • Modular Crawler - Works with separate scraper, parser, and storage components.
  • Streaming Crawler - Feeds data continuously into real-time pipelines (Kafka, etc.).
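The modular arrangement keeps the crawl loop as a thin coordinator over separate, swappable fetch, parse, and store components. A minimal sketch (the function names are illustrative, not smartspider's actual interface):

```python
def run_pipeline(urls, fetch, parse, store):
    """Modular crawl loop: coordination only; fetching, parsing, and
    storage are injected as independent callables."""
    for url in urls:
        raw = fetch(url)       # e.g. HTTP client, headless browser, API call
        record = parse(raw)    # e.g. link extraction or field scraping
        store(url, record)     # e.g. database write or message-queue publish
```

Replacing `store` with a Kafka producer turns this same loop into the streaming variant; a coupled crawler, by contrast, would inline the parsing logic directly in the loop body.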

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

smartspider-0.0.2.tar.gz (10.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

smartspider-0.0.2-py3-none-any.whl (9.8 kB)

Uploaded Python 3

File details

Details for the file smartspider-0.0.2.tar.gz.

File metadata

  • Download URL: smartspider-0.0.2.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for smartspider-0.0.2.tar.gz
Algorithm Hash digest
SHA256 b95b5f07698051b02e9c7c86b4a60c9a18f3668f3ed0efb859da1b030fa12f78
MD5 d93aec71b5ef62d991413d7b59543710
BLAKE2b-256 2576440c25ec451a5512706d8b6c34d0c05b4105e5838cc450728aeb9395f2f9

See more details on using hashes here.

File details

Details for the file smartspider-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: smartspider-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 9.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for smartspider-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 89de17a58dca45a3f5a88a75c12131941a9598743d37a46ed1a1e34748487c6a
MD5 637fbb30e2f57d570d0f9907a48ed78b
BLAKE2b-256 6116d750c278fa64e8923cfd7d58473a3ea6a4e38511852b469440bd2e029249

