A modular adaptive web crawler framework with incremental, continuous, and scope-limited crawling support.
Project description
Summary — The 9 Foundational Crawler Categories
| Category | Core Question It Answers | Example Type |
|---|---|---|
| Purpose | Why is it crawling? | Scraping crawler |
| Crawl Strategy | How does it traverse pages? | BFS, DFS, Focused |
| Scheduling Behavior | When does it crawl? | Incremental, Continuous |
| Architecture | How is it organized? | Distributed crawler |
| Scope | How wide does it crawl? | Site-wide crawler |
| Data Access Method | Where does it get data from? | API crawler |
| Ethical Behavior | How does it treat site rules? | Polite crawler |
| Intelligence Level | How does it make decisions? | LLM-guided crawler |
| Integration Role | How does it fit into a pipeline? | Modular crawler |
By Purpose (Goal or Intent)
Defines why the crawler exists — what it aims to achieve.
| Type | Description |
|---|---|
| Discovery Crawler | Finds and collects URLs or metadata; builds an index or frontier. |
| Scraping Crawler | Extracts structured data from pages (e.g., tables, text, entities). |
| Monitoring Crawler | Tracks updates or changes in content over time. |
| Archival Crawler | Saves full page copies for preservation or offline analysis. |
| Testing / Auditing Crawler | Used for SEO, broken-link checking, or site compliance validation. |
By Crawl Strategy (Traversal Logic)
Defines how pages are selected or ordered during crawling.
| Type | Description |
|---|---|
| Breadth-First (BFS) | Crawls all pages at one depth before moving deeper. |
| Depth-First (DFS) | Crawls one path as deep as possible before backtracking. |
| Priority-Based | Assigns numerical priority scores to URLs. |
| Adaptive | Adjusts its strategy dynamically based on feedback or results. |
| Context-Aware | Uses HTML structure and semantics to guide crawling decisions. |
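BFS and DFS differ only in which end of the frontier is popped: a queue gives breadth-first order, a stack gives depth-first. A minimal sketch in Python over a toy in-memory link graph (the graph and `crawl` function are illustrative, not part of smartspider's API):

```python
from collections import deque

# Toy link graph standing in for fetched pages (hypothetical data).
GRAPH = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl(start, strategy="bfs"):
    """Traverse the graph; 'bfs' pops from the left (queue), 'dfs' from the right (stack)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

A priority-based strategy would replace the deque with a heap keyed on a score; adaptive strategies re-order the frontier as feedback arrives.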
By Scheduling Behavior (Temporal Logic)
Defines when and how often the crawler operates.
| Type | Description |
|---|---|
| One-Shot | Runs once and stops after completion. |
| Incremental | Re-crawls known pages periodically to detect updates. |
| Continuous | Never stops; cycles through crawl/revisit loops indefinitely. |
| Event-Driven | Triggered by signals such as webhooks or detected changes. |
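An incremental scheduler reduces to a per-URL "due for re-crawl" test against the last visit time. A sketch under assumed names (`due_urls` is illustrative, not smartspider's actual scheduler):

```python
import time

def due_urls(last_crawled, interval_seconds, now=None):
    """Return the known URLs whose re-crawl interval has elapsed.

    last_crawled maps url -> unix timestamp of the last visit.
    """
    now = time.time() if now is None else now
    return [url for url, ts in last_crawled.items()
            if now - ts >= interval_seconds]
```

A continuous crawler would call this in a loop; an event-driven one would skip the timer entirely and enqueue URLs when a webhook fires.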
By Scope (Coverage Target)
Defines how broadly the crawler explores the web.
| Type | Description |
|---|---|
| Focused | Crawls only relevant pages based on topic or keywords. |
| Site-Wide | Restricted to one domain or subdomain. |
| Multi-Domain | Crawls a fixed list of domains. |
| Vertical / Domain-Specific | Focused on one industry (e.g., sports, jobs, e-commerce). |
| Web-Scale | Crawls the entire public web; search-engine scale. |
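Scope is typically enforced with a URL filter applied before a link enters the frontier. A sketch of a site-wide / multi-domain check (the function name is an assumption for illustration):

```python
from urllib.parse import urlparse

def in_scope(url, allowed_domains):
    """Keep a URL only if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Matching on `"." + domain` rather than a bare suffix avoids accepting look-alike hosts such as `notexample.com` when the scope is `example.com`.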
By Architecture (System Design)
Defines how the crawler is built and organized internally.
| Type | Description |
|---|---|
| Centralized | Single control node managing all crawl tasks. |
| Distributed | Multiple coordinated nodes sharing workload and URL queues. |
| Peer-to-Peer | Decentralized; nodes share discovered URLs without a central coordinator. |
| Cloud / Serverless | Uses scalable, ephemeral functions to perform crawl tasks. |
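The core of the distributed design, several workers draining a shared frontier, can be shown in miniature with threads and an in-process queue. This is only a sketch: a real deployment would replace `queue.Queue` with a networked queue (e.g., Redis or Kafka), and the fetch here is faked so the example runs offline:

```python
import queue
import threading

def distributed_crawl(urls, n_workers=3):
    """Miniature distributed crawl: n_workers threads share one URL queue."""
    frontier = queue.Queue()
    for u in urls:
        frontier.put(u)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get_nowait()
            except queue.Empty:
                return  # frontier drained; worker exits
            page = f"<html>{url}</html>"  # stand-in for an HTTP fetch
            with lock:
                results.append((url, page))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each URL is processed exactly once because the queue hands it to only one worker, which is the same invariant a distributed frontier must preserve across nodes.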
By Data Access Method
Defines where and how data is retrieved.
| Type | Description |
|---|---|
| HTML / Page Crawler | Fetches and parses HTML pages. |
| API Crawler | Pulls structured data through APIs. |
| Headless / Rendered Crawler | Uses browsers or headless engines to handle JavaScript. |
| Hybrid Crawler | Combines HTML, API, and headless methods adaptively. |
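A hybrid crawler can be sketched as a fallback chain: try the cheapest access method first (API), then plain HTML, then a headless browser. The fetchers are injected as callables so the strategy is testable without a network; all names here are illustrative:

```python
def hybrid_fetch(url, fetchers):
    """Try each (name, fetch) pair in order; return the first usable result.

    A fetcher signals a miss by returning None or raising an exception.
    """
    for name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            continue
        if result is not None:
            return name, result
    raise RuntimeError(f"all access methods failed for {url}")
```

In practice the ordering encodes cost: API calls are cheapest, headless rendering is the expensive last resort.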
By Ethical or Policy Behavior
Defines how the crawler interacts with site policies.
| Type | Description |
|---|---|
| Polite Crawler | Obeys robots.txt, rate limits, and crawl delays. |
| Aggressive Crawler | Ignores some restrictions (not recommended). |
| Authenticated Crawler | Operates within login-required environments. |
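Python's standard library already covers the polite-crawler basics. A sketch using `urllib.robotparser` against a hardcoded example robots.txt (the file body and `make_policy` helper are assumptions for illustration):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def make_policy(robots_text):
    """Build a policy object answering 'may I fetch this?' and 'how fast?'."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp
```

A polite crawl loop would check `policy.can_fetch(agent, url)` before every request and sleep for `policy.crawl_delay(agent)` seconds between requests to the same host.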
By Intelligence Level
Defines how much decision-making or AI the crawler uses.
| Type | Description |
|---|---|
| Rule-Based | Follows fixed rules or regex filters. |
| Heuristic-Based | Uses handcrafted scoring or relevance functions. |
| ML-Guided | Uses machine learning for link scoring or prioritization. |
| LLM-Guided | Uses large language models to interpret pages and guide navigation. |
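A heuristic-based crawler scores links before enqueueing them. A toy scoring scheme (not smartspider's actual formula): keyword hits in the anchor text count double those in the URL path.

```python
def score_link(url, anchor_text, keywords):
    """Handcrafted relevance score: anchor-text matches weigh 2, URL matches 1."""
    text, path = anchor_text.lower(), url.lower()
    return (sum(2 for k in keywords if k in text)
            + sum(1 for k in keywords if k in path))

def prioritize(links, keywords):
    """Priority-based frontier ordering: highest-scoring (url, anchor) first."""
    return sorted(links, key=lambda l: score_link(l[0], l[1], keywords),
                  reverse=True)
```

An ML-guided crawler keeps the same shape but swaps the handcrafted score for a learned model; an LLM-guided one might ask a model to rate the anchor text directly.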
By Integration Role (System Position)
Defines how the crawler fits into the larger data pipeline.
| Type | Description |
|---|---|
| Standalone Crawler | Operates independently; outputs raw pages or URLs. |
| Coupled Crawler | Integrates scraping logic directly inside the crawl loop. |
| Modular Crawler | Works with separate scraper, parser, and storage components. |
| Streaming Crawler | Feeds data continuously into real-time pipelines (Kafka, etc.). |
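The modular role can be sketched as three swappable stages wired together: a crawler that yields pages, a parser, and a storage backend. Everything below is a stand-in (the fetch is faked and the parser is deliberately naive) to show the wiring, not smartspider's real components:

```python
def crawl_pages(urls):
    """Crawler stage: yields (url, html) pairs; a real one would fetch over HTTP."""
    for url in urls:
        yield url, f"<title>Page {url}</title>"

def parse(html):
    """Parser stage: naive <title> extraction for the sketch."""
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")]

def run_pipeline(urls, store):
    """Wiring: crawler -> parser -> storage, each independently replaceable."""
    for url, html in crawl_pages(urls):
        store[url] = parse(html)
    return store
```

A streaming variant keeps the same stages but replaces the `store` dict with a producer that publishes each parsed record to a topic as it arrives.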
Download files
Source Distribution
Built Distribution
File details
Details for the file smartspider-0.0.2.tar.gz.
File metadata
- Download URL: smartspider-0.0.2.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b95b5f07698051b02e9c7c86b4a60c9a18f3668f3ed0efb859da1b030fa12f78 |
| MD5 | d93aec71b5ef62d991413d7b59543710 |
| BLAKE2b-256 | 2576440c25ec451a5512706d8b6c34d0c05b4105e5838cc450728aeb9395f2f9 |
File details
Details for the file smartspider-0.0.2-py3-none-any.whl.
File metadata
- Download URL: smartspider-0.0.2-py3-none-any.whl
- Upload date:
- Size: 9.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89de17a58dca45a3f5a88a75c12131941a9598743d37a46ed1a1e34748487c6a |
| MD5 | 637fbb30e2f57d570d0f9907a48ed78b |
| BLAKE2b-256 | 6116d750c278fa64e8923cfd7d58473a3ea6a4e38511852b469440bd2e029249 |