A modular adaptive web crawler framework with incremental, continuous, and scope-limited crawling support.
Project description
Summary — The 9 Foundational Crawler Categories
| Category | Core Question It Answers | Example Type |
|---|---|---|
| Purpose | Why is it crawling? | Scraping crawler |
| Crawl Strategy | How does it traverse pages? | BFS, DFS, Focused |
| Scheduling Behavior | When does it crawl? | Incremental, Continuous |
| Architecture | How is it organized? | Distributed crawler |
| Scope | How wide does it crawl? | Site-wide crawler |
| Data Access Method | Where does it get data from? | API crawler |
| Ethical Behavior | How does it treat site rules? | Polite crawler |
| Intelligence Level | How does it make decisions? | LLM-guided crawler |
| Integration Role | How does it fit into a pipeline? | Modular crawler |
By Purpose (Goal or Intent)
Defines why the crawler exists — what it aims to achieve.
| Type | Description |
|---|---|
| Discovery Crawler | Finds and collects URLs or metadata; builds an index or frontier. |
| Scraping Crawler | Extracts structured data from pages (e.g., tables, text, entities). |
| Monitoring Crawler | Tracks updates or changes in content over time. |
| Archival Crawler | Saves full page copies for preservation or offline analysis. |
| Testing / Auditing Crawler | Used for SEO, broken-link checking, or site compliance validation. |
By Crawl Strategy (Traversal Logic)
Defines how pages are selected or ordered during crawling.
| Type | Description |
|---|---|
| Breadth-First (BFS) | Crawls all pages at one depth before moving deeper. |
| Depth-First (DFS) | Crawls one path as deep as possible before backtracking. |
| Priority-Based | Assigns numerical priority scores to URLs. |
| Adaptive | Adjusts its strategy dynamically based on feedback or results. |
| Context-Aware | Uses HTML structure and semantics to guide crawling decisions. |
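BFS and DFS differ only in which end of the frontier is popped: a queue gives breadth-first order, a stack gives depth-first. A minimal sketch in Python over a toy in-memory link graph (the graph and `crawl` function are illustrative, not part of smartspider's API):

```python
from collections import deque

# Toy link graph standing in for fetched pages (hypothetical data).
GRAPH = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl(start, strategy="bfs"):
    """Traverse the graph; 'bfs' pops from the left (queue), 'dfs' from the right (stack)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

A priority-based strategy would replace the deque with a heap keyed on a score; adaptive strategies re-order the frontier as feedback arrives.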
By Scheduling Behavior (Temporal Logic)
Defines when and how often the crawler operates.
| Type | Description |
|---|---|
| One-Shot | Runs once and stops after completion. |
| Incremental | Re-crawls known pages periodically to detect updates. |
| Continuous | Never stops; cycles through crawl/revisit loops indefinitely. |
| Event-Driven | Triggered by signals such as webhooks or detected changes. |
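An incremental scheduler reduces to a per-URL "due for re-crawl" test against the last visit time. A sketch under assumed names (`due_urls` is illustrative, not smartspider's actual scheduler):

```python
import time

def due_urls(last_crawled, interval_seconds, now=None):
    """Return the known URLs whose re-crawl interval has elapsed.

    last_crawled maps url -> unix timestamp of the last visit.
    """
    now = time.time() if now is None else now
    return [url for url, ts in last_crawled.items()
            if now - ts >= interval_seconds]
```

A continuous crawler would call this in a loop; an event-driven one would skip the timer entirely and enqueue URLs when a webhook fires.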
By Scope (Coverage Target)
Defines how broadly the crawler explores the web.
| Type | Description |
|---|---|
| Focused | Crawls only relevant pages based on topic or keywords. |
| Site-Wide | Restricted to one domain or subdomain. |
| Multi-Domain | Crawls a fixed list of domains. |
| Vertical / Domain-Specific | Focused on one industry (e.g., sports, jobs, e-commerce). |
| Web-Scale | Crawls the entire public web; search-engine scale. |
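Scope is typically enforced with a URL filter applied before a link enters the frontier. A sketch of a site-wide / multi-domain check (the function name is an assumption for illustration):

```python
from urllib.parse import urlparse

def in_scope(url, allowed_domains):
    """Keep a URL only if its host is an allowed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Matching on `"." + domain` rather than a bare suffix avoids accepting look-alike hosts such as `notexample.com` when the scope is `example.com`.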
By Architecture (System Design)
Defines how the crawler is built and organized internally.
| Type | Description |
|---|---|
| Centralized | Single control node managing all crawl tasks. |
| Distributed | Multiple coordinated nodes sharing workload and URL queues. |
| Peer-to-Peer | Decentralized; nodes share discovered URLs without a central coordinator. |
| Cloud / Serverless | Uses scalable, ephemeral functions to perform crawl tasks. |
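The core of the distributed design, several workers draining a shared frontier, can be shown in miniature with threads and an in-process queue. This is only a sketch: a real deployment would replace `queue.Queue` with a networked queue (e.g., Redis or Kafka), and the fetch here is faked so the example runs offline:

```python
import queue
import threading

def distributed_crawl(urls, n_workers=3):
    """Miniature distributed crawl: n_workers threads share one URL queue."""
    frontier = queue.Queue()
    for u in urls:
        frontier.put(u)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get_nowait()
            except queue.Empty:
                return  # frontier drained; worker exits
            page = f"<html>{url}</html>"  # stand-in for an HTTP fetch
            with lock:
                results.append((url, page))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each URL is processed exactly once because the queue hands it to only one worker, which is the same invariant a distributed frontier must preserve across nodes.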
By Data Access Method
Defines where and how data is retrieved.
| Type | Description |
|---|---|
| HTML / Page Crawler | Fetches and parses HTML pages. |
| API Crawler | Pulls structured data through APIs. |
| Headless / Rendered Crawler | Uses browsers or headless engines to handle JavaScript. |
| Hybrid Crawler | Combines HTML, API, and headless methods adaptively. |
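A hybrid crawler can be sketched as a fallback chain: try the cheapest access method first (API), then plain HTML, then a headless browser. The fetchers are injected as callables so the strategy is testable without a network; all names here are illustrative:

```python
def hybrid_fetch(url, fetchers):
    """Try each (name, fetch) pair in order; return the first usable result.

    A fetcher signals a miss by returning None or raising an exception.
    """
    for name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            continue
        if result is not None:
            return name, result
    raise RuntimeError(f"all access methods failed for {url}")
```

In practice the ordering encodes cost: API calls are cheapest, headless rendering is the expensive last resort.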
By Ethical or Policy Behavior
Defines how the crawler interacts with site policies.
| Type | Description |
|---|---|
| Polite Crawler | Obeys robots.txt, rate limits, and crawl delays. |
| Aggressive Crawler | Ignores some restrictions (not recommended). |
| Authenticated Crawler | Operates within login-required environments. |
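Python's standard library already covers the polite-crawler basics. A sketch using `urllib.robotparser` against a hardcoded example robots.txt (the file body and `make_policy` helper are assumptions for illustration):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def make_policy(robots_text):
    """Build a policy object answering 'may I fetch this?' and 'how fast?'."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp
```

A polite crawl loop would check `policy.can_fetch(agent, url)` before every request and sleep for `policy.crawl_delay(agent)` seconds between requests to the same host.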
By Intelligence Level
Defines how much decision-making or AI the crawler uses.
| Type | Description |
|---|---|
| Rule-Based | Follows fixed rules or regex filters. |
| Heuristic-Based | Uses handcrafted scoring or relevance functions. |
| ML-Guided | Uses machine learning for link scoring or prioritization. |
| LLM-Guided | Uses large language models to interpret pages and guide navigation. |
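A heuristic-based crawler scores links before enqueueing them. A toy scoring scheme (not smartspider's actual formula): keyword hits in the anchor text count double those in the URL path.

```python
def score_link(url, anchor_text, keywords):
    """Handcrafted relevance score: anchor-text matches weigh 2, URL matches 1."""
    text, path = anchor_text.lower(), url.lower()
    return (sum(2 for k in keywords if k in text)
            + sum(1 for k in keywords if k in path))

def prioritize(links, keywords):
    """Priority-based frontier ordering: highest-scoring (url, anchor) first."""
    return sorted(links, key=lambda l: score_link(l[0], l[1], keywords),
                  reverse=True)
```

An ML-guided crawler keeps the same shape but swaps the handcrafted score for a learned model; an LLM-guided one might ask a model to rate the anchor text directly.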
By Integration Role (System Position)
Defines how the crawler fits into the larger data pipeline.
| Type | Description |
|---|---|
| Standalone Crawler | Operates independently; outputs raw pages or URLs. |
| Coupled Crawler | Integrates scraping logic directly inside the crawl loop. |
| Modular Crawler | Works with separate scraper, parser, and storage components. |
| Streaming Crawler | Feeds data continuously into real-time pipelines (Kafka, etc.). |
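The modular role can be sketched as three swappable stages wired together: a crawler that yields pages, a parser, and a storage backend. Everything below is a stand-in (the fetch is faked and the parser is deliberately naive) to show the wiring, not smartspider's real components:

```python
def crawl_pages(urls):
    """Crawler stage: yields (url, html) pairs; a real one would fetch over HTTP."""
    for url in urls:
        yield url, f"<title>Page {url}</title>"

def parse(html):
    """Parser stage: naive <title> extraction for the sketch."""
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")]

def run_pipeline(urls, store):
    """Wiring: crawler -> parser -> storage, each independently replaceable."""
    for url, html in crawl_pages(urls):
        store[url] = parse(html)
    return store
```

A streaming variant keeps the same stages but replaces the `store` dict with a producer that publishes each parsed record to a topic as it arrives.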
Download files
Source Distribution
Built Distribution
File details
Details for the file smartspider-0.0.2.tar.gz.
File metadata
- Download URL: smartspider-0.0.2.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b95b5f07698051b02e9c7c86b4a60c9a18f3668f3ed0efb859da1b030fa12f78 |
| MD5 | d93aec71b5ef62d991413d7b59543710 |
| BLAKE2b-256 | 2576440c25ec451a5512706d8b6c34d0c05b4105e5838cc450728aeb9395f2f9 |
File details
Details for the file smartspider-0.0.2-py3-none-any.whl.
File metadata
- Download URL: smartspider-0.0.2-py3-none-any.whl
- Upload date:
- Size: 9.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89de17a58dca45a3f5a88a75c12131941a9598743d37a46ed1a1e34748487c6a |
| MD5 | 637fbb30e2f57d570d0f9907a48ed78b |
| BLAKE2b-256 | 6116d750c278fa64e8923cfd7d58473a3ea6a4e38511852b469440bd2e029249 |