CrawlerX - The Ultimate Web Crawler

CrawlerX is a command-line tool designed for security researchers and penetration testers to perform comprehensive web crawling for reconnaissance. It extracts URLs, POST data, directories, files, and resources, while respecting robots.txt and supporting customizable configurations for depth, threading, and output formatting.

✨ Developed by @IMApurbo
🛡️ Use responsibly. Authorized testing only.

Features

Comprehensive Crawling
- Extracts GET URLs with query parameters, POST requests, directories, files, and resources (images, scripts, CSS, etc.).
Robots.txt Compliance
- Respects each site's robots.txt rules to avoid crawling restricted areas.
Customizable Crawling
- Configurable crawling depth (--depth).
- Adjustable delay between requests (--delay).
- Support for crawling subdomains (--sub).
- Exclude specific file extensions (--exclude).
HTTP Customization
- Custom User-Agent (--ua).
- Custom headers (-H/--headers).
- Proxy support (--proxy).
Output Options
- Save results to organized directories (-o/--output).
- Generate ASCII site structure tree (--structure).
- Export results in TXT and JSON formats.
Resource Extraction
- Categorizes resources like images, scripts, and CSS for easy analysis.
Resumable Crawling
- Save and resume crawl state using pickle files (--cont).
Threading Support
- Concurrent crawling with adjustable threads (--threads, max 20).
Robust Error Handling
- Handles network errors, timeouts, and invalid URLs gracefully.
User-Friendly Output
- Detailed console logs with URL types, status, and depth, plus structured file outputs.
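
To illustrate what robots.txt compliance involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from CrawlerX's source; the helper name `build_robots_checker` and the sample rules are assumptions made for the example, and the rules are parsed from memory so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_lines, user_agent="Spidar/1.0"):
    """Return a function that reports whether a URL may be fetched.

    Hypothetical helper for illustration; CrawlerX may do this differently.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already in memory, no request made
    return lambda url: parser.can_fetch(user_agent, url)


# Sample rules (made up for this example).
rules = [
    "User-agent: *",
    "Disallow: /admin/",
]
allowed = build_robots_checker(rules)

print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/admin/login"))
```

A crawler would typically run a check like this before queuing each URL and skip any that the rules disallow.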

Installation

pip install crawlerx

Usage

crawlerx -u <url> [options]

Common Flags

| Short | Long | Description | Required | Default |
|-------|------|-------------|----------|---------|
| `-u` | `--url` | Target URL (e.g., `https://example.com`) | ✅ | - |
| `-o` | `--output` | Output directory for results | ❌ | None (prints to terminal) |
| | `--structure` | Generate ASCII site structure | ❌ | False |
| `-H` | `--headers` | Custom headers (e.g., `Cookie:session=abc;Auth:xyz`) | ❌ | None |
| | `--threads` | Number of concurrent threads (1-20) | ❌ | 1 |
| | `--depth` | Maximum crawling depth | ❌ | 3 |
| | `--ua` | Custom User-Agent string | ❌ | `Spidar/1.0` |
| | `--exclude` | Comma-separated extensions to exclude | ❌ | `jpg,jpeg,png,gif,pdf,css,js` |
| | `--sub` | Include subdomains in crawling | ❌ | False |
| | `--proxy` | Proxy server (e.g., `http://proxy:port`) | ❌ | None |
| | `--timeout` | Request timeout in seconds | ❌ | 5 |
| | `--delay` | Delay between requests in seconds | ❌ | 1.0 |
| | `--cont` | Path to crawl state pickle file to resume | ❌ | None |
| `-h` | `--help` | Show help message and exit | ❌ | - |
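
For readers who want to wrap or test against this interface, the flags and defaults in the table can be mirrored with `argparse`. This is a sketch reconstructed from the table only, not CrawlerX's actual parser. In particular, rejecting thread counts outside 1-20 is an assumption (the tool might clamp the value instead), and `thread_count` and `build_parser` are hypothetical names.

```python
import argparse


def thread_count(value):
    """Accept an integer between 1 and 20 (assumed behavior: reject, not clamp)."""
    number = int(value)
    if not 1 <= number <= 20:
        raise argparse.ArgumentTypeError("threads must be between 1 and 20")
    return number


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="crawlerx")
    parser.add_argument("-u", "--url", required=True, help="Target URL")
    parser.add_argument("-o", "--output", default=None, help="Output directory")
    parser.add_argument("--structure", action="store_true", help="ASCII site structure")
    parser.add_argument("-H", "--headers", default=None, help="Custom headers")
    parser.add_argument("--threads", type=thread_count, default=1)
    parser.add_argument("--depth", type=int, default=3)
    parser.add_argument("--ua", default="Spidar/1.0")
    parser.add_argument("--exclude", default="jpg,jpeg,png,gif,pdf,css,js")
    parser.add_argument("--sub", action="store_true")
    parser.add_argument("--proxy", default=None)
    parser.add_argument("--timeout", type=float, default=5)
    parser.add_argument("--delay", type=float, default=1.0)
    parser.add_argument("--cont", default=None, help="Crawl state pickle to resume")
    return parser


args = build_parser().parse_args(["-u", "https://example.com", "--threads", "10"])
print(args.url, args.threads, args.depth, args.delay)
```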

Examples

Basic Crawl:

crawlerx -u https://example.com

Crawl with Output Directory:

crawlerx -u https://example.com -o ./results

Generate Site Structure:

crawlerx -u https://example.com --structure
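
As a rough idea of how an ASCII site structure can be derived from crawled URLs, the sketch below nests URL path segments and renders them with box-drawing connectors. The functions `build_tree` and `render_tree` are hypothetical, and the exact layout CrawlerX writes may differ.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URL path segments into a dict of dicts."""
    tree = {}
    for url in urls:
        node = tree
        for segment in filter(None, urlparse(url).path.split("/")):
            node = node.setdefault(segment, {})
    return tree


def render_tree(tree, prefix=""):
    """Return the tree as a list of lines, children sorted by name."""
    lines = []
    names = sorted(tree)
    for index, name in enumerate(names):
        last = index == len(names) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        # Children of the last entry get blank padding, others a vertical bar.
        lines.extend(render_tree(tree[name], prefix + ("    " if last else "│   ")))
    return lines


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/about",
]
print("\n".join(render_tree(build_tree(urls))))
```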

Crawl with Custom Headers and Proxy:

crawlerx -u https://example.com -H "Cookie:session=abc123;Auth:Bearer xyz" --proxy http://proxy:8080
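
The header string above packs several headers into one argument, separated by `;`, with `:` between name and value. A minimal sketch of parsing that format follows. `parse_headers` is a hypothetical helper, and how CrawlerX itself handles edge cases (such as a `;` inside a cookie value) is not documented here.

```python
def parse_headers(raw):
    """Split 'Name:value;Name:value' into a dict of header names to values."""
    headers = {}
    for pair in raw.split(";"):
        if not pair.strip():
            continue  # ignore empty pieces such as a trailing ';'
        name, separator, value = pair.partition(":")  # split on the first ':' only
        if not separator:
            raise ValueError(f"missing ':' in header {pair!r}")
        headers[name.strip()] = value.strip()
    return headers


print(parse_headers("Cookie:session=abc123;Auth:Bearer xyz"))
```

Because this sketch splits on every `;`, a header value that itself contains a semicolon (for example, several cookies in one `Cookie` header) would be cut apart.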

Resume Crawl from State:

crawlerx -u https://example.com --cont ./results/spider_example.com/crawl_state.pkl
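
Resuming works by reloading previously saved state. The sketch below shows the general save-and-load pattern with `pickle`. The field names `visited` and `queue` are assumptions for illustration, since the actual contents of `crawl_state.pkl` are not documented here.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(path, visited, queue):
    """Write the crawl state to disk (hypothetical field names)."""
    with open(path, "wb") as handle:
        pickle.dump({"visited": visited, "queue": queue}, handle)


def load_state(path):
    """Read a crawl state back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "crawl_state.pkl"
    save_state(
        state_file,
        visited={"https://example.com/"},
        queue=[("https://example.com/about", 1)],  # (url, depth) pairs
    )
    state = load_state(state_file)

print(state["visited"], state["queue"])
```

Unpickling can execute arbitrary code, so only resume from state files that your own crawls produced.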

Crawl Subdomains with Increased Threads:

crawlerx -u https://example.com --sub --threads 10
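
Conceptually, `--sub` widens the set of hosts considered in scope, and `--threads` sets how many requests run at once. The sketch below shows one way to express both ideas. `in_scope` and `fetch` are hypothetical, `fetch` is a placeholder that makes no network request, and none of this is CrawlerX's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def in_scope(url, root_host, include_subdomains=False):
    """True if the URL's host is the root host or, optionally, a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    if host == root_host:
        return True
    # The leading dot prevents 'notexample.com' from matching 'example.com'.
    return include_subdomains and host.endswith("." + root_host)


def fetch(url):
    """Placeholder for an HTTP request; returns the URL and its length."""
    return url, len(url)


candidates = [
    "https://example.com/",
    "https://api.example.com/v1",
    "https://notexample.com/",
]
targets = [u for u in candidates if in_scope(u, "example.com", include_subdomains=True)]

# Ten workers, matching the --threads 10 example above.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, targets))

print(results)
```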

Exclude Specific Extensions:

crawlerx -u https://example.com --exclude pdf,docx
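
Extension filtering of this kind usually compares the suffix of the URL path against the comma-separated list. The following sketch shows that check. `parse_exclusions` and `is_excluded` are hypothetical names, and the case-insensitive matching is an assumption rather than documented CrawlerX behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_exclusions(raw):
    """Turn 'pdf,docx' into a set of lowercase extensions without dots."""
    return {ext.strip().lower().lstrip(".") for ext in raw.split(",") if ext.strip()}


def is_excluded(url, excluded):
    """True if the URL path ends in one of the excluded extensions."""
    suffix = PurePosixPath(urlparse(url).path).suffix  # e.g. '.pdf', or '' if none
    return suffix.lower().lstrip(".") in excluded


excluded = parse_exclusions("pdf,docx")
for url in (
    "https://example.com/report.PDF?download=1",
    "https://example.com/docs/",
    "https://example.com/index.html",
):
    print(url, is_excluded(url, excluded))
```

Parsing the URL first keeps the query string out of the comparison, so `report.PDF?download=1` is still recognized as a PDF.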

Output Format

When using -o/--output, results are saved inside the output directory in a subdirectory named spider_<domain>, with the following structure:

spider_<domain>/
├── get/
│   ├── get_requests.txt
│   └── get_requests.json
├── post/
│   ├── post_requests.txt
│   └── post_requests.json
├── dir/
│   └── dirs.txt
├── files/
│   ├── files.txt
│   ├── images.txt
│   ├── images.json
│   ├── scripts.txt
│   ├── scripts.json
│   ├── css.txt
│   ├── css.json
│   ├── other.txt
│   └── other.json
├── structure/
│   └── structure.txt
└── crawl_state.pkl
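
Once a crawl has finished, a short script can summarize the results directory. The sketch below counts non-empty lines in every `.txt` file, assuming each TXT file holds one entry per line (the file format is not specified here, so treat that as an assumption). `count_entries` is a hypothetical helper, and the demo builds a small stand-in directory rather than reading real crawl output.

```python
import tempfile
from pathlib import Path


def count_entries(result_dir):
    """Map each .txt file (path relative to result_dir) to its non-empty line count."""
    root = Path(result_dir)
    counts = {}
    for txt_file in sorted(root.rglob("*.txt")):
        with open(txt_file, encoding="utf-8") as handle:
            key = txt_file.relative_to(root).as_posix()
            counts[key] = sum(1 for line in handle if line.strip())
    return counts


# Build a tiny stand-in for spider_<domain>/ so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "spider_example.com"
    (demo / "get").mkdir(parents=True)
    (demo / "dir").mkdir()
    (demo / "get" / "get_requests.txt").write_text(
        "https://example.com/?q=1\nhttps://example.com/?q=2\n", encoding="utf-8"
    )
    (demo / "dir" / "dirs.txt").write_text("https://example.com/blog/\n", encoding="utf-8")
    summary = count_entries(demo)

print(summary)
```

Against real output, the call would look like `count_entries("./results/spider_example.com")`.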

Legal Notice

Use only on systems you have explicit permission to test.
Misuse may violate laws and ethical guidelines.

Credits

Developed by IMApurbo

📃 License

This project is licensed under the MIT License. See the LICENSE file for details.


Release history

1.1.1 (Dec 13, 2025)
1.1.0 (Jun 18, 2025)

Download files

Source distribution: none available for this release.

Built distribution: crawlerx-1.1.0-py3-none-any.whl (9.8 kB), uploaded Jun 18, 2025, Python 3.

File metadata

Download URL: crawlerx-1.1.0-py3-none-any.whl
Upload date: Jun 18, 2025
Size: 9.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for crawlerx-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`90ce247a6bf8c5aebedb791d6393b54392527329a62757fe521740bb29c0bcc8`
MD5	`316801f0d2edb9532f51794ff58e24ac`
BLAKE2b-256	`947c9e9107a4910b740a28f9b2229ec6aea44211938407cd06ac5d1bb1f98b3e`

