CrawlerX - The Ultimate Web Crawler
CrawlerX is a command-line tool designed for security researchers and penetration testers to perform comprehensive web crawling for reconnaissance. It extracts URLs, POST data, directories, files, and resources, while respecting robots.txt and supporting customizable configurations for depth, threading, and output formatting.
Developed by @IMApurbo
Use responsibly. Authorized testing only.
Features
- Comprehensive Crawling: Extracts GET URLs with query parameters, POST requests, directories, files, and resources (images, scripts, CSS, etc.).
- Robots.txt Compliance: Respects website robots.txt rules to avoid crawling restricted areas.
- Customizable Crawling:
  - Configurable crawling depth (--depth).
  - Adjustable delay between requests (--delay).
  - Support for crawling subdomains (--sub).
  - Exclude specific file extensions (--exclude).
- HTTP Customization:
  - Custom User-Agent (--ua).
  - Custom headers (-H/--headers).
  - Proxy support (--proxy).
- Output Options:
  - Save results to organized directories (-o/--output).
  - Generate ASCII site structure tree (--structure).
  - Export results in TXT and JSON formats.
- Resource Extraction: Categorizes resources like images, scripts, and CSS for easy analysis.
- Resumable Crawling: Save and resume crawl state using pickle files (--cont).
- Threading Support: Concurrent crawling with adjustable threads (--threads, max 20).
- Robust Error Handling: Handles network errors, timeouts, and invalid URLs gracefully.
- User-Friendly Output: Detailed console logs with URL types, status, and depth, plus structured file outputs.
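To illustrate what robots.txt compliance means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not CrawlerX's own implementation; the sample rules and URLs are made up for the example, and the user agent is the default value listed for --ua.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler would run a check like this before every request.
allowed = parser.can_fetch("Spidar/1.0", "https://example.com/index.html")
blocked = parser.can_fetch("Spidar/1.0", "https://example.com/admin/users")

print(allowed, blocked)
```

A crawler that honors these rules would fetch the first URL and skip the second.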
Installation
pip install crawlerx
Usage
crawlerx -u <url> [options]
Common Flags
| Short | Long | Description | Required | Default |
|---|---|---|---|---|
| -u | --url | Target URL (e.g., https://example.com) | Yes | - |
| -o | --output | Output directory for results | No | None (prints to terminal) |
| | --structure | Generate ASCII site structure | No | False |
| -H | --headers | Custom headers (e.g., Cookie:session=abc;Auth:xyz) | No | None |
| | --threads | Number of concurrent threads (1-20) | No | 1 |
| | --depth | Maximum crawling depth | No | 3 |
| | --ua | Custom User-Agent string | No | Spidar/1.0 |
| | --exclude | Comma-separated extensions to exclude | No | jpg,jpeg,png,gif,pdf,css,js |
| | --sub | Include subdomains in crawling | No | False |
| | --proxy | Proxy server (e.g., http://proxy:port) | No | None |
| | --timeout | Request timeout in seconds | No | 5 |
| | --delay | Delay between requests in seconds | No | 1.0 |
| | --cont | Path to crawl state pickle file to resume | No | None |
| -h | --help | Show help message and exit | No | - |
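The -H/--headers value packs several headers into one string, separated by semicolons, with a colon between each name and value. The sketch below shows one plausible way such a string could be split into a dictionary. The function name `parse_headers` is hypothetical and this is not taken from the CrawlerX source, so the tool's actual parsing may differ.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Split 'Name:value;Name:value' into a dict (illustrative only)."""
    headers: dict[str, str] = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing or doubled semicolon
        # Split on the first colon only, so values may contain colons.
        name, sep, value = part.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"malformed header: {part!r}")
        headers[name.strip()] = value.strip()
    return headers


print(parse_headers("Cookie:session=abc123;Auth:Bearer xyz"))
```

Under this scheme a header value cannot itself contain a semicolon, so a Cookie header carrying several cookies would be split incorrectly. Whether CrawlerX shares that limitation is not stated here.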
Examples
Basic Crawl:
crawlerx -u https://example.com
Crawl with Output Directory:
crawlerx -u https://example.com -o ./results
Generate Site Structure:
crawlerx -u https://example.com --structure
Crawl with Custom Headers and Proxy:
crawlerx -u https://example.com -H "Cookie:session=abc123;Auth:Bearer xyz" --proxy http://proxy:8080
Resume Crawl from State:
crawlerx -u https://example.com --cont ./results/spider_example.com/crawl_state.pkl
Crawl Subdomains with Increased Threads:
crawlerx -u https://example.com --sub --threads 10
Exclude Specific Extensions:
crawlerx -u https://example.com --exclude pdf,docx
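The --sub and --exclude options both decide whether a discovered URL should be crawled. As a rough sketch of that kind of filtering (the helper names `in_scope` and `is_excluded` are hypothetical, and CrawlerX's actual rules may differ), the checks could look like this:

```python
import os
from urllib.parse import urlparse


def in_scope(url: str, root_host: str, include_subdomains: bool) -> bool:
    """Return True if url is on root_host, or on a subdomain when allowed."""
    host = urlparse(url).hostname or ""
    if host == root_host:
        return True
    # Require a leading dot so 'notexample.com' does not match 'example.com'.
    return include_subdomains and host.endswith("." + root_host)


def is_excluded(url: str, excluded_exts: set[str]) -> bool:
    """Return True if the URL path ends in one of the excluded extensions."""
    path = urlparse(url).path  # the query string is not part of the path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext in excluded_exts


excluded = set("pdf,docx".split(","))
print(in_scope("https://blog.example.com/", "example.com", True))
print(is_excluded("https://example.com/a/report.PDF?x=1", excluded))
```

Comparing on the parsed hostname and path, rather than on the raw URL string, keeps query strings and look-alike domains from producing false matches.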
Output Format
When using -o/--output, results are saved in a directory named spider_<domain> with the following structure:
spider_<domain>/
├── get/
│   ├── get_requests.txt
│   └── get_requests.json
├── post/
│   ├── post_requests.txt
│   └── post_requests.json
├── dir/
│   └── dirs.txt
├── files/
│   ├── files.txt
│   ├── images.txt
│   ├── images.json
│   ├── scripts.txt
│   ├── scripts.json
│   ├── css.txt
│   ├── css.json
│   ├── other.txt
│   └── other.json
├── structure/
│   └── structure.txt
└── crawl_state.pkl
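For a quick overview of a finished crawl, the TXT files in this layout can be summarized with a few lines of Python. This sketch assumes each .txt file holds one entry per line, which is not confirmed here. The function name `summarize_results` is hypothetical, and the JSON files are left alone because their schema is not documented in this section.

```python
from pathlib import Path


def summarize_results(output_dir: str) -> dict[str, int]:
    """Count non-empty lines in every .txt file under the output directory."""
    root = Path(output_dir)
    counts: dict[str, int] = {}
    for path in sorted(root.rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            # Assumption: one URL or path per line; blank lines are ignored.
            entries = sum(1 for line in handle if line.strip())
        counts[path.relative_to(root).as_posix()] = entries
    return counts
```

Calling `summarize_results("./results/spider_example.com")` would return a mapping such as `get/get_requests.txt` to its entry count, which is handy for comparing two crawls of the same target.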
Legal Notice
Use only on systems you have explicit permission to test.
Misuse may violate laws and ethical guidelines.
Credits
- Developed by IMApurbo
License
This project is licensed under the MIT License. See the LICENSE file for details.