CrawlerX - The Ultimate Web Crawler

CrawlerX is a command-line tool designed for security researchers and penetration testers to perform comprehensive web crawling for reconnaissance. It extracts URLs, POST data, directories, files, and resources, while respecting robots.txt and supporting customizable configurations for depth, threading, and output formatting.

✨ Developed by @IMApurbo
🛡️ Use responsibly. Authorized testing only.


Features

  • Comprehensive Crawling
    Extracts GET URLs with query parameters, POST requests, directories, files, and resources (images, scripts, CSS, etc.).

  • Robots.txt Compliance
    Respects website robots.txt rules to avoid crawling restricted areas.

  • Customizable Crawling

    • Configurable crawling depth (--depth).
    • Adjustable delay between requests (--delay).
    • Support for crawling subdomains (--sub).
    • Exclude specific file extensions (--exclude).
  • HTTP Customization

    • Custom User-Agent (--ua).
    • Custom headers (-H/--headers).
    • Proxy support (--proxy).
  • Output Options

    • Save results to organized directories (-o/--output).
    • Generate ASCII site structure tree (--structure).
    • Export results in TXT and JSON formats.
  • Resource Extraction
    Categorizes resources like images, scripts, and CSS for easy analysis.

  • Resumable Crawling
    Save and resume crawl state using pickle files (--cont).

  • Threading Support
    Concurrent crawling with adjustable threads (--threads, max 20).

  • Robust Error Handling
    Handles network errors, timeouts, and invalid URLs gracefully.

  • User-Friendly Output
    Detailed console logs with URL types, status, and depth, plus structured file outputs.
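
As an illustration of what the robots.txt compliance feature implies, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from CrawlerX's source: the helper name `build_robots_checker` and the sample rules are assumptions made for the example, and the rules are parsed from a string so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Return a function that reports whether a URL may be fetched.

    Hypothetical helper for illustration; CrawlerX may do this differently.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Sample rules (assumed for the example): block everything under /admin/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

allowed = build_robots_checker(SAMPLE_ROBOTS, "Spidar/1.0")
print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/admin/login"))
```

A crawler built this way would call `allowed(url)` before each request and skip any URL for which it returns `False`.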


Installation

pip install crawlerx

Usage

crawlerx -u <url> [options]

Common Flags

Short  Long         Description                                         Required  Default
-u     --url        Target URL (e.g., https://example.com)              ✅        -
-o     --output     Output directory for results                        ❌        None (prints to terminal)
       --structure  Generate ASCII site structure                       ❌        False
-H     --headers    Custom headers (e.g., Cookie:session=abc;Auth:xyz)  ❌        None
       --threads    Number of concurrent threads (1-20)                 ❌        1
       --depth      Maximum crawling depth                              ❌        3
       --ua         Custom User-Agent string                            ❌        Spidar/1.0
       --exclude    Comma-separated extensions to exclude               ❌        jpg,jpeg,png,gif,pdf,css,js
       --sub        Include subdomains in crawling                      ❌        False
       --proxy      Proxy server (e.g., http://proxy:port)              ❌        None
       --timeout    Request timeout in seconds                          ❌        5
       --delay      Delay between requests in seconds                   ❌        1.0
       --cont       Path to crawl state pickle file to resume           ❌        None
-h     --help       Show help message and exit                          ❌        -

Examples

Basic Crawl:

crawlerx -u https://example.com

Crawl with Output Directory:

crawlerx -u https://example.com -o ./results

Generate Site Structure:

crawlerx -u https://example.com --structure
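
To give a feel for what an ASCII site structure looks like, the sketch below builds a tree from a list of URLs and renders it with box-drawing characters. The functions `build_tree` and `render_tree` are hypothetical names for this example; the layout CrawlerX actually writes to structure.txt may differ.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URL path segments into a dict of dicts (illustrative helper)."""
    tree = {}
    for url in urls:
        node = tree
        for part in urlparse(url).path.strip("/").split("/"):
            if part:  # skip the empty segment produced by a bare "/"
                node = node.setdefault(part, {})
    return tree


def render_tree(tree, prefix=""):
    """Return the tree as a list of lines, children sorted by name."""
    lines = []
    names = sorted(tree)
    for i, name in enumerate(names):
        last = i == len(names) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        # Continue the vertical bar only while siblings remain below.
        lines.extend(render_tree(tree[name], prefix + ("    " if last else "│   ")))
    return lines


urls = [
    "https://example.com/a/b.html",
    "https://example.com/a/c.html",
    "https://example.com/d",
]
print("\n".join(render_tree(build_tree(urls))))
```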

Crawl with Custom Headers and Proxy:

crawlerx -u https://example.com -H "Cookie:session=abc123;Auth:Bearer xyz" --proxy http://proxy:8080
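
The -H value packs several headers into one string, separated by semicolons, with a colon between each name and value. The sketch below shows one plausible way to split such a string. It is an assumption about the format based on the example above, not CrawlerX's actual parser, and `parse_header_string` is a hypothetical name.

```python
def parse_header_string(raw: str) -> dict[str, str]:
    """Split 'Name:value;Name:value' into a dict (illustrative only)."""
    headers = {}
    for pair in raw.split(";"):
        if ":" not in pair:
            continue  # ignore fragments that have no name/value separator
        # Split on the first colon only, so values may contain colons.
        name, value = pair.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


print(parse_header_string("Cookie:session=abc123;Auth:Bearer xyz"))
```

One consequence of splitting on semicolons in this way is that a Cookie header carrying several cookies (for example `a=1; b=2`) would lose everything after the first semicolon, so it is worth checking how the tool treats such values before relying on them.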

Resume Crawl from State:

crawlerx -u https://example.com --cont ./results/spider_example.com/crawl_state.pkl
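
For readers unfamiliar with pickle-based resumption, here is a minimal sketch of saving and restoring crawl state. The state layout (a `visited` set and a `queue` list) and the function names are assumptions for illustration; the contents of CrawlerX's crawl_state.pkl are not documented here.

```python
import pickle


def save_state(path, visited, queue):
    """Write the crawl state to a pickle file (hypothetical layout)."""
    state = {"visited": set(visited), "queue": list(queue)}
    with open(path, "wb") as fh:
        pickle.dump(state, fh)


def load_state(path):
    """Read a previously saved crawl state back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Because unpickling can execute arbitrary code, a state file should only be loaded if it was produced by your own earlier run.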

Crawl Subdomains with Increased Threads:

crawlerx -u https://example.com --sub --threads 10

Exclude Specific Extensions:

crawlerx -u https://example.com --exclude pdf,docx
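
Extension filtering of this kind can be sketched as follows. The helpers `parse_exclude` and `is_excluded` are hypothetical, and matching case-insensitively on the URL path (ignoring the query string) is an assumption rather than documented CrawlerX behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_exclude(raw: str) -> set[str]:
    """Turn 'pdf,docx' into {'pdf', 'docx'} (illustrative helper)."""
    return {ext.strip().lower().lstrip(".") for ext in raw.split(",") if ext.strip()}


def is_excluded(url: str, excluded: set[str]) -> bool:
    """Check the extension of the URL path, ignoring query and fragment."""
    suffix = PurePosixPath(urlparse(url).path).suffix  # e.g. ".pdf" or ""
    return suffix.lower().lstrip(".") in excluded


excluded = parse_exclude("pdf,docx")
print(is_excluded("https://example.com/report.PDF?x=1", excluded))
print(is_excluded("https://example.com/index.html", excluded))
```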

Output Format

When using -o/--output, results are saved in a directory named spider_<domain> with the following structure:

spider_<domain>/
├── get/
│   ├── get_requests.txt
│   └── get_requests.json
├── post/
│   ├── post_requests.txt
│   └── post_requests.json
├── dir/
│   └── dirs.txt
├── files/
│   ├── files.txt
│   ├── images.txt
│   ├── images.json
│   ├── scripts.txt
│   ├── scripts.json
│   ├── css.txt
│   ├── css.json
│   ├── other.txt
│   └── other.json
├── structure/
│   └── structure.txt
└── crawl_state.pkl
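
Once a crawl has finished, a short script can give a quick overview of the results directory. The sketch below counts non-blank lines in every .txt file under the output root. It assumes each TXT file holds one entry per line, which is not stated above, and `summarize_results` is a hypothetical helper name.

```python
from pathlib import Path


def summarize_results(root):
    """Map each .txt file (relative path) to its count of non-blank lines."""
    counts = {}
    for txt in sorted(Path(root).rglob("*.txt")):
        with txt.open(encoding="utf-8", errors="replace") as fh:
            entries = sum(1 for line in fh if line.strip())
        counts[txt.relative_to(root).as_posix()] = entries
    return counts
```

For example, `summarize_results("./results/spider_example.com")` would return a dictionary keyed by paths such as `get/get_requests.txt`.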


Legal Notice

Use only on systems you have explicit permission to test.
Misuse may violate laws and ethical guidelines.


Credits

  • Developed by IMApurbo

📃 License

This project is licensed under the MIT License. See the LICENSE file for details.
