A powerful, pure Python tool for extracting all URLs and extensions from target websites with WAF bypass
Pure Python URL Extractor v5.0
A powerful, standalone Python tool for extracting all URLs and file extensions from a target website without relying on external tools like gau or urlfinder.
Author: ArkhAngelLifeJiggy
Features
- 🎯 Comprehensive URL Discovery: Extracts URLs from multiple sources:
  - Wayback Machine (historical URLs)
  - Common Crawl (archived URLs)
  - Live website crawling
  - JavaScript file analysis
- 📊 Advanced URL Categorization:
  - Known URLs: Publicly available URLs from archives
  - Hidden URLs: URLs found in JavaScript files and comments
  - Internal URLs: URLs within the same domain
  - External URLs: URLs pointing to other domains
- 🔍 Categorized File Extension Analysis:
  - JavaScript: .js, .jsx, .ts, .coffee, .vue
  - HTML: .html, .htm, .xhtml, .xml
  - CSS: .css, .scss, .sass, .less
  - Images: .png, .jpg, .jpeg, .gif, .webp, .svg, .ico
  - Documents: .pdf, .doc, .docx, .txt, .md, .rtf
  - Archives: .zip, .rar, .tar, .gz, .7z, .bz2
  - Media: .mp4, .mp3, .avi, .mov, .wmv, .flv
  - Other: All remaining extensions
- 🛡️ Advanced WAF Bypass & Anti-Detection:
  - User agent rotation (5+ realistic browsers)
  - IP spoofing with X-Forwarded-For headers
  - Smart delaying mechanisms with randomization
  - Custom proxy support (HTTP/HTTPS)
  - Cache control and connection header randomization
  - Request retry mechanism with exponential backoff
- ✅ Advanced Validation & Filtering:
  - False positive detection and removal (data:, javascript:, mailto:)
  - MD5-based duplicate prevention with hash collision detection
  - URL normalization and validation with regex patterns
  - Domain-based filtering with subdomain support
  - Extension-based filtering (include/exclude specific file types)
  - Pattern matching with regex include/exclude rules
- 📝 Comprehensive Logging System:
  - Automatic timestamped log file generation
  - INFO, WARNING, ERROR level logging with context
  - Detailed operation tracking with performance metrics
  - Custom log file path support
  - Structured logging for debugging and analysis
- 🎨 Beautiful Interactive Interface:
  - Colorful terminal output with 8-color support
  - Real-time progress indicators and status updates
  - Professional ASCII banner with author credits
  - Categorized results display with statistics
  - Quiet mode and stats-only output options
- 📊 Multiple Output Formats:
  - JSON: Structured data with categorized results and metadata
  - CSV: Spreadsheet-compatible format with URL, source, type, extension columns
  - XML: Machine-readable format with proper schema
  - Statistics Only: Summary output for automation
  - Custom Formatting: Flexible output options
- ⚙️ Advanced Configuration Options (20+ parameters):
  - Crawling: max-pages, max-depth, concurrency, timeout
  - Security: waf-bypass, user-agent, proxy, delay, retries
  - Filtering: exclude-extensions, include-only, exclude-pattern
  - Output: csv, xml, quiet, stats-only, save-html
  - Debugging: verbose, log-file, no-color
- ⚡ Pure Python Implementation: No external dependencies on Go tools
- 🔧 Configurable Parameters: Fine-grained control over all aspects
- 📦 PyPI Package: Ready for pip install url-extractor-pro
Installation
- Install Python dependencies:
pip install -r requirements.txt
Usage
Basic Usage
python url_extractor.py https://example.com
Advanced Usage with WAF Bypass
python url_extractor.py https://example.com \
--output results.json \
--verbose \
--waf-bypass \
--delay 2.0 \
--max-pages 200 \
--max-depth 4 \
--threads 20
Command Line Options
| Option | Description | Default |
|---|---|---|
| target_url | Target URL to extract URLs from (required) | - |
| -o, --output | Output file for results (JSON/CSV/XML) | - |
| -v, --verbose | Enable verbose output | False |
| -p, --max-pages | Maximum pages to crawl | 100 |
| -d, --max-depth | Maximum crawl depth | 3 |
| -t, --threads | Number of threads for concurrent processing | 10 |
| --delay | Delay between requests in seconds | 1.0 |
| --waf-bypass | Enable WAF bypass techniques | False |
| --user-agent | Custom user agent string | Random |
| --proxy | HTTP proxy (http://proxy:port) | - |
| --timeout | Request timeout in seconds | 30 |
| --max-js-files | Maximum JS files to analyze | 20 |
| --concurrency | Number of concurrent requests | 5 |
| --retries | Number of retries for failed requests | 3 |
| --exclude-extensions | Comma-separated extensions to exclude | - |
| --include-only | Include only URLs containing this string | - |
| --exclude-pattern | Exclude URLs matching regex pattern | - |
| --save-html | Save HTML responses for analysis | False |
| --quiet | Suppress all output except results | False |
| --stats-only | Show only statistics, no URLs | False |
| --csv | Output in CSV format | False |
| --xml | Output in XML format | False |
| --log-file | Custom log file path | Auto-generated |
| --no-color | Disable colored output | False |
Output
The tool provides a comprehensive summary including:
- Total number of unique URLs found
- URLs categorized by type (known, hidden, internal, external)
- File extensions organized by category
- Top extensions with counts
- Optional JSON export with all URLs and metadata
JSON Output Format
{
"target_url": "https://example.com",
"domain": "example.com",
"timestamp": "2025-09-17T13:00:00.000000",
"total_urls": 150,
"categorized_urls": {
"known": ["https://example.com/page1", ...],
"hidden": ["https://example.com/api/secret", ...],
"internal": ["https://example.com/about", ...],
"external": ["https://cdn.example.com/script.js", ...]
},
"categorized_extensions": {
"javascript": {"js": 25, "jsx": 5},
"html": {"html": 10, "xml": 2},
"css": {"css": 15, "scss": 3},
"images": {"png": 45, "jpg": 30},
"documents": {"pdf": 5, "txt": 8},
"archives": {"zip": 2, "gz": 1},
"media": {"mp4": 3, "mp3": 7},
"other": {"json": 12, "xml": 8}
},
"all_urls": ["https://example.com/", ...],
"statistics": {
"total_known": 120,
"total_hidden": 15,
"total_internal": 135,
"total_external": 15,
"total_extensions": 89
}
}
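A report in this shape can be post-processed with a few lines of Python. The sketch below is not part of the tool: `summarize` and the `sample` dict are hypothetical names used for illustration, and the code assumes only the keys shown in the example above.

```python
from collections import Counter


def summarize(report: dict) -> dict:
    """Condense a report dict shaped like the JSON example above."""
    # Count URLs per category (known, hidden, internal, external).
    per_category = {
        name: len(urls)
        for name, urls in report.get("categorized_urls", {}).items()
    }
    # Flatten the nested per-category extension counts into one counter.
    extensions = Counter()
    for counts in report.get("categorized_extensions", {}).values():
        extensions.update(counts)
    return {
        "total_urls": report.get("total_urls", 0),
        "per_category": per_category,
        "top_extensions": extensions.most_common(3),
    }


# Small hand-written sample; a real report would come from json.load().
sample = {
    "total_urls": 3,
    "categorized_urls": {
        "internal": ["https://example.com/a", "https://example.com/b"],
        "external": ["https://cdn.example.net/x.js"],
    },
    "categorized_extensions": {"javascript": {"js": 1}, "other": {"json": 2}},
}
print(summarize(sample))
```

For a file written with --output, load it with `json.load` and pass the resulting dict to `summarize`.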
How It Works
- Wayback Machine: Queries the Internet Archive for historical URLs
- Common Crawl: Fetches URLs from the Common Crawl index
- Website Crawling: Performs breadth-first crawling of the target site
- JavaScript Analysis: Parses JS files to extract embedded URLs
- URL Normalization: Converts relative URLs to absolute URLs
- Validation & Filtering: Removes false positives and duplicates
- Categorization: Classifies URLs by type and source
- Extension Analysis: Groups file extensions by category
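Three of the steps above (JavaScript analysis, URL normalization, and extension analysis) can be sketched with the standard library alone. This is an illustration under assumptions, not the tool's actual code: the regex `URL_IN_JS`, the function names, and the abridged `CATEGORIES` map are hypothetical, and the real implementation may match more URL shapes.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urljoin, urlparse

# Assumed pattern: quoted absolute URLs or root-relative paths in JS source.
URL_IN_JS = re.compile(r"""["'](https?://[^"'\s]+|/[A-Za-z0-9_\-./]+)["']""")

# Abridged copy of the extension categories from the feature list.
CATEGORIES = {
    "javascript": {"js", "jsx", "ts", "coffee", "vue"},
    "html": {"html", "htm", "xhtml", "xml"},
    "css": {"css", "scss", "sass", "less"},
    "images": {"png", "jpg", "jpeg", "gif", "webp", "svg", "ico"},
}


def extract_urls(js_source: str, base_url: str) -> list[str]:
    """Find quoted URLs in JS text and resolve them against base_url."""
    found = []
    for match in URL_IN_JS.finditer(js_source):
        absolute = urljoin(base_url, match.group(1))  # relative -> absolute
        if absolute not in found:
            found.append(absolute)
    return found


def extension_of(url: str) -> str:
    """Return the lowercase extension of the URL path, or '' if none."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def categorize_extensions(urls) -> dict:
    """Group extension counts by category; unknown extensions go to 'other'."""
    grouped = defaultdict(Counter)
    for url in urls:
        ext = extension_of(url)
        if not ext:
            continue
        category = next(
            (name for name, exts in CATEGORIES.items() if ext in exts), "other"
        )
        grouped[category][ext] += 1
    return {name: dict(counts) for name, counts in grouped.items()}


sample_js = """fetch("/api/v1/users"); var cdn = "https://cdn.example.net/app.min.js"; img.src = '/static/logo.png';"""
urls = extract_urls(sample_js, "https://example.com/assets/main.js")
print(urls)
print(categorize_extensions(urls))
```

Resolving against the URL of the script itself (rather than the site root) keeps path-relative references correct, which is why `urljoin` takes the JS file's URL as its base here.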
Examples
Extract URLs from a website with WAF bypass
python url_extractor.py https://bugcrowd.com --waf-bypass --delay 1.5 -v
Deep crawl with custom logging
python url_extractor.py https://example.com \
-v -p 500 -d 5 \
-o results.json \
--log-file custom_scan.log
Quick scan with minimal resources
python url_extractor.py https://example.com \
-p 50 -d 2 -t 5 \
--no-color
Full security testing mode
python url_extractor.py https://target.com \
--waf-bypass \
--delay 3.0 \
--max-pages 1000 \
--max-depth 5 \
--threads 30 \
--verbose \
--output security_scan.json
Advanced filtering and output options
# Filter specific file types and use multiple output formats
python url_extractor.py https://site.com \
--exclude-extensions pdf,doc,zip,rar \
--include-only api \
--csv --xml \
--proxy http://127.0.0.1:8080 \
--user-agent "Custom Security Scanner v1.0"
# Quiet mode for automation with custom logging
python url_extractor.py https://target.com \
--quiet \
--stats-only \
--log-file /var/log/url_scan.log \
--exclude-pattern "\.(jpg|png|gif)$" \
--concurrency 10 \
--timeout 60
PyPI Installation (Future)
pip install url-extractor-pro
url-extractor-pro https://example.com --waf-bypass
Dependencies
- requests>=2.25.0: HTTP client for web requests
- beautifulsoup4>=4.9.0: HTML parsing
- lxml>=4.6.0: XML/HTML parser (optional, for better performance)
Logging
The tool automatically creates detailed log files with timestamps:
2025-09-17 14:07:08,349 - INFO - Starting URL Extractor v5.0 by ArkhAngelLifeJiggy
2025-09-17 14:07:08,350 - INFO - Target: https://bugcrowd.com
2025-09-17 14:08:32,058 - INFO - Wayback Machine: 201482 valid URLs found
2025-09-17 14:08:32,974 - INFO - Common Crawl: 0 valid URLs found
2025-09-17 14:08:35,831 - INFO - Live Crawling: 140 valid URLs found
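Log lines in this layout match what Python's standard `logging` module produces with the format string `%(asctime)s - %(levelname)s - %(message)s`. The sketch below shows one way to get that output; `build_logger` and the logger name are hypothetical, and the tool's own setup may differ.

```python
import logging
import os
import tempfile

LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def build_logger(log_path: str) -> logging.Logger:
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("url_extractor_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


# Demo: write one line to a temporary file and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "custom_scan.log")
logger = build_logger(demo_path)
logger.info("Target: https://example.com")
with open(demo_path, encoding="utf-8") as handle:
    first_line = handle.readline().rstrip("\n")
print(first_line)
```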
WAF Bypass Features
- User Agent Rotation: Cycles through 5+ realistic browser signatures
- IP Spoofing: Randomized X-Forwarded-For and X-Real-IP headers
- Smart Delaying: Configurable delays with randomization to avoid detection
- Header Randomization: Varies cache-control, connection, and other headers
- Request Pattern Variation: Mimics human browsing behavior
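The header and timing techniques listed above, together with the exponential backoff mentioned in the feature list, can be sketched as pure functions. Everything here is an assumption for illustration: the user-agent strings, the 50% jitter range, the doubling backoff schedule, and the function names are not taken from the tool's source.

```python
import random

# Illustrative user-agent strings; the tool's actual list is not shown here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def random_headers(rng: random.Random) -> dict:
    """Build one randomized header set for a single request."""
    fake_ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "X-Forwarded-For": fake_ip,
        "X-Real-IP": fake_ip,
        "Cache-Control": rng.choice(["no-cache", "max-age=0"]),
        "Connection": rng.choice(["keep-alive", "close"]),
    }


def jittered_delay(base_delay: float, rng: random.Random) -> float:
    """Base delay plus up to 50% random jitter (assumed range)."""
    return base_delay * (1 + rng.uniform(0, 0.5))


def backoff_delays(retries: int, base_delay: float = 1.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_delay * 2 ** attempt for attempt in range(retries)]


rng = random.Random()
print(random_headers(rng))
print(jittered_delay(2.0, rng))
print(backoff_delays(3))
```

A dict like this can be passed as the `headers` argument of `requests.get`, with `time.sleep(jittered_delay(...))` between calls. Note that X-Forwarded-For and X-Real-IP only influence servers that trust those headers; the TCP source address is unchanged.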
Validation & Filtering
- False Positive Detection: Automatically filters out:
  - data: URLs
  - javascript: URLs
  - mailto: links
  - tel: links
  - Fragment-only URLs (#)
  - Chrome/Safari internal URLs
- Duplicate Prevention: MD5 hash-based deduplication ensures no repeated URLs
- URL Validation: Ensures all URLs have a valid scheme and netloc
- Domain Filtering: Respects same-domain boundaries for internal/external classification
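A minimal sketch of these checks, using only the standard library, might look as follows. The function names and the `include_subdomains` switch are hypothetical; this README does not say whether subdomains count as internal, so the sketch leaves that choice to the caller.

```python
import hashlib
from urllib.parse import urlparse

# Prefixes listed above as false positives ("#" covers fragment-only URLs).
FALSE_POSITIVE_PREFIXES = ("data:", "javascript:", "mailto:", "tel:", "#")


def is_valid(url: str) -> bool:
    """Reject false positives and URLs without an http(s) scheme and netloc."""
    candidate = url.strip()
    if not candidate or candidate.lower().startswith(FALSE_POSITIVE_PREFIXES):
        return False
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def deduplicate(urls) -> list:
    """Keep the first occurrence of each URL, tracked by its MD5 digest."""
    seen, unique = set(), []
    for url in urls:
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique


def is_internal(url: str, domain: str, include_subdomains: bool = False) -> bool:
    """True if the URL's host is the target domain (optionally a subdomain)."""
    host = (urlparse(url).hostname or "").lower()
    if host == domain:
        return True
    return include_subdomains and host.endswith("." + domain)


candidates = [
    "https://example.com/a",
    "mailto:me@example.com",
    "#top",
    "https://example.com/a",
    "javascript:void(0)",
    "https://other.org/x",
]
clean = deduplicate([u for u in candidates if is_valid(u)])
print(clean)
print([u for u in clean if is_internal(u, "example.com")])
```

MD5 is used here only as a compact fingerprint for deduplication, not for any security purpose.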
License
This project is open source and available under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
Disclaimer
This tool is for educational and research purposes only. Always respect website terms of service and robots.txt files when crawling websites. Use responsibly and ethically.