
Cr0wl3r

Description

This package implements a web crawler that finds all visible URLs and resources in a page and sorts URLs by dynamic content (useful for pentesting and hacking).

Requirements

This package requires:

  • python3
  • python3 Standard Library

Optional:

  • Selenium
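
Selenium is only required to crawl JavaScript-rendered pages: the --no-gui and --script options described in the help below rely on it.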

Installation

pip install Cr0wl3r 

Usage

Command lines

# Python executable
python3 Cr0wl3r.pyz -h
# or
chmod u+x Cr0wl3r.pyz
./Cr0wl3r.pyz --help

# Python module
python3 -m Cr0wl3r https://github.com/mauricelambert

# Entry point (console)
Cr0wl3r -F report.json -l DEBUG -f logs.log -R -S -d -c "cookie=foobar" -H "User-Agent:Chrome" -m 3 -t p -r https://github.com/mauricelambert
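
In the last command, -F writes a JSON report, -l and -f set the log level and log file, -R and -S skip robots.txt and sitemap.xml, -d requests all domains instead of only the base domain, -c adds a cookie, -H adds a header, -m 3 limits the crawl to three requests, -t p adds a counter for p tags in the scoring, and -r crawls recursively (see the full help below).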

Python3

from Cr0wl3r import CrawlerPrinter
from logging import basicConfig

basicConfig(level=1)

cr0wl3r = CrawlerPrinter(
    "https://github.com/mauricelambert",  # first URL to crawl
    recursive=False,                      # don't crawl discovered URLs recursively
    max_request=None,                     # no limit on the number of requests
    only_domain=True,                     # only request the base URL's domain
    headers={},                           # extra headers, e.g. {"User-Agent": "Chrome"}
    robots=True,                          # search, request and parse robots.txt
    sitemap=True,                         # search, request and parse sitemap.xml
)
cr0wl3r.crawl()
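
To control what happens with each discovered URL, subclass _Crawler and override its handle method:
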
from Cr0wl3r import _Crawler

class CustomCr0wl3r(_Crawler):
    def handle(self, url: str, *sources: str) -> None:
        print("[+] New url:", url, "from", *sources)
        print("[*] Getted in", self.urls_parsed[-1])
        print("[*] There are still", len(self.urls_to_parse), "requests to send.")

cr0wl3r = CustomCr0wl3r("https://github.com/mauricelambert", recursive=False)
cr0wl3r.crawl()
with open("urls.txt", "w") as report:
    for url in cr0wl3r.urls_getted:
        report.write(url + "\n")
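
Because handle runs for every discovered URL, a subclass can also filter or persist results while the crawl is running. Below is a minimal sketch assuming the same _Crawler interface shown above; the FilteringCr0wl3r name, the query-string heuristic, and the output filename are illustrative, not part of the package.

from urllib.parse import urlparse

from Cr0wl3r import _Crawler

class FilteringCr0wl3r(_Crawler):
    # Hypothetical subclass: keeps only URLs with a query string,
    # which often indicates dynamic content.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dynamic_urls = []

    def handle(self, url: str, *sources: str) -> None:
        if urlparse(url).query:
            self.dynamic_urls.append(url)

crawler = FilteringCr0wl3r(
    "https://github.com/mauricelambert",
    recursive=False,
    max_request=10,
)
crawler.crawl()

with open("dynamic_urls.txt", "w") as report:
    for url in crawler.dynamic_urls:
        report.write(url + "\n")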

Help

~# python3 Cr0wl3r.py -h
usage: Cr0wl3r.py [-h] [--recursive] [--do-not-request-robots] [--do-not-request-sitemap] [--not-only-domain] [--max-request MAX_REQUEST]
                  [--cookie COOKIE] [--headers HEADERS [HEADERS ...]] [--tags-counter TAGS_COUNTER [TAGS_COUNTER ...]] [--report-filename REPORT_FILENAME]
                  [--loglevel {DEBUG,ERROR,CRITICAL,WARNING,INFO}] [--logfile LOGFILE] [--no-gui] [--script SCRIPT]
                  url

This script crawls web site and prints URLs.

positional arguments:
  url                   First URL to crawl.

options:
  -h, --help            show this help message and exit
  --recursive, -r       Crawl URLs recursively.
  --do-not-request-robots, --no-robots, -R
                        Don't search, request and parse robots.txt
  --do-not-request-sitemap, --no-sitemap, -S
                        Don't search, request and parse sitemap.xml
  --not-only-domain, -d
                        Do not request only the base URL domain (request all domains).
  --max-request MAX_REQUEST, -m MAX_REQUEST
                        Maximum request to perform.
  --cookie COOKIE, -c COOKIE
                        Add a cookie.
  --headers HEADERS [HEADERS ...], -H HEADERS [HEADERS ...]
                        Add headers.
  --tags-counter TAGS_COUNTER [TAGS_COUNTER ...], --tags TAGS_COUNTER [TAGS_COUNTER ...], -t TAGS_COUNTER [TAGS_COUNTER ...]
                        Add a tag counter for scoring.
  --report-filename REPORT_FILENAME, --report REPORT_FILENAME, -F REPORT_FILENAME
                        The JSON report filename.
  --loglevel {DEBUG,ERROR,CRITICAL,WARNING,INFO}, -l {DEBUG,ERROR,CRITICAL,WARNING,INFO}
                        WebSiteCloner logs level.
  --logfile LOGFILE, -f LOGFILE
                        WebCrawler logs file.
  --no-gui, -g          Don't use GUI for selenium, sometime Web Page doesn't respond correctly.
  --script SCRIPT, -s SCRIPT
                        A javascript script you want to run on the page, to wait a specific element created by javascript.
~# 

Licence

Licensed under the GPL, version 3.
