
Website crawling bot


Description

NightCrawler is a site crawling (spider) tool that gathers the links on a given domain by walking through the whole site and generating a simple sitemap.
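To illustrate what gathering links from a page involves, here is a minimal sketch using only the standard library. It is not NightCrawler's actual code: the names `LinkCollector` and `extract_links` are hypothetical, and the sketch assumes that only `<a href="...">` tags matter and that fragments (`#...`) can be dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawler would call something like `extract_links(page_url, page_body)` for every page it downloads and feed the results back into its queue of pages to visit.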

Limitations

This tool is just a demo. It is a single-threaded script that walks every page it finds, and it is not optimized for speed.

The script sticks to the URL provided and does not descend into subdomains of the given domain, even if it encounters an internal redirect such as example.com -> www.example.com.
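The restriction above amounts to an exact host comparison. The following is a sketch of one way to express it, not the check NightCrawler itself performs; `same_host` is a hypothetical name, and both arguments are assumed to be full URLs including a scheme (a bare `example.com` has no parseable host).

```python
from urllib.parse import urlsplit


def same_host(start_url, candidate_url):
    """Return True only when both URLs point at exactly the same host.

    Subdomains count as different hosts, so www.example.com does not
    match example.com.
    """
    return urlsplit(start_url).hostname == urlsplit(candidate_url).hostname
```

With this rule, `same_host("http://example.com/", "http://www.example.com/")` is `False`, which is why a redirect to the `www` host ends the crawl instead of continuing it.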

Possible enhancements

  • Use multi-threading with thread pools

  • Use generators to lower memory footprint and gain a bit more speed

  • Make preliminary HEAD request to distinguish between text and binary files

  • Check Content-Type and exclude files that are not HTML

  • Add matchers and sitemap generators for additional sitemap flavours (images, videos, etc.)

  • Add more tests (the included tests cover only the most critical classes)
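Some of the enhancements listed above can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not part of NightCrawler: `is_html`, `head_is_html`, and `crawl` are hypothetical names, `get_links` stands for whatever function returns the links of a page, and `Request(..., method="HEAD")` needs Python 3.3 or newer.

```python
from collections import deque
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True when a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def head_is_html(url, timeout=10):
    """Send a HEAD request and inspect Content-Type (requires network access)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return is_html(response.headers.get("Content-Type"))


def crawl(start_url, get_links):
    """Lazily yield every reachable URL once, in breadth-first order.

    Being a generator, it hands out URLs one at a time instead of
    building the complete result list in memory first.
    """
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        yield url
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
```

A caller could iterate with `for url in crawl(start, get_links): ...` and skip the download of any URL for which `head_is_html(url)` is false. A thread pool could then be layered on top by handing the yielded URLs to `concurrent.futures.ThreadPoolExecutor`.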

Installation

1. Requirements

  1. Python >= 3.2

  2. pip

2a. Installation without virtualenv

Run the following command in shell:

pip install NightCrawler

2b. Installation in virtualenv

Run the following command in shell:

virtualenv .env
. .env/bin/activate
pip install NightCrawler

2c. Installation from source (development)

To install the package from source, clone the repository and create a virtualenv inside it:

git clone https://github.com/szczad/NightCrawler.git
cd NightCrawler
virtualenv .env
. .env/bin/activate
pip install -e ./

3. (optional) Testing

When installed from source in development mode, the script can be tested with the following commands:

. .env/bin/activate
python setup.py test

Running the script

0. Help

nightcrawler --help

1. Running the script installed globally

nightcrawler <url|domain>

2. Running the script installed in virtualenv

<path_to_virtualenv>/bin/nightcrawler <url|domain>

or

. .env/bin/activate
nightcrawler <url|domain>

