Website crawling bot

Description

NightCrawler is a site crawling (spider) tool that gathers the links on a given domain by walking through the whole site and generating a simple sitemap.
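The walk described above can be illustrated with a short, standard-library-only sketch. This is not NightCrawler's actual code: `LinkParser`, `crawl` and the `fetch` callable are hypothetical names chosen for the example, and the pages come from an in-memory dictionary rather than from real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Walk the site breadth-first and return the sorted same-host URLs.

    `fetch` is an assumed callable: it takes a URL and returns the HTML
    text, or None when the page cannot be retrieved.
    """
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)


# An in-memory "site" stands in for the network in this example.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="http://other.org/">x</a>',
    "http://example.com/about": '<a href="/">Home</a> <a href="team#top">Team</a>',
    "http://example.com/team": "",
}
sitemap = crawl("http://example.com/", PAGES.get)
for url in sitemap:
    print(url)
```

The `seen` set keeps each page from being visited more than once, and the link to `other.org` is skipped because its host differs from the start URL.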

Limitations

This tool is just a demo. It is a single-threaded script that walks every page it finds, and it is not optimized for speed.

The script sticks to the URL provided and does not descend into subdomains of the given domain, even if it encounters an internal redirect such as example.com -> www.example.com.
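One way such a restriction could be expressed is an exact hostname comparison, sketched below. The function name `same_host` is hypothetical and the check is an assumption about how the behaviour might be implemented, not a copy of the script's logic.

```python
from urllib.parse import urlsplit


def same_host(base_url, candidate_url):
    """Return True only when both URLs have exactly the same hostname.

    Subdomains do not match, so "www.example.com" is treated as a
    different host from "example.com". URLs without a scheme have no
    parsed hostname and therefore never match.
    """
    base = urlsplit(base_url).hostname
    candidate = urlsplit(candidate_url).hostname
    return base is not None and base == candidate


print(same_host("http://example.com/", "http://example.com/about"))
print(same_host("http://example.com/", "http://www.example.com/"))
```

With a check like this, a redirect from example.com to www.example.com leads to a host that is never followed, which matches the limitation described above.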

Possible enhancements

  • Use multi-threading with thread pools

  • Use generators to lower memory footprint and gain a bit more speed

  • Make preliminary HEAD request to distinguish between text and binary files

  • Check Content-Type and exclude files that are not HTML

  • Add matchers and sitemap generators for additional sitemap flavours (images, videos, etc.)

  • More tests (the tests already included cover only the most critical classes)
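As an illustration of the HEAD request and Content-Type ideas from the list above, here is a minimal sketch using only the standard library. The helpers `is_html` and `head_content_type` are hypothetical and are not part of NightCrawler. The `method` argument of `urllib.request.Request` assumes Python 3.3 or newer.

```python
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True when a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None.

    Only the headers are transferred, so binary files can be skipped
    without downloading their bodies.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


# The header check can be tried without touching the network.
print(is_html("text/html; charset=utf-8"))
print(is_html("image/png"))
```

A crawler could call `head_content_type(url)` first and fetch the full page only when `is_html` accepts the result. The cost is one extra round trip per URL.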

Installation

1. Requirements

  1. Python >= 3.2

  2. pip

2a. Installation without virtualenv

Run the following command in shell:

pip install NightCrawler

2b. Installation in virtualenv

Run the following command in shell:

virtualenv .env
. .env/bin/activate
pip install NightCrawler

2c. Installation from source (development)

To install the package from source, clone the repository and then create a virtualenv inside it:

git clone https://github.com/szczad/NightCrawler.git
cd NightCrawler
virtualenv .env
. .env/bin/activate
pip install -e ./

3. (optional) Testing

When the package is installed from source in development mode, the script can be tested with the following commands:

. .env/bin/activate
python setup.py test

Running the script

0. Help

nightcrawler --help

1. Running the script installed globally

nightcrawler <url|domain>

2. Running the script installed in virtualenv

<path_to_virtualenv>/bin/nightcrawler <url|domain>

or

. .env/bin/activate
nightcrawler <url|domain>
