Website crawling bot
Description
NightCrawler is a site crawling (spider) tool that gathers the links on a given domain by walking through the whole site, and then generates a simple sitemap.
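The two steps described above (collecting links from a page, then emitting a sitemap) can be sketched as follows. This is only an illustration, not NightCrawler's actual code: the names `LinkCollector`, `extract_links` and `build_sitemap` are hypothetical, and the sitemaps.org XML layout is an assumption about what "simple sitemap" means here.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from xml.sax.saxutils import escape


class LinkCollector(HTMLParser):
    """Collect absolute link targets from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def build_sitemap(urls):
    """Render URLs as a minimal XML sitemap (assumed sitemaps.org layout)."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in sorted(set(urls)):
        lines.append("  <url><loc>{}</loc></url>".format(escape(url)))
    lines.append("</urlset>")
    return "\n".join(lines)
```

A full crawler would call `extract_links` on every fetched page, queue the links it has not seen yet, and pass the final set of URLs to `build_sitemap`.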
Limitations
This tool is just a demo. It is a single-threaded script that walks every page it finds, and it is not optimized for speed.
The script sticks to the URL provided and does not descend into subdomains of the given domain, even if it encounters an internal redirect such as example.com -> www.example.com.
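The host restriction described above could be expressed as a strict host-name comparison. The sketch below is an assumption about how such a check might look, not the tool's real implementation; `same_host` is a hypothetical helper.

```python
from urllib.parse import urlparse


def same_host(start_url, candidate_url):
    """Return True only if both URLs have exactly the same host name."""
    start_host = urlparse(start_url).hostname or ""
    candidate_host = urlparse(candidate_url).hostname or ""
    # An exact match means www.example.com is treated as a different host
    # from example.com, which mirrors the limitation described above.
    return bool(start_host) and start_host == candidate_host
```

With this check, a crawl started at `http://example.com/` would skip `http://www.example.com/` as well as non-HTTP links such as `mailto:` addresses, which have no host name.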
Possible enhancements
Use multi-threading with thread pools
Use generators to lower memory footprint and gain a bit more speed
Make preliminary HEAD request to distinguish between text and binary files
Check the Content-Type header and exclude files that are not HTML
Add matchers and sitemap generators for additional sitemap flavours (images, videos, etc.)
Add more tests (the included tests cover only the most critical classes)
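Three of the enhancements listed above (a thread pool, a preliminary HEAD request, and a Content-Type check) could be combined roughly as shown below. This is a sketch under stated assumptions rather than a planned design: `is_html`, `head_content_type` and `filter_html_urls` are hypothetical names, only the standard library is used, and `Request(..., method="HEAD")` requires Python 3.3 or newer.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the Content-Type header, or None."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.headers.get("Content-Type")
    except OSError:
        # Covers URLError, HTTPError and socket timeouts on Python 3.3+.
        return None


def filter_html_urls(urls, probe=head_content_type, max_workers=4):
    """Probe URLs concurrently and keep those that report an HTML type."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        content_types = list(pool.map(probe, urls))
    return [url for url, ctype in zip(urls, content_types) if is_html(ctype)]
```

The `probe` parameter exists so the network call can be replaced in tests. Some servers answer HEAD requests incorrectly or not at all, so a real crawler would probably fall back to a GET request when the probe returns None.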
Installation
1. Requirements
Python >= 3.2
pip
2a. Installation without virtualenv
Run the following command in a shell:
pip install NightCrawler
2b. Installation in virtualenv
Run the following commands in a shell:
virtualenv .env
. .env/bin/activate
pip install NightCrawler
2c. Installation from source (development)
To install the package from source, create a virtualenv after cloning the repository:
git clone https://github.com/szczad/NightCrawler.git
cd NightCrawler
virtualenv .env
. .env/bin/activate
pip install -e ./
3. (optional) Testing
When installed from source in development mode, the script can be tested with the following commands:
. .env/bin/activate
python setup.py test
Running the script
0. Help
nightcrawler --help
1. Running the script installed globally
nightcrawler <url|domain>
2. Running the script installed in virtualenv
<path_to_virtualenv>/bin/nightcrawler <url|domain>
or
. .env/bin/activate
nightcrawler <url|domain>
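The command accepts either a URL or a bare domain. How NightCrawler converts a domain into a start URL is not documented here; the sketch below shows one possible normalisation. The helper `normalize_target` and the default `http` scheme are assumptions made for illustration.

```python
from urllib.parse import urlparse


def normalize_target(target, default_scheme="http"):
    """Turn a URL or a bare domain into an absolute start URL."""
    target = target.strip()
    if "://" not in target:
        # A bare domain such as "example.com" has no scheme yet.
        target = "{}://{}".format(default_scheme, target)
    parsed = urlparse(target)
    if not parsed.netloc:
        raise ValueError("not a URL or domain: {!r}".format(target))
    # Make sure the URL has at least the root path.
    return parsed._replace(path=parsed.path or "/").geturl()
```

Under these assumptions, `normalize_target("example.com")` and `normalize_target("http://example.com/")` yield the same start URL.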
Project details
Download files
File details
Details for the file NightCrawler-0.1.6.tar.gz.
File metadata
- Download URL: NightCrawler-0.1.6.tar.gz
- Upload date:
- Size: 5.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/2.7.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 85bbcaeeb7817c88c542278047ce7d92628f29905a19b8b1b1cf0798e5488ef9
MD5 | d741471fa100b57519f521aff7cfc62f
BLAKE2b-256 | e0e7e87f6beee9f249e9ddcae247c835762b598685607036f6ccec61fc41efbc
File details
Details for the file NightCrawler-0.1.6-py3-none-any.whl.
File metadata
- Download URL: NightCrawler-0.1.6-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/2.7.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8f7ef6472322297c6faa4cee16e74688bd791844dfc8ed924c1e53638dd1b11c
MD5 | 4b4c296921da39941b268435ace11227
BLAKE2b-256 | 761c3f936abbd8adda5863dbcf7d2e432ccd9d544481e8a0a661521599e47014
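A downloaded file can be checked against the SHA256 digests listed above with the standard `hashlib` module. The helpers `sha256_of_file` and `matches_digest` below are hypothetical names for a minimal sketch, and the local file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_digest("NightCrawler-0.1.6.tar.gz", "85bbcaeeb7817c88c542278047ce7d92628f29905a19b8b1b1cf0798e5488ef9")` should return True for an intact download.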