A utility for crawling web file explorers and downloading their content.
Project description
The Crawler
Web crawling utility for downloading files from web pages.
Installation
From PyPI
This assumes you have Python 3.10+ installed and pip3 is on your path:
~$ pip3 install the-crawler
...
~$ the-crawler -h
usage: the-crawler [-h] [--recurse] [--output-directory OUTPUT_DIRECTORY] [--extensions EXTENSIONS [EXTENSIONS ...]] [--max-workers MAX_WORKERS] base_url

Crawls given url for content

positional arguments:
  base_url

options:
  -h, --help            show this help message and exit
  --recurse, -r
  --output-directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
  --extensions EXTENSIONS [EXTENSIONS ...], -e EXTENSIONS [EXTENSIONS ...]
  --max-workers MAX_WORKERS
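For example, a typical run might recurse through a file listing and restrict downloads to a few extensions. This is an illustrative invocation only; the URL and directory name are placeholders, not values shipped with the tool, and the output directory must already exist:

~$ mkdir downloads
~$ the-crawler https://example.com/files/ --recurse --output-directory downloads --extensions pdf zip --max-workers 4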
From Source
This assumes you have git, Python 3.10+, and poetry installed already.
~$ git clone git@gitlab.com:woodforsheep/the-crawler.git
...
~$ cd the-crawler
the-crawler$ poetry install
...
the-crawler$ poetry run the-crawler -h
usage: the-crawler [-h] [--quiet] [--verbose] [--collect-only] [--force-collection] [--recurse]
                   [--output-directory OUTPUT_DIRECTORY] [--extensions [EXTENSIONS]]
                   [--max-workers MAX_WORKERS]
                   base_url

Crawls given url for content

positional arguments:
  base_url

options:
  -h, --help            show this help message and exit
  --quiet               Changes the console log level from INFO to WARNING; defers to --verbose
  --verbose             Changes the console log level from INFO to DEBUG; takes precedence over
                        --quiet
  --collect-only        Stops after collecting links to be downloaded; useful for checking the
                        cache before continuing
  --force-collection    Forces recollection of links, even if the cache file is present
  --recurse, -r         If specified, will follow links to child pages and search them for
                        content
  --output-directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
                        The location to store the downloaded content; must already exist
  --extensions [EXTENSIONS], -e [EXTENSIONS]
                        If specified, will restrict the types of files downloaded to those
                        matching the extensions provided; case-insensitive
  --max-workers MAX_WORKERS
                        The maximum number of parallel downloads to support; defaults to
                        os.cpu_count()
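As an illustrative two-step workflow (the URL below is a placeholder): collect the links first and inspect the cache, then re-run to perform the downloads; --force-collection would rebuild the cache if the site has changed since the first pass:

the-crawler$ poetry run the-crawler https://example.com/files/ --recurse --collect-only
...
the-crawler$ poetry run the-crawler https://example.com/files/ --recurse --output-directory downloads --extensions pdf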
Download files
Download the file for your platform.
Source Distribution
the_crawler-0.5.0.tar.gz (5.0 kB)

Built Distribution
the_crawler-0.5.0-py3-none-any.whl (6.3 kB)
File details
Details for the file the_crawler-0.5.0.tar.gz.
File metadata
- Download URL: the_crawler-0.5.0.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.19.0-50-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5f8dd6e2ae48a12dedcb20001ca8a2d4731dd8910c0f7cbca8e51fd8125415ba
MD5 | 5ff1c6753fc70070522bf5556e1c2096
BLAKE2b-256 | 7f73dfd0d5b05a0903a0c2fc4a95acaa170b62f69c3e45635c151d7e23682da6
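To verify a downloaded archive against the table above, a standard checksum tool such as sha256sum (from GNU coreutils, not part of the-crawler) can be used; the digest printed should match the SHA256 row:

~$ sha256sum the_crawler-0.5.0.tar.gz
5f8dd6e2ae48a12dedcb20001ca8a2d4731dd8910c0f7cbca8e51fd8125415ba  the_crawler-0.5.0.tar.gz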
File details
Details for the file the_crawler-0.5.0-py3-none-any.whl.
File metadata
- Download URL: the_crawler-0.5.0-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.19.0-50-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | bac8676821115c5d6e103de965bf61bc76c9aa51411d70ff9a946176d7c9db97
MD5 | 1b581e0a9517f93474b2e3b7e422b0cc
BLAKE2b-256 | 2efd920f8ac8aa3073927cb82fbed939760d9b31b9853d280d40804824a077dc