doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).
== Synopsis
doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://…
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst
doc_crawler.py [--wait=0] --download-file http://…
or
python3 -m doc_crawler […] http://…
== Description
_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the
descendant pages, the encountered document files (by default: PDF, ODT, DOC, XLS, ZIP…)
based on regular expression matching (typically against their extension). Documents can be
listed on the standard output or downloaded (with the _--download_ argument).
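To picture the crawl-and-match loop, here is a minimal Python sketch (an illustration only, not _doc_crawler_'s actual code; the regular expressions and function names are assumptions):

[source,python]
----
# Minimal sketch of the crawl-and-match idea: fetch a page, collect its links,
# keep those matching a document pattern and queue same-site pages for later.
import re
import requests
from urllib.parse import urljoin, urlparse

DOC_RE = re.compile(r'\.(pdf|odt|docx?|xlsx?|zip)$', re.I)   # assumed default-like pattern
HREF_RE = re.compile(r'href="([^"]+)"', re.I)

def crawl(start_url):
    site = urlparse(start_url).netloc
    seen, queue, documents = set(), [start_url], []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url).text
        for link in (urljoin(url, h) for h in HREF_RE.findall(html)):
            if DOC_RE.search(urlparse(link).path):
                documents.append(link)      # a wanted document: list or download it
            elif urlparse(link).netloc == site:
                queue.append(link)          # a descendant page: explore it later
    return documents
----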
To address real-life situations, activities can be logged (with _--verbose_). +
Also, the search can be limited to one page (with the _--single-page_ argument).
Documents can be downloaded from a given list of URLs, which you may have previously
produced using the default options of _doc_crawler_ and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`
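The produced list can then be fed back to _doc_crawler_ in a second pass, for instance: +
`./doc_crawler.py --download-files url.lst`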
Documents can also be downloaded one by one if necessary (to finish the work), using the
_--download-file_ argument, which makes _doc_crawler_ a self-sufficient tool to assist you
at every step.
By default, the program waits a randomly-picked number of seconds, between 1 and 5, before each
download, to avoid being rude toward the web server it interacts with (and thus avoid being
blacklisted). This behavior can be disabled (with the _--no-random-wait_ and/or the _--wait=0_
argument).
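The throttling can be summarised by the following Python sketch (an illustration only; the real option handling in _doc_crawler_ may differ):

[source,python]
----
# Illustrative sketch of the default throttling: wait a random 1..wait seconds
# before each request, unless random waiting is disabled or wait is 0.
import random
import time

def throttle(wait=5, random_wait=True):
    if wait <= 0:
        return                      # --wait=0 : no pause at all
    delay = random.randint(1, wait) if random_wait else wait
    time.sleep(delay)               # be polite to the web server
----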
_doc_crawler.py_ works great with Tor: `torsocks doc_crawler.py http://…`
== Options
*--accept*=_jpe?g$_::
Optional case-insensitive regular expression; only document names matching it are kept
(a small matching sketch follows this list).
Example: _--accept=jpe?g$_ will keep all of: .JPG, .JPEG, .jpg, .jpeg
*--download*::
Directly downloads the found documents if set; outputs their URLs otherwise.
*--single-page*::
Limits the search for documents to the given page only.
*--verbose*::
Creates a log file to keep a trace of what was done.
*--wait*=x::
Changes the maximum waiting time before each download (page or document).
Example: _--wait=3_ will wait between 1 and 3s before each download. Default is 5.
*--no-random-wait*::
Disables the random choice of waiting times; the _--wait_ value (or its default) is then used as-is.
*--download-files* url.lst::
Downloads every document whose URL is listed in the given file.
Example : _--download-files url.lst_
*--download-file* http://…::
Directly saves, in the current folder, the document pointed to by the given URL.
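The sketch announced in the _--accept_ entry: a minimal illustration (not _doc_crawler_'s actual code) of how such a case-insensitive pattern keeps matching names:

[source,python]
----
# Sketch of case-insensitive filtering with an --accept-like pattern.
import re

accept = re.compile(r'jpe?g$', re.I)                       # the pattern given to --accept
names = ['photo.JPG', 'scan.jpeg', 'notes.txt', 'cover.JPEG']
kept = [n for n in names if accept.search(n)]
print(kept)   # ['photo.JPG', 'scan.jpeg', 'cover.JPEG']
----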
== Tests
Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following
command in the cloned repository root: +
`python3 -m doctest doc_crawler.py`
Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`
The tests pass if nothing is output.
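For readers unfamiliar with the format, a purely hypothetical doctest (not taken from _doc_crawler.py_) looks like this:

[source,python]
----
# Hypothetical illustration of the doctest style: the expected output is written
# right under the interactive call and compared with the real output when run.
def slugify(name):
    """Lower-case a document name.

    >>> slugify('Report.PDF')
    'report.pdf'
    """
    return name.lower()

if __name__ == '__main__':
    import doctest
    doctest.testmod()    # silent when every example passes
----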
== Requirements
- requests
- yaml
Under Debian, they can be installed with: `apt install python3-requests python3-yaml`
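They can also be installed with pip (outside Debian, for instance): `python3 -m pip install requests pyyaml`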
== Author
Simon Descarpentries - https://s.d12s.fr
== Resources
GitHub repository: https://github.com/Siltaar/doc_crawler.py +
PyPI repository: https://pypi.python.org/pypi/doc_crawler
== Support
To support this project, you may consider a donation (even a symbolic one) via: https://liberapay.com/Siltaar
== Licence
GNU General Public License v3.0. See LICENCE file for more information.
== Source distribution
doc_crawler-1.2.tar.gz (6.2 kB)

[options="header"]
|===
| Algorithm | Hash digest
| SHA256 | 148a2f660520a6334ebc6c19721776642dd458288fb091cd4e42554cb0d8453c
| MD5 | 4a9ad71302fffd7a30901eefe1caa3a8
| BLAKE2b-256 | c61599098901d30e2d055c138be7d594ab14794bc3475bd0713bcc8c0df305b3
|===