
doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).

== Synopsis
doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://…
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst
doc_crawler.py [--wait=0] --download-file http://…
or
python3 -m doc_crawler […] http://…

== Description
_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the
descendant pages, the document files it encounters (by default: PDF, ODT, DOC, XLS, ZIP…),
based on regular expression matching (typically against their extension). Documents can be
listed on the standard output or downloaded (with the _--download_ argument).
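
As an illustration only (not the project's actual code), a pattern in the spirit of the default filter could be applied to candidate links like this — the pattern below is a shortened stand-in for the real default list:

[source,python]
----
import re

# Hypothetical, shortened stand-in for the default document filter.
DOC_PATTERN = re.compile(r"\.(pdf|odt|doc|xls|zip)$", re.IGNORECASE)

links = ["http://example.org/report.PDF", "http://example.org/index.html"]
print([url for url in links if DOC_PATTERN.search(url)])
# ['http://example.org/report.PDF']
----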

To address real-life situations, activities can be logged (with _--verbose_). +
Also, the search can be limited to one page (with the _--single-page_ argument).

Documents can be downloaded from a given list of URLs, which you may have previously
produced using the default options of _doc_crawler_ and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`
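
Such a list can then be fed back for retrieval with: +
`./doc_crawler.py --download-files url.lst`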

Documents can also be downloaded one by one if necessary (to finish the work), using the
_--download-file_ argument, which makes _doc_crawler_ a self-sufficient tool to assist you
at every step.
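
As a rough, purely illustrative sketch of what saving a single URL to the current folder can amount to, relying on the _requests_ dependency listed under Requirements — the helper name and details below are assumptions, not _doc_crawler_'s actual code:

[source,python]
----
import os
import requests

def save_document(url):
    # Hypothetical helper: stream the response to a file named after the
    # last URL segment, in the current folder.
    filename = os.path.basename(url.rstrip("/")) or "index"
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(filename, "wb") as output:
            for chunk in response.iter_content(chunk_size=8192):
                output.write(chunk)

save_document("http://example.org/report.pdf")
----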

By default, the program waits a randomly picked number of seconds, between 1 and 5, before each
download to avoid being rude toward the webserver it interacts with (and so avoid being
blacklisted). This behavior can be disabled (with a _--no-random-wait_ and/or a _--wait=0_
argument).
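
A minimal sketch of this throttling behavior, assuming nothing about the real implementation beyond what is described above (function and parameter names are illustrative):

[source,python]
----
import random
import time

def polite_wait(max_wait=5, random_wait=True):
    # Sleep before a download: between 1 and max_wait seconds by default,
    # exactly max_wait seconds with --no-random-wait, not at all with --wait=0.
    if max_wait <= 0:
        return
    delay = random.randint(1, max_wait) if random_wait else max_wait
    time.sleep(delay)
----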

_doc_crawler.py_ works great with Tor: `torsocks doc_crawler.py http://…`

== Options
*--accept*=_jpe?g$_::
Optional regular expression (case-insensitive) used to keep only the matching document names.
Example: _--accept=jpe?g$_ will keep all of: .JPG, .JPEG, .jpg, .jpeg
*--download*::
Directly downloads the found documents if set, outputs their URLs if not.
*--single-page*::
Limits the search for documents to download to the given URL.
*--verbose*::
Creates a log file to keep track of what was done.
*--wait*=x::
Changes the default waiting time before each download (page or document).
Example: _--wait=3_ will wait between 1 and 3 seconds before each download. Default is 5.
*--no-random-wait*::
Disables the random picking of waiting times; the _--wait_ value (or the default) is used as-is.
*--download-files* url.lst::
Downloads each document whose URL is listed in the given file.
Example: _--download-files url.lst_
*--download-file* http://…::
Directly saves the document pointed to by the given URL in the current folder.
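
These options can be combined; for instance, to download every JPEG found while logging the crawl and waiting at most 3 seconds between requests: +
`./doc_crawler.py --accept=jpe?g$ --wait=3 --verbose --download http://…`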

== Tests
Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following
command in the cloned repository root: +
`python3 -m doctest doc_crawler.py`

Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`

Tests pass successfully if nothing is output.

== Requirements
- requests
- yaml

One can install them under Debian using the following command: `apt install python3-requests python3-yaml`

== Author
Simon Descarpentries - https://s.d12s.fr

== Resources
GitHub repository: https://github.com/Siltaar/doc_crawler.py +
PyPI repository: https://pypi.python.org/pypi/doc_crawler

== Support
To support this project, you may consider a donation (even a symbolic one) via: https://liberapay.com/Siltaar

== Licence
GNU General Public License v3.0. See LICENCE file for more information.
