A library to create a bot / spider / crawler.
Exoskeleton
For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. Using a high-speed connection carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents.
Exoskeleton is a Python framework that helps you build a crawler / scraper that avoids putting too much load on the connection. Instead, it runs permanently and fault-tolerantly until it has downloaded all files.
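The general idea of trading speed for a low, steady load can be sketched as follows. This is a minimal illustration and not exoskeleton's actual implementation: the function name `process_politely`, its parameters, and the default wait times are assumptions made for this example.

```python
import random
import time
from typing import Callable, Iterable, List


def process_politely(urls: Iterable[str],
                     fetch: Callable[[str], None],
                     min_wait: float = 5.0,
                     max_wait: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch URLs one at a time with a random pause after each request.

    Returns the URLs that failed, so they can be retried later
    instead of stopping the whole run.
    """
    failed = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            # Fault tolerance: note the failure and carry on.
            failed.append(url)
        # Random wait so requests never reach the server in a burst.
        sleep(random.uniform(min_wait, max_wait))
    return failed
```

Passing `sleep` as a parameter is only a convenience for this sketch: it lets the waiting be replaced by a stub when trying the function out.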
Main functionalities are:
- Managing the download queue within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through that queue by either
  - downloading files to disk,
  - storing the page source code into a database table,
  - storing the page text,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads, so they can be found and grouped easily.
- Sending progress reports to the admin.
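Duplicate avoidance and labels from the list above can be illustrated with a small queue table. The sketch below uses Python's built-in `sqlite3` in place of MariaDB so that it runs without a server. The table layout and the function names `create_schema` and `add_to_queue` are hypothetical and do not reflect exoskeleton's real schema.

```python
import sqlite3
from typing import Iterable


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a queue table with unique URLs and a table for labels."""
    conn.executescript("""
        CREATE TABLE queue (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE labels (
            queue_id INTEGER NOT NULL REFERENCES queue(id),
            label TEXT NOT NULL,
            UNIQUE (queue_id, label)
        );
    """)


def add_to_queue(conn: sqlite3.Connection,
                 url: str,
                 labels: Iterable[str] = ()) -> bool:
    """Queue a URL once; return False if it was already known.

    Labels are attached in both cases, so a duplicate request can
    still add new labels to the existing entry.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    is_new = cur.rowcount == 1
    queue_id = conn.execute(
        "SELECT id FROM queue WHERE url = ?", (url,)).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO labels (queue_id, label) VALUES (?, ?)",
        [(queue_id, label) for label in labels])
    conn.commit()
    return is_new
```

The `UNIQUE` constraint on `url` lets the database itself reject duplicates, so the check stays correct even if several processes add URLs at the same time.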
Documentation
How To Use Exoskeleton
- Installation and Requirements
- Create a Bot
- Dealing with result pages
- Avoiding duplicates
- The Queue: Downloading files / Saving the page code / Creating PDF
- Bot Behavior
- Progress Reports via Email
- File Versions and Labels
- Using the Blocklist
Example Uses
- Downloading an Archive: A fairly complex use case that requires some custom SQL. This is the actual project that triggered the development of exoskeleton.
Technical Documentation
Example
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging
import exoskeleton
logging.basicConfig(level=logging.DEBUG)
# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # Set to True to stop once the queue is empty. Otherwise the bot
    # keeps checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)
exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.
exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.
exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.
# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')
# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')
# Work through the queue:
exo.process_queue()
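The naming rule mentioned in the example (prefix, then database id, then the file extension) could be expressed roughly like this. The helper `build_filename` is hypothetical and only illustrates the rule. Taking the extension from the URL path is an assumption made for this sketch; how exoskeleton itself determines it is not stated here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_filename(prefix: str, database_id: int, url: str) -> str:
    """Combine prefix, database id, and the extension found in the URL path."""
    # urlparse separates the path from any query string or fragment.
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"{prefix}{database_id}{extension}"


name = build_filename(
    'bot_', 1, 'https://www.ruediger-voigt.eu/examplefile.txt')
print(name)  # bot_1.txt
```

Using the database id instead of the remote file name avoids collisions when different servers offer files with the same name.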
Download files
Source Distribution
exoskeleton-1.2.4.tar.gz (29.2 kB)
Built Distribution
Hashes for exoskeleton-1.2.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d44ad6b056f984163332a716517b96458543b4edbc82210c552ec52be5da01d8
MD5 | 599a8148fc056f905dad822f958f5746
BLAKE2b-256 | a010655ec903b77605809c0e64a556d4a1fb4d32c275837fba435d78958d2c24