A library to create a bot / spider / crawler.

Exoskeleton

For my dissertation I download hundreds of thousands of documents and feed them into an ML system. A 1 Gbit/s connection is helpful, but it carries the risk of running an involuntary denial-of-service attack on the servers that provide the documents.

That creates a need for a crawler / scraper that avoids putting too high a load on the connection, yet runs permanently and is fault tolerant, so that it ultimately downloads all files.

Exoskeleton is a Python framework that aims for that goal. It has four main functionalities:

  • Managing a download queue within a SQL database.
  • Working through that queue by downloading files to disk and page source code into a database table.
  • Avoiding processing the same URL more than once.
  • Sending progress reports to the admin.
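
The first and third points can be illustrated with a small, self-contained sketch. It is not exoskeleton's implementation: it assumes an in-memory SQLite database from Python's standard library instead of the project's own database scripts, and the table name demo_queue and both helper functions are made up for this example. A UNIQUE constraint on the URL column lets the database itself reject duplicates.

```python
import sqlite3


def create_demo_queue():
    """Return a connection to an in-memory database with a hypothetical queue table."""
    connection = sqlite3.connect(':memory:')
    connection.execute(
        'CREATE TABLE demo_queue ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'url TEXT NOT NULL UNIQUE)'
    )
    return connection


def add_to_demo_queue(connection, url):
    """Add a URL; return True if it was new, False if it was already queued."""
    # INSERT OR IGNORE silently skips rows that would violate the UNIQUE constraint.
    cursor = connection.execute(
        'INSERT OR IGNORE INTO demo_queue (url) VALUES (?)', (url,)
    )
    connection.commit()
    return cursor.rowcount == 1


connection = create_demo_queue()
first = add_to_demo_queue(connection, 'https://www.example.com/a.txt')
second = add_to_demo_queue(connection, 'https://www.example.com/a.txt')
queue_length = connection.execute(
    'SELECT COUNT(*) FROM demo_queue'
).fetchone()[0]
print(first, second, queue_length)
```

Leaving the duplicate check to the database rather than to the script means that several scripts can feed the same queue without coordinating with each other.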

To analyze the content of a page I recommend the Beautiful Soup package.

Installation and Use

Please take note that exoskeleton’s development status is "beta version". This means it may still contain some bugs and some commands could change with one of the next releases.

  1. Exoskeleton requires a database backend. Create a separate database for your project and create the necessary tables. You can find scripts to create them on the GitHub project page in the folder named Database-Scripts.
  2. Create a database user with read / write / update rights for this database. The crawler will use it to access and manage the queue. That account needs no permissions on other databases and therefore should not have them.
  3. Install exoskeleton using pip or pip3. For example: pip install exoskeleton. You may consider using a virtualenv.
  4. Exoskeleton sets reasonable defaults, but you have to set at least some parameters. See the code examples below.
  5. Add something to the queue and let exoskeleton do its job.

Examples

Basic Functionality

First create a database and a separate user for your bot. Then use the Database-Script to create the table structure.

Put username and passphrase for the database into a separate file called credentials.py. If you store your bots in git, it might be a good idea to exclude the credentials file from uploads via the ignore list.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# File: credentials.py
user = 'databaseusername'
passphrase = 'secret_passphrase'

Now create a file that contains your bot:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# File: bot.py

import logging
import exoskeleton
import credentials

# exoskeleton makes heavy use of the built-in
# logging functionality. Change the level to
# INFO to see fewer messages.
logging.basicConfig(level=logging.DEBUG)

# create an object to setup the framework
queueManager = exoskeleton.Exoskeleton(
    database_host='ruediger-voigt.eu',
    database_name='exoskeleton',
    database_user=credentials.user,
    database_passphrase=credentials.passphrase
)

print(queueManager.num_items_in_queue())
print(queueManager.estimate_remaining_time())

Run the bot to see if the database connection works. The output with this setup should be:

INFO:root:You are using exoskeleton in version 0.5.0 (beta)
INFO:root:No port number supplied. Will try standard port instead.
WARNING:root:No mail address supplied. Unable to send emails.
WARNING:root:No mail address supplied. Unable to send emails.
WARNING:root:Target directory is not set. Using the current working directory /home/censored_path to store files!
DEBUG:root:Chosen hashing method is available on the system.
INFO:root:Hash method set to sha1
INFO:root:sha1 is fast, but a weak hashing algorithm. Consider using another method if security is important.
DEBUG:root:started timer
DEBUG:root:Trying to connect to database.
INFO:root:Made database connection.
DEBUG:root:Checking if the database table structure is complete.
DEBUG:root:Found table actions
DEBUG:root:Found table queue
DEBUG:root:Found table errorType
DEBUG:root:Found table eventLog
DEBUG:root:Found table fileMaster
DEBUG:root:Found table storageTypes
DEBUG:root:Found table fileVersions
DEBUG:root:Found table fileContent
INFO:root:Found all expected tables.
0
WARNING:root:Cannot estimate remaining time as there are no data so far.
-1
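
The final -1 in the output above is what estimate_remaining_time() returned when no estimate was available. A script that reports progress may want to handle that value explicitly. The following is a minimal sketch: the helper describe_estimate is hypothetical, and it assumes that any other return value is a number of seconds, which the output above does not confirm.

```python
def describe_estimate(estimate):
    """Turn an estimate into a readable string.

    Assumptions: -1 means "no estimate available" (as in the output
    above) and any other value is a number of seconds (the unit is a guess).
    """
    if estimate == -1:
        return 'unknown'
    minutes, seconds = divmod(int(estimate), 60)
    hours, minutes = divmod(minutes, 60)
    return f'{hours:d}h {minutes:02d}m {seconds:02d}s'


no_data = describe_estimate(-1)
some_data = describe_estimate(3725)
print(no_data)
print(some_data)
```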

There is nothing in the queue, and it is not possible to estimate the remaining time because the crawler has not run yet. So let's change that by adding some things to the queue:

queueManager.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
queueManager.add_file_download('https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
queueManager.add_save_page_code('https://www.ruediger-voigt.eu/')

Now tell your bot to work through the queue:

queueManager.process_queue()

After exoskeleton has worked through the queue, it enters a wait state.

The idea behind this behavior is that multiple scripts can feed the queue. The queue might be empty at one moment, but new tasks may be entered some seconds later. So the standard behavior for exoskeleton is to check the queue regularly.

You can change that behavior by setting an optional parameter. Change the code above to:

queueManager = exoskeleton.Exoskeleton(
    database_host='ruediger-voigt.eu',
    database_name='exoskeleton',
    database_user=credentials.user,
    database_passphrase=credentials.passphrase,
    queue_stop_on_empty=True # NEW
)

Now exoskeleton will stop once the queue is empty.
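
The difference between the two modes can be sketched with a generic polling loop. This is an illustration only, not exoskeleton's code: the function work_through, its parameters other than the stop-on-empty idea, and the five second pause are all assumptions made for this example. The max_idle_checks argument exists only so that the demonstration terminates.

```python
import time
from collections import deque


def work_through(queue, handle, stop_on_empty=False,
                 wait_seconds=5.0, max_idle_checks=None, sleep=time.sleep):
    """Process items until told to stop; return the number of items handled.

    All names are hypothetical. A long-running bot would leave
    max_idle_checks at None and keep waiting for new tasks.
    """
    handled = 0
    idle_checks = 0
    while True:
        if queue:
            handle(queue.popleft())
            handled += 1
            idle_checks = 0
            continue
        # The queue is empty: either stop or wait and check again.
        if stop_on_empty:
            break
        if max_idle_checks is not None and idle_checks >= max_idle_checks:
            break
        idle_checks += 1
        sleep(wait_seconds)
    return handled


seen = []
pauses = []

# Stopping mode: returns as soon as the queue is empty.
tasks = deque(['task-1', 'task-2'])
handled_stopping = work_through(tasks, seen.append, stop_on_empty=True)

# Waiting mode: record the pauses instead of really sleeping.
handled_waiting = work_through(deque(), seen.append,
                               max_idle_checks=3, sleep=pauses.append)
print(handled_stopping, handled_waiting, pauses)
```

Passing the sleep function as an argument keeps the sketch testable without actually waiting.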

Sending Progress Reports by Email

Exoskeleton can send email when it reaches a milestone or finishes the job.

Note that sending email from a system with a dynamic IP address usually does not work, as most mail servers will classify such messages as spam. Even if you send from a machine with a static IP address, many things might go wrong. For example, there might be an SPF setting for the sending domain.

For this reason the parameter mail_send_start defaults to True. Once a sender and a receiver are defined, the bot tries to send an email, so you notice early if delivery fails. Once you have a working setup, you can switch that off by setting the parameter to False.
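
If you want to check mail delivery from your machine independently of the bot, a test message can be built with Python's standard library. This is a rough sketch and not the way exoskeleton sends its mail: both functions, the addresses, and the assumption of an SMTP server on localhost port 25 are made up for illustration. The sending function is defined but deliberately not called here.

```python
import smtplib
from email.message import EmailMessage


def build_start_message(sender, recipient, bot_name):
    """Create a short notification message (hypothetical helper)."""
    message = EmailMessage()
    message['From'] = sender
    message['To'] = recipient
    message['Subject'] = f'{bot_name} started'
    message.set_content(
        f'The bot {bot_name} has started. '
        'If you can read this, mail delivery works.'
    )
    return message


def send_via_local_server(message, host='localhost', port=25):
    """Hand the message to an SMTP server.

    Assumption: a mail server accepts connections on host:port.
    Not called in this example.
    """
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


message = build_start_message(
    'bot@example.com', 'admin@example.com', 'demo-bot'
)
print(message['Subject'])
```

If the test message does not arrive, check the spam folder of the receiving account and the mail server logs before changing the bot's settings.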

Further Documentation
