A library to create a bot / spider / crawler.

These details have not been verified by PyPI

Project links

Homepage

Project description

Exoskeleton

For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. A high-speed connection helps, but it carries the risk of launching an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that avoids putting too much load on the connection and instead runs permanently and fault-tolerantly until it has downloaded all files.

Exoskeleton is a Python framework that aims to help you build a similar bot. Its main features are:

Managing a download queue within a MariaDB database.
Avoiding processing the same URL more than once.
Working through that queue by either
- downloading files to disk,
- storing the page source code into a database table,
- storing the page text,
- or making PDF-copies of webpages.
Managing already downloaded files:
- Storing multiple versions of a specific file.
- Assigning labels to downloads, so they can be found and grouped easily.
Sending progress reports to the admin.
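
The first two features and the labelling of downloads can be illustrated with a small sketch. This is not exoskeleton's implementation: it is a toy queue under stated assumptions, with an in-memory SQLite database standing in for MariaDB, and the names `DownloadQueue`, `url_key`, `add`, `size` and `labels_for` are hypothetical. It only shows the idea of storing each URL once while still attaching labels to duplicates.

```python
import hashlib
import sqlite3
from typing import Iterable, Set


def url_key(url: str) -> str:
    """Return a fixed-length fingerprint used to detect duplicate URLs."""
    return hashlib.sha256(url.encode('utf-8')).hexdigest()


class DownloadQueue:
    """Toy queue (hypothetical): each URL is stored once, labels accumulate."""

    def __init__(self) -> None:
        # Assumption: in-memory SQLite stands in for the MariaDB database.
        self.db = sqlite3.connect(':memory:')
        self.db.execute(
            'CREATE TABLE queue ('
            'id INTEGER PRIMARY KEY AUTOINCREMENT, '
            'url_hash TEXT UNIQUE NOT NULL, '
            'url TEXT NOT NULL)')
        self.db.execute(
            'CREATE TABLE labels ('
            'url_hash TEXT NOT NULL, '
            'label TEXT NOT NULL, '
            'UNIQUE (url_hash, label))')

    def add(self, url: str, labels: Iterable[str] = ()) -> bool:
        """Queue the URL; return False if it was already known."""
        key = url_key(url)
        # The UNIQUE constraint makes the database reject duplicates.
        cursor = self.db.execute(
            'INSERT OR IGNORE INTO queue (url_hash, url) VALUES (?, ?)',
            (key, url))
        newly_added = cursor.rowcount == 1
        # Labels are attached even when the URL itself is a duplicate.
        self.db.executemany(
            'INSERT OR IGNORE INTO labels (url_hash, label) VALUES (?, ?)',
            [(key, label) for label in labels])
        self.db.commit()
        return newly_added

    def size(self) -> int:
        """Return the number of distinct URLs in the queue."""
        return self.db.execute('SELECT COUNT(*) FROM queue').fetchone()[0]

    def labels_for(self, url: str) -> Set[str]:
        """Return all labels attached to the given URL."""
        rows = self.db.execute(
            'SELECT label FROM labels WHERE url_hash = ?', (url_key(url),))
        return {row[0] for row in rows}
```

Leaving duplicate detection to a uniqueness constraint in the database, rather than checking in Python first, keeps the check and the insert in one step.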

Exoskeleton has extensive documentation.

Two other Python libraries were created as part of this project:

userprovided : check user input for validity and plausibility / convert input into better formats
bote : send messages (currently via a local or remote SMTP server)

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # Set to True to stop once the queue is empty. Otherwise the bot
    # keeps looking for new tasks in the queue:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the web page:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# work through the queue:
exo.process_queue()
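
The example notes that a nonexistent file is marked but does not stop the bot. The following is a minimal sketch of that general pattern, not of what `process_queue()` does internally. The function `work_through`, its `fetch` callable and the `wait_seconds` pause are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Iterable


def work_through(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 wait_seconds: float = 1.0) -> Dict[str, str]:
    """Fetch every URL once and record 'done' or 'failed' for each.

    `fetch` is any callable that returns the content or raises OSError
    (urllib's URLError and HTTPError are subclasses of OSError).
    """
    status: Dict[str, str] = {}
    for url in urls:
        try:
            fetch(url)
            status[url] = 'done'
        except OSError:
            # Mark the task as failed instead of aborting the whole run.
            status[url] = 'failed'
        # Pause between requests to keep the load on the server low.
        time.sleep(wait_seconds)
    return status
```

Passing the fetcher in as a parameter means the loop can be tried out with a fake function before it is pointed at a real server.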

Project details

Release history

Version	Release date
2.1.1	Apr 27, 2022
2.1.0	Nov 3, 2021
2.0.0	Jul 2, 2021
1.3.0	May 29, 2021
1.2.5	Apr 18, 2021
1.2.4	Mar 16, 2021
1.2.3	Feb 7, 2021
1.2.2	Jan 20, 2021
1.2.1	Nov 30, 2020 (this version)
1.2.0	Nov 16, 2020
1.1.0	Oct 29, 2020
1.0.0	Jul 23, 2020
0.9.3	Jul 10, 2020
0.9.2	Jul 7, 2020
0.9.1	Jun 22, 2020
0.9.0	Apr 27, 2020
0.8.2	Feb 21, 2020
0.8.1	Feb 18, 2020
0.8.0	Feb 14, 2020
0.7.1	Jan 27, 2020
0.7.0	Jan 16, 2020
0.6.3	Dec 20, 2019
0.6.2	Dec 11, 2019
0.6.1	Dec 7, 2019
0.6.0	Nov 26, 2019
0.5.2	Nov 7, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exoskeleton-1.2.1.tar.gz (28.1 kB)

Uploaded Nov 30, 2020 Source

Built Distribution

exoskeleton-1.2.1-py3-none-any.whl (37.4 kB)

Uploaded Nov 30, 2020 Python 3

Hashes for exoskeleton-1.2.1.tar.gz

Algorithm	Hash digest
SHA256	`e81bd95f91d339be24049c029b5eb6789c52bfdf431de842c025e294bd5e8014`
MD5	`96e12d0c77ab76994c7af1b3ea1bed89`
BLAKE2b-256	`05a5340ef762b7f727e6dc1b7cb30dd15bc376d1f0b2b9edc41bda6efa29e074`

Hashes for exoskeleton-1.2.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`cb5b34643361186f81c00ddc533ac36c820f861b7e0c811456c406f4874efbe0`
MD5	`fb13d2d6b0231cdaa0bdce922c0bed5e`
BLAKE2b-256	`bb77bf32ebfd0abb5fc460d47193e6e1b2b59faf6120a17696ad838e25bb4d32`
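
The digests above can be used to check a downloaded file before installing it. Below is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of` and `matches_published_hash` are hypothetical, and the file location in the commented usage lines is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open('rb') as handle:
        # Reading in chunks avoids loading a large file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution is in the current directory:
# ok = matches_published_hash(
#     Path('exoskeleton-1.2.1.tar.gz'),
#     'e81bd95f91d339be24049c029b5eb6789c52bfdf431de842c025e294bd5e8014')
```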