A library to create a bot / spider / crawler.

Project description

Exoskeleton

Machine learning and other applications often require downloading thousands, sometimes hundreds of thousands, of files.

Using a high-speed connection carries the risk of running an involuntary denial-of-service attack against the servers that provide those files and webpages.

Exoskeleton is a Python framework that helps you build a crawler / scraper that avoids putting too high a load on the connection. Instead, it runs permanently and fault-tolerantly until all files have been downloaded.
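
The idea of trading speed for a low, steady load can be illustrated with a minimal sketch. This is not exoskeleton's implementation: the function `polite_fetch_all`, its parameters, and the default delay are hypothetical names chosen for illustration. The `fetch` callable is supplied by the caller, so the sketch makes no assumption about which HTTP library is used.

```python
import time
from typing import Callable, Dict, Iterable


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Fetch URLs one at a time, pausing between requests.

    Hypothetical helper: a failing URL is recorded and skipped,
    so a single error does not stop the whole run.
    """
    results: Dict[str, object] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait between requests to limit the load on the server.
            sleep(min_delay)
        try:
            results[url] = fetch(url)
        except Exception as error:  # broad on purpose: keep the loop alive
            results[url] = error
    return results
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting. A real crawler would likely also persist its state, so an interrupted run can resume.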

Its main functionalities are:

- Managing the download queue and document data within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through the queue by either
  - downloading files to disk,
  - storing the page source code into a database table,
  - storing the page text,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads, so they can be found and grouped easily.
- Sending progress reports to the admin.
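
How a database-backed queue can reject duplicate URLs while still recording labels can be sketched as follows. This is an illustration under stated assumptions, not exoskeleton's schema: it uses an in-memory SQLite database from the standard library as a stand-in for MariaDB, and the table names, column names, and the functions `create_queue` and `add_to_queue` are hypothetical.

```python
import sqlite3
from typing import Iterable


def create_queue(connection: sqlite3.Connection) -> None:
    """Create a hypothetical two-table layout: queue entries and labels."""
    connection.execute(
        "CREATE TABLE queue ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)")
    connection.execute(
        "CREATE TABLE labels ("
        "url TEXT NOT NULL, label TEXT NOT NULL, UNIQUE (url, label))")


def add_to_queue(connection: sqlite3.Connection,
                 url: str,
                 labels: Iterable[str] = ()) -> bool:
    """Queue a URL once; always attach the given labels.

    Returns True if the URL was newly added, False if it was a duplicate.
    """
    # The UNIQUE constraint makes the database reject a repeated URL.
    cursor = connection.execute(
        "INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    # Labels are stored even when the URL itself was already queued.
    connection.executemany(
        "INSERT OR IGNORE INTO labels (url, label) VALUES (?, ?)",
        [(url, label) for label in labels])
    return cursor.rowcount == 1
```

Letting a uniqueness constraint do the duplicate check keeps the logic in one place, which matters when several processes add to the same queue.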

Documentation

How To Use Exoskeleton

Example Uses

Downloading an Archive: A fairly complex use case that requires some custom SQL. This is the actual project that triggered the development of exoskeleton.

Technical Documentation

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # True to stop once the queue is empty. Otherwise the bot
    # keeps checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the webpage:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# Work through the queue:
exo.process_queue()
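
The comments in the example say a downloaded file is named with the chosen prefix, followed by the database id and the file extension. A minimal sketch of that naming rule is shown here. The function `build_filename` is hypothetical and is not part of exoskeleton's API, and taking the extension from the URL path is an assumption made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_filename(prefix: str, database_id: int, url: str) -> str:
    """Combine prefix, database id, and the extension found in the URL path."""
    # Assumption: the extension is whatever suffix the URL path ends with.
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"{prefix}{database_id}{extension}"


print(build_filename('bot_', 1,
                     'https://www.ruediger-voigt.eu/examplefile.txt'))
# prints: bot_1.txt
```

Using the database id instead of the remote filename avoids collisions when two different URLs end in the same name.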

Project details

Release history

Version	Release date
2.1.1 (this version)	Apr 27, 2022
2.1.0	Nov 3, 2021
2.0.0	Jul 2, 2021
1.3.0	May 29, 2021
1.2.5	Apr 18, 2021
1.2.4	Mar 16, 2021
1.2.3	Feb 7, 2021
1.2.2	Jan 20, 2021
1.2.1	Nov 30, 2020
1.2.0	Nov 16, 2020
1.1.0	Oct 29, 2020
1.0.0	Jul 23, 2020
0.9.3	Jul 10, 2020
0.9.2	Jul 7, 2020
0.9.1	Jun 22, 2020
0.9.0	Apr 27, 2020
0.8.2	Feb 21, 2020
0.8.1	Feb 18, 2020
0.8.0	Feb 14, 2020
0.7.1	Jan 27, 2020
0.7.0	Jan 16, 2020
0.6.3	Dec 20, 2019
0.6.2	Dec 11, 2019
0.6.1	Dec 7, 2019
0.6.0	Nov 26, 2019
0.5.2	Nov 7, 2019

Download files

Source Distribution

exoskeleton-2.1.1.tar.gz (28.7 kB)

Uploaded Apr 27, 2022 Source

Built Distribution

exoskeleton-2.1.1-py3-none-any.whl (40.9 kB)

Uploaded Apr 27, 2022 Python 3

File details

Details for the file exoskeleton-2.1.1.tar.gz.

File metadata

Download URL: exoskeleton-2.1.1.tar.gz
Upload date: Apr 27, 2022
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for exoskeleton-2.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b9ebca371be6ed368a460d0ecb2b692f8bfc73a4434aac0276c57500259f6d21`
MD5	`f42fe31587875d8f80293f55df64f0e2`
BLAKE2b-256	`f4b7824212638aba455b5a661d577adf2b3b901b9549bb7b93f3fa5d0d334199`

File details

Details for the file exoskeleton-2.1.1-py3-none-any.whl.

File metadata

Download URL: exoskeleton-2.1.1-py3-none-any.whl
Upload date: Apr 27, 2022
Size: 40.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for exoskeleton-2.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c84298fa2ee2a0b2c63d3cfe384dcc4fa62e156bb42ce9288ffa7fe81dd9c44`
MD5	`f25f23ce2bd2a9a78f874cf91ba2908d`
BLAKE2b-256	`9f5298e510b3fc9fdba27ee347419a047ea55b7d52755590abd956772857de5d`
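
A downloaded archive can be checked against a published SHA256 digest using only the standard library. The sketch below is one way to do it. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open('rb') as handle:
        # Reading in chunks keeps memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved to the current directory, `matches_published_hash(Path('exoskeleton-2.1.1.tar.gz'), 'b9ebca371be6ed368a460d0ecb2b692f8bfc73a4434aac0276c57500259f6d21')` should return True for an intact download.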
