A simple python web crawler


ekrhizoc (E6c): A web crawler

Contents

  1. Definition
  2. Use Case
  3. Configuration
  4. Development
  5. Testing
  6. Versioning
  7. Deployment
  8. Production

Definition

εκρίζωση (Greek) ekrízosi / uprooting, eradication

Also known as E6c.

Use Case

Implementation of a simple Python web crawler.
Input: URL (seed).
Output: Simple textual sitemap (to show links between pages).

Requirements

  • The crawler is limited to one subdomain (exclude external links).
  • No use of web crawling libraries/frameworks (e.g. scrapy).
  • (Optional) Use of HTML handling Libraries/Frameworks.
  • Production-ready code.

Assumptions

  • The input URL (seed) is limited to only one at every run.
  • The targeted URL(s) are static pages (no JavaScript rendering required).
  • Links are extracted from HTML anchor <a> elements.
  • A link is valid if all of the following hold:
    • It is a valid URL
      • Non-empty
      • Matches a valid URL pattern
      • Does not exceed the E6C_MAX_URL_LENGTH length in characters
      • If relative, it can be converted to a full URL
    • Link has not been visited before
    • Link does not point to an ignored file type
    • Link has the same domain as the seed URL
    • Link is not restricted by the robots.txt file
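The checks above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names AnchorParser, to_full_url and valid_links are hypothetical, the constants mirror the documented defaults, and the robots.txt check is left out for brevity.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set
from urllib.parse import urljoin, urlparse

MAX_URL_LENGTH = 300  # documented default of E6C_MAX_URL_LENGTH
IGNORE_FILETYPES = (".png", ".pdf", ".txt", ".doc", ".jpg", ".gif")


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Skip anchors with a missing or empty href.
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def to_full_url(page_url: str, href: str) -> Optional[str]:
    """Resolve href against the page it was found on; None if unusable."""
    url = urljoin(page_url, href.strip())
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # e.g. mailto: links do not match a web URL pattern
    if len(url) > MAX_URL_LENGTH:
        return None
    return url


def valid_links(page_url: str, html: str, seed_url: str, visited: Set[str]) -> List[str]:
    """Return the links on a page that pass the validity checks."""
    parser = AnchorParser()
    parser.feed(html)
    seed_host = urlparse(seed_url).netloc
    result: List[str] = []
    for href in parser.hrefs:
        url = to_full_url(page_url, href)
        if url is None or url in visited or url in result:
            continue
        parts = urlparse(url)
        if parts.path.lower().endswith(IGNORE_FILETYPES):
            continue  # ignored file type
        if parts.netloc != seed_host:
            continue  # external link
        result.append(url)
    return result
```

A robots.txt check could be layered on top of this, for example with urllib.robotparser from the standard library.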

Design

This project implements a Basic Universal Crawler based on breadth-first search graph traversal.
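As a rough sketch of breadth-first traversal applied to crawling (not the project's actual implementation), the function below visits pages level by level using a FIFO queue. The crawl function and its get_links callback are hypothetical; the callback stands in for fetching a page and extracting its valid links, so the sketch runs without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_urls: int = 10000,  # documented default of E6C_MAX_URLS
) -> Dict[str, List[str]]:
    """Breadth-first crawl; returns a mapping of page -> links found on it."""
    sitemap: Dict[str, List[str]] = {}
    visited = {seed}
    queue = deque([seed])
    while queue and len(sitemap) < max_urls:
        page = queue.popleft()  # FIFO order gives breadth-first traversal
        children: List[str] = []
        for link in get_links(page):
            if link not in children:
                children.append(link)
            if link not in visited:
                visited.add(link)
                queue.append(link)
        sitemap[page] = children
    return sitemap


# Usage with an in-memory "site" instead of real HTTP requests.
site = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": []}
for page, links in crawl("/", lambda url: site.get(url, [])).items():
    print(page, "->", ", ".join(links))
```

The returned mapping records which pages link to which, which is the information a simple textual sitemap needs.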

Configuration

The behaviour of the application can be configured via environment variables.

Environment Variable | Description | Type | Default Value
E6C_LOG_LEVEL | Level of logging (overrides the verbose/quiet flag) | string | -
E6C_LOG_DIR | Directory to save logs | string | -
E6C_BIN_DIR | Directory to save any output (bin) | string | bin
E6C_IGNORE_FILETYPES | File types of websites to ignore (e.g. ".filetype1,.filetype2") | string | ".png,.pdf,.txt,.doc,.jpg,.gif"
E6C_URL_REQUEST_TIMER | Time (in seconds) to wait per request (to avoid flooding the server with requests) | float | 0.1
E6C_MAX_URLS | The maximum number of URLs to fetch/crawl | integer | 10000
E6C_MAX_URL_LENGTH | The maximum length (character count) of a URL to fetch/crawl | integer | 300
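To illustrate how these variables and their defaults might be read, here is a minimal sketch. The Settings class and load_settings function are hypothetical and are not taken from the project; only the variable names, types and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    log_level: Optional[str]
    log_dir: Optional[str]
    bin_dir: str
    ignore_filetypes: Tuple[str, ...]
    url_request_timer: float
    max_urls: int
    max_url_length: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the E6C_* variables, falling back to the documented defaults."""
    filetypes = env.get("E6C_IGNORE_FILETYPES", ".png,.pdf,.txt,.doc,.jpg,.gif")
    return Settings(
        log_level=env.get("E6C_LOG_LEVEL"),  # no default documented
        log_dir=env.get("E6C_LOG_DIR"),  # no default documented
        bin_dir=env.get("E6C_BIN_DIR", "bin"),
        # Split the comma-separated list and drop empty entries.
        ignore_filetypes=tuple(t.strip() for t in filetypes.split(",") if t.strip()),
        url_request_timer=float(env.get("E6C_URL_REQUEST_TIMER", "0.1")),
        max_urls=int(env.get("E6C_MAX_URLS", "10000")),
        max_url_length=int(env.get("E6C_MAX_URL_LENGTH", "300")),
    )
```

Passing a plain dictionary instead of os.environ makes the function easy to exercise in tests, e.g. load_settings({"E6C_MAX_URLS": "50"}).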

Development

Configure your local development

  • Clone repo on your local machine
  • Install conda or miniconda
  • Create your local project environment (based on conda, poetry, pre-commit):
    $ make env
  • (Optional) Update existing local project environment:
    $ make env-update

Run locally

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • Run the CLI using poetry:
    $ ekrhizoc

Contribute

[ Not Available ]

Testing

(part of CI/CD)

[ Work in progress... ]

To run the tests, open a terminal and run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To run pytest:
    $ make test
  • To check test coverage:
    $ make test-coverage

Versioning

Increment the version number:
$ poetry version {bump rule}
where valid bump rules are:

  1. patch
  2. minor
  3. major
  4. prepatch
  5. preminor
  6. premajor
  7. prerelease
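For orientation, the first three bump rules follow the usual semantic versioning arithmetic, sketched below. The bump function is a hypothetical illustration and not Poetry's own code; it assumes a plain major.minor.patch version string and does not cover the pre-release rules.

```python
def bump(version: str, rule: str) -> str:
    """Apply a patch/minor/major bump to a 'major.minor.patch' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if rule == "patch":
        return f"{major}.{minor}.{patch + 1}"
    if rule == "minor":
        return f"{major}.{minor + 1}.0"  # patch resets to 0
    if rule == "major":
        return f"{major + 1}.0.0"  # minor and patch reset to 0
    raise ValueError(f"unsupported bump rule: {rule}")


print(bump("0.1.2", "patch"))  # 0.1.3
print(bump("0.1.2", "minor"))  # 0.2.0
print(bump("0.1.2", "major"))  # 1.0.0
```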

Changelog

Use CHANGELOG.md to track the evolution of this package.
The [UNRELEASED] tag at the top of the file should always be there to log the work until a release occurs.

Work should be logged under one of the following subtitles:

  • Added
  • Changed
  • Fixed
  • Removed

On a release, a version heading in the following format should be added above all the current unreleased changes in the file:
## [major.minor.patch] - YYYY-MM-DD

Deployment

Pip package

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To build pip package:
    $ make build-package
  • To publish pip package (requires credentials to PyPI):
    $ make publish-package

Docker image

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To build docker image:
    $ make build-docker

Production

For production, a Docker image is used. This image is published publicly on Docker Hub.

  • First pull the image from Docker Hub:
    $ docker pull nichelia/ekrhizoc:{version}
  • Execute CLI via docker run:
    $ docker run --rm -it -v ~/ekrhizoc_bin:/tmp/bin nichelia/ekrhizoc:{version} {command}
    This command mounts the application's bin (output) directory to the ekrhizoc_bin folder in the user's home directory.

Here, {version} is the published application version.
