
A simple Python web crawler

Project description


ekrhizoc

ekrhizoc (E6c): A web crawler

Contents

  1. Definition
  2. Use Case
  3. Configuration
  4. Development
  5. Testing
  6. Versioning
  7. Deployment
  8. Production

Definition

εκρίζωση (Greek) ekrízosi / uprooting, eradication

Also known as E6c.

Use Case

Implementation of a simple Python web crawler.
Input: URL (seed).
Output: Simple textual sitemap (to show links between pages).

Requirements

  • The crawler is limited to one subdomain (exclude external links).
  • No use of web crawling libraries/frameworks (e.g. scrapy).
  • (Optional) HTML-handling libraries/frameworks may be used.
  • Production-ready code.

Assumptions

  • Each run accepts exactly one input URL (seed).
  • The targeted URL(s) are static pages (no JavaScript rendering required).
  • Links are extracted from HTML anchor <a> elements.
  • A link is valid when
    • It is a valid URL
      • Non-empty
      • Matches a valid URL pattern
      • Does not exceed E6C_MAX_URL_LENGTH characters
      • Can be converted from a relative URL to a full URL
    • It has not been visited before
    • It does not point to an ignored file type
    • It has the same domain as the seed URL
    • It is not restricted by the robots.txt file
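
The validity rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names `normalise` and `is_valid_link` are hypothetical, the constants mirror the documented defaults, and dropping URL fragments is an added assumption.

```python
from typing import Optional, Set
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser

MAX_URL_LENGTH = 300  # documented default of E6C_MAX_URL_LENGTH
IGNORE_FILETYPES = (".png", ".pdf", ".txt", ".doc", ".jpg", ".gif")


def normalise(page_url: str, href: str) -> Optional[str]:
    """Turn a (possibly relative) href into a full http(s) URL, or None."""
    if not href or not href.strip():
        return None  # empty link
    # Resolve against the page the link was found on; drop "#fragment"
    # (an assumption of this sketch, not stated in the requirements).
    absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # e.g. mailto: or javascript: links
    return absolute


def is_valid_link(
    url: Optional[str],
    seed_url: str,
    visited: Set[str],
    robots: Optional[RobotFileParser] = None,
) -> bool:
    """Apply the checks listed above to an already normalised URL."""
    if url is None or len(url) > MAX_URL_LENGTH:
        return False
    if url in visited:
        return False
    if urlparse(url).path.lower().endswith(IGNORE_FILETYPES):
        return False
    if urlparse(url).netloc != urlparse(seed_url).netloc:
        return False
    # robots.txt rules are optional here so the sketch runs offline.
    if robots is not None and not robots.can_fetch("*", url):
        return False
    return True
```

A caller would typically run `normalise` on each extracted `href` and pass the result to `is_valid_link`, adding accepted URLs to `visited`.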

Design

This project implements a basic universal crawler based on breadth-first search (BFS) graph traversal: pages are nodes, links are edges, and pages closest to the seed are visited first.
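
To make the traversal concrete, here is a small sketch of a breadth-first crawl over an in-memory site. It is an illustration under stated assumptions rather than the package's implementation: `AnchorParser`, `crawl`, and the `PAGES` dictionary are hypothetical names, and the `fetch` callable stands in for a real HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin, urlparse


class AnchorParser(HTMLParser):
    """Collect href values from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(
    seed: str, fetch: Callable[[str], str], max_urls: int = 10000
) -> Dict[str, List[str]]:
    """Breadth-first crawl from seed; maps each page to its same-domain links."""
    domain = urlparse(seed).netloc
    sitemap: Dict[str, List[str]] = {}
    seen = {seed}
    queue = deque([seed])  # FIFO queue gives breadth-first order
    while queue and len(sitemap) < max_urls:
        page = queue.popleft()
        parser = AnchorParser()
        parser.feed(fetch(page))
        links: List[str] = []
        for href in parser.hrefs:
            url = urljoin(page, href)
            if urlparse(url).netloc != domain or url in links:
                continue  # external link or duplicate on this page
            links.append(url)
            if url not in seen:
                seen.add(url)
                queue.append(url)
        sitemap[page] = links
    return sitemap


# Hypothetical in-memory site standing in for HTTP responses.
PAGES = {
    "https://example.com/": (
        '<a href="/a">A</a> <a href="/b">B</a> '
        '<a href="https://other.org/">External</a>'
    ),
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

sitemap = crawl("https://example.com/", lambda url: PAGES.get(url, ""))
for page, links in sitemap.items():
    print(page)
    for link in links:
        print("  -> " + link)
```

The printed listing is one possible shape for a textual sitemap; the format the package actually writes is not specified here.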

Configuration

The behaviour of the application can be configured via environment variables.

| Environment Variable | Description | Type | Default Value |
| --- | --- | --- | --- |
| E6C_LOG_LEVEL | Level of logging (overrides the verbose/quiet flag) | string | - |
| E6C_LOG_DIR | Directory to save logs | string | - |
| E6C_BIN_DIR | Directory to save any output (bin) | string | bin |
| E6C_IGNORE_FILETYPES | File types to ignore when crawling (e.g. ".filetype1,.filetype2") | string | ".png,.pdf,.txt,.doc,.jpg,.gif" |
| E6C_URL_REQUEST_TIMER | Time (in seconds) to wait between requests, to avoid flooding the server | float | 0.1 |
| E6C_MAX_URLS | The maximum number of URLs to fetch/crawl | integer | 10000 |
| E6C_MAX_URL_LENGTH | The maximum length (character count) of a URL to fetch/crawl | integer | 300 |
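
As an illustration of how these variables and their defaults could be read, the following sketch parses them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical names introduced for this example; only the variable names, types, and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    log_level: Optional[str]
    log_dir: Optional[str]
    bin_dir: str
    ignore_filetypes: Tuple[str, ...]
    url_request_timer: float
    max_urls: int
    max_url_length: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read E6C_* variables, falling back to the documented defaults."""
    raw_types = env.get("E6C_IGNORE_FILETYPES", ".png,.pdf,.txt,.doc,.jpg,.gif")
    return Settings(
        log_level=env.get("E6C_LOG_LEVEL"),  # no default documented
        log_dir=env.get("E6C_LOG_DIR"),  # no default documented
        bin_dir=env.get("E6C_BIN_DIR", "bin"),
        # Split the comma-separated list and drop empty entries.
        ignore_filetypes=tuple(
            t.strip().lower() for t in raw_types.split(",") if t.strip()
        ),
        url_request_timer=float(env.get("E6C_URL_REQUEST_TIMER", "0.1")),
        max_urls=int(env.get("E6C_MAX_URLS", "10000")),
        max_url_length=int(env.get("E6C_MAX_URL_LENGTH", "300")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"E6C_MAX_URLS": "50"})`.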

Development

Configure your local development

  • Clone repo on your local machine
  • Install conda or miniconda
  • Create your local project environment (based on conda, poetry, pre-commit):
    $ make env
  • (Optional) Update existing local project environment:
    $ make env-update

Run locally

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • Run the CLI using poetry:
    $ ekrhizoc

Contribute

[ Not Available ]

Testing

(part of CI/CD)

[ Work in progress... ]

To run the tests, open a terminal and run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To run pytest:
    $ make test
  • To check test coverage:
    $ make test-coverage

Versioning

Increment the version number:
$ poetry version {bump rule}
where valid bump rules are:

  1. patch
  2. minor
  3. major
  4. prepatch
  5. preminor
  6. premajor
  7. prerelease

Changelog

Use CHANGELOG.md to track the evolution of this package.
The [UNRELEASED] tag at the top of the file should always be there to log the work until a release occurs.

Work should be logged under one of the following subtitles:

  • Added
  • Changed
  • Fixed
  • Removed

On a release, move all current unreleased changes under a new version heading of the following format:
## [major.minor.patch] - YYYY-MM-DD

Deployment

Pip package

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To build pip package:
    $ make build-package
  • To publish pip package (requires PyPI credentials):
    $ make publish-package

Docker image

In a terminal, run the following from the project's root directory:

  • Activate project environment:
    $ . ./scripts/helpers/environment.sh
  • To build docker image:
    $ make build-docker

Production

For production, a Docker image is used. This image is published publicly on Docker Hub.

  • First, pull the image from Docker Hub:
    $ docker pull nichelia/ekrhizoc:{version}
  • Execute the CLI via docker run:
    $ docker run --rm -it -v ~/ekrhizoc_bin:/tmp/bin nichelia/ekrhizoc:{version} {command}
    This command mounts the application's output directory (/tmp/bin inside the container) to the ekrhizoc_bin folder in the user's home directory.

where {version} is the published application version and {command} is the CLI command to run.
