A simple python web crawler
Project description
ekrhizoc
ekrhizoc (E6c): A web crawler
Contents
Definition
εκρίζωση (Greek) ekrízosi / uprooting, eradication
Also known as E6c.
Use Case
Implementation of a simple python web crawler.
Input: URL (seed).
Output: simple textual sitemap (to show links between pages).
Requirements
- The crawler is limited to one subdomain (exclude external links).
- No use of web crawling libraries/frameworks (e.g. scrapy).
- (Optional) Use of HTML handling Libraries/Frameworks.
- Production-ready code.
Assumptions
- The input URL (seed) is limited to only one at every run.
- The targeted URL(s) are static pages (no backend javascript parsing required).
- Links to be extracted from HTML anchor
<a>
elements. - Valid links include
- Valid URL
- Non empty
- Matches a valid url pattern
- Does not exceed the
E6C_MAX_URL_LENGTH
length in characters - Possible to convert a relative urls to a full url
- Link is not visited before
- Link is not part of an ignored file type
- Link has the same domain as the seed url
- Link is not restricted by the robots.txt file
- Valid URL
Design
This project implements a Basic Universal Crawler based on breadth first search graph traversal.
Configuration
Behaviour of the application can be configured via Environment Variables.
Environment Variable | Description | Type | Default Value |
---|---|---|---|
E6C_LOG_LEVEL |
Level of logging - overrides verbose/quiet flag | string | - |
E6C_LOG_DIR |
Directory to save logs | string | - |
E6C_BIN_DIR |
Directory to save any output (bin) | string | bin |
E6C_IGNORE_FILETYPES |
File types of websites to ignore (e.g. ".filetype1,.filetype2") | string | ".png,.pdf,.txt,.doc,.jpg,.gif" |
E6C_URL_REQUEST_TIMER |
Time (in seconds) to wait per request (not to populate server with multiple requests) | float | 0.1 |
E6C_MAX_URLS |
The maximum number of urls to fetch/crawl | integer | 10000 |
E6C_MAX_URL_LENGTH |
The maximum length (character count) of a url to fetch/crawl | integer | 300 |
Development
Configure for local development
- Clone repo on your local machine
- Install
conda
orminiconda
- Create symlink for githooks (based on
isort
,black
):
$ make git-hooks
- Create your local project environment (based on
conda
,poetry
):
$ make env
- (Optional) Update existing local project environment:
$ make env-update
Run locally
On a terminal, run the following (execute on project's root directory):
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- Run the CLI using
poetry
:
$ poetry run ekrhizoc
Contribute
[ Not Available ]
Testing
[ Work in progress... ]
To run the tests, open a terminal and run the following (execute on project's root directory):
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To run pytest:
$ pytest -v
- To check test coverage:
$ pytest --cov=. --cov-report=term-missing
Versioning
Increment the version number:
$ poetry version {bump rule}
where valid bump rules are:
- patch
- minor
- major
- prepatch
- preminor
- premajor
- prerelease
Changelog
Use CHANGELOG.md
to track the evolution of this package.
The [UNRELEASED]
tag at the top of the file should always be there to log the work until a release occurs.
Work should be logged under one of the following subtitles:
- Added
- Changed
- Fixed
- Removed
On a release, a version of the following format should be added to all the current unreleased changes in the file.
## [major.minor.patch] - YYYY-MM-DD
Deployment
Pip package
On a terminal, run the following (execute on project's root directory):
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To build pip package:
$ make build-package
- To publish pip package (requires credentials to PyPi):
$ make publish-package
Docker image
On a terminal, run the following (execute on project's root directory):
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To build docker image:
$ make build-docker
Production
For production, a Docker image is used. This image is published publicly on docker hub.
- First pull image from docker hub:
$ docker pull nichelia/ekrhizoc:[version]
- First pull image from docker hub:
$ docker run --rm -it -v ~/ekrhizoc_bin:/usr/src/bin nichelia/ekrhizoc:[version]
This command mounts the application's bin (outcome) to user's root directory under ekrhizoc_bin folder.
where version is the published application version (e.g. 0.1.0)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.