A high-level web scraping framework

Okami

Okami is a high-level web scraping framework built for Python 3.6+. It uses the asynchronous model provided by the standard library asyncio module, with aiohttp as the networking layer and lxml for parsing data.
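
To illustrate the asynchronous model, here is a minimal sketch of fetching and parsing several pages concurrently with asyncio. It is not Okami's API. To stay within the standard library, a stub coroutine stands in for the aiohttp request and html.parser stands in for lxml; PAGES, TitleParser, fetch and scrape are hypothetical names introduced only for this example.

```python
import asyncio
from html.parser import HTMLParser

# Hypothetical in-memory "site"; a real spider would fetch these over HTTP.
PAGES = {
    "/": "<html><head><title>Home</title></head></html>",
    "/about": "<html><head><title>About</title></head></html>",
}


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


async def fetch(path):
    # Stand-in for a network request; sleep(0) yields to the event loop.
    await asyncio.sleep(0)
    return PAGES[path]


async def scrape(path):
    parser = TitleParser()
    parser.feed(await fetch(path))
    return path, parser.title


async def main():
    # Run all page coroutines concurrently; results keep the input order.
    return await asyncio.gather(*(scrape(path) for path in PAGES))


# new_event_loop/run_until_complete also works on Python 3.6.
loop = asyncio.new_event_loop()
results = loop.run_until_complete(main())
loop.close()
```

Because every page is handled by its own coroutine, a slow response from one page does not block processing of the others.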

The architecture is entirely modular, and the main components can be swapped out and replaced with custom implementations.
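
As a rough sketch of what swappable components can look like in general, the example below lets a caller replace a default component with a custom class that exposes the same method. Engine, StaticDownloader and UppercaseDownloader are hypothetical names, not Okami classes; how Okami actually registers components is described in its documentation.

```python
class StaticDownloader:
    """Default component: returns canned content instead of doing real I/O."""

    def download(self, url):
        return "content of " + url


class UppercaseDownloader:
    """Custom replacement exposing the same download() method."""

    def download(self, url):
        return ("content of " + url).upper()


class Engine:
    # Hypothetical defaults: component name -> class to instantiate.
    DEFAULTS = {"downloader": StaticDownloader}

    def __init__(self, **overrides):
        # Any keyword argument replaces the default class of the same name.
        components = dict(self.DEFAULTS, **overrides)
        self.downloader = components["downloader"]()

    def process(self, url):
        return self.downloader.download(url)


default_engine = Engine()
custom_engine = Engine(downloader=UppercaseDownloader)
```

The engine depends only on the download() method, so any class providing it can be dropped in without changing the engine itself.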

Features

  • complete website-wide page processing
  • full scraping mode, or delta mode that scrapes only unvisited pages
  • immediate, on-demand or real-time page processing over HTTP API
  • single page processing via command line
  • lots of pipelines, middlewares and signals
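
Two of the ideas in the list above, delta mode and pipelines, can be sketched in a few lines. This is an illustration under assumptions, not Okami's implementation: select_pages, run_pipelines, strip_title and drop_untitled are hypothetical helpers, and the set of visited pages is assumed to be available in memory.

```python
def select_pages(discovered, visited, delta=False):
    """Return the pages to process, preserving discovery order."""
    if not delta:
        # Full mode: process every discovered page.
        return list(discovered)
    # Delta mode: skip pages that were already visited.
    return [url for url in discovered if url not in visited]


def run_pipelines(item, pipelines):
    """Pass an item through each pipeline in turn; None drops the item."""
    for pipeline in pipelines:
        item = pipeline(item)
        if item is None:
            return None
    return item


def strip_title(item):
    # Return a copy of the item with surrounding whitespace removed.
    return dict(item, title=item["title"].strip())


def drop_untitled(item):
    # Discard items whose title is empty.
    return item if item["title"] else None


discovered = ["/", "/a", "/b"]
visited = {"/", "/a"}
full_run = select_pages(discovered, visited)
delta_run = select_pages(discovered, visited, delta=True)
cleaned = run_pipelines({"title": " Home "}, [strip_title, drop_untitled])
dropped = run_pipelines({"title": "   "}, [strip_title, drop_untitled])
```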

Spiders are very simple implementations. Take a look at an example here.

Quick start

  • Install okami

    • pip install okami
  • Run example web server

    • OKAMI_SETTINGS=okami.cfg.example okami example server

Open localhost:8000 and browse around a little. Quite a remarkable website. We will run our example spider against this website shortly and process a few items.

  • Run example spider

    • OKAMI_SETTINGS=okami.cfg.example okami example spider

The example spider has now started, and you can see it processing pages. Take a look at an example spider implementation here.

Documentation

Read the rest of the documentation here.

License

Okami is licensed under a three-clause BSD License. The full license text can be found here.

Download files

  • okami-0.2.0-py2.py3-none-any.whl (25.1 kB): wheel, Python version py2.py3, uploaded Aug 18, 2018
  • okami-0.2.0.tar.gz (20.5 kB): source distribution, uploaded Aug 18, 2018
