
Multiprocessing application to download and analyze the content of HTML pages.


This is the documentation for Shaman, a multiprocessing application that runs a chain of individual handlers against a single message.

The initial purpose was to create a tool that:

- makes it possible to download and analyze the content of HTML pages
- is simple enough that new functionality is easy to add
- is scalable (multiprocessing)

Actual usage can differ from this. Some spontaneous ideas:

- scanning a mongo collection and parsing documents in parallel
- parsing a lot of lines from multiple huge files, saving the results to any database (depending on the results)

There are three parts in the shaman library:

* stages (the actual processors, each of which implements some piece of functionality)
* consumer (a worker that runs all of the stages in a particular order)
* daemon (runs as many workers as needed; also used as the CLI tool)

All stages are run in a particular order and share the same message object (within one worker).
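
The relationship between stages and a worker can be pictured with a minimal sketch. All names in it (``read_url``, ``guess_scheme``, ``report``, ``run_stages``) and the use of a plain dict as the message are hypothetical and are not taken from the shaman source; the sketch only illustrates stages running in a fixed order against one shared message object.

```python
# Illustrative sketch only: these names are hypothetical, not shaman's API.


def read_url(message):
    # First stage: normalise the raw input line into a "url" field.
    message["url"] = message["raw"].strip()


def guess_scheme(message):
    # Second stage: relies on the field written by the previous stage.
    message["scheme"] = message["url"].split("://", 1)[0]


def report(message):
    # Last stage: summarise what the earlier stages stored on the message.
    message["report"] = "{} ({})".format(message["url"], message["scheme"])


def run_stages(stages, message):
    """Run every stage in order; all of them mutate the same message."""
    for stage in stages:
        stage(message)
    return message


result = run_stages(
    [read_url, guess_scheme, report],
    {"raw": " https://example.com \n"},
)
print(result["report"])
```

Because each stage only reads and writes fields on the message, adding new functionality in this picture amounts to writing one more function and placing it in the list.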

Run the command::

    pip install shaman

If everything is OK, you should be able to run::

    shaman --help

It should display::

    usage: shaman [-h] [-i | -d] -c CONFIGURATION [--drop_first DROP_FIRST]
                  [--ignore_after IGNORE_AFTER]
                  [{stop,start,restart,} [{stop,start,restart,} ...]]

    Main shaman module. Use it to start|stop|restart daemon or start non-daemon
    modes of shaman

    positional arguments:
      Command to daemon (default: )

    optional arguments:
      -h, --help            show this help message and exit
      -i                    Use stdin input as main input (default: False)
      -d                    Daemonize main process (default: False)
      -c CONFIGURATION      Path to configuration file (default: None)
      --drop_first DROP_FIRST
                            drop first lines (default: 0)
      --ignore_after IGNORE_AFTER


You can find an example configuration file at <path_to_python_lib>/site-packages/shaman/etc/crawler.config.
It includes 4 stages:

- reading from stdin
- downloading the page
- detecting the charset
- printing the URL and charset
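
As an illustration of what a charset-detection stage might do, here is a minimal sketch using only the standard library. The function ``detect_charset``, its parameters, and the fallback to ``utf-8`` are assumptions made for this example; they are not the code shipped with shaman.

```python
import re

# Hypothetical helper, not shaman's code: one way a charset stage could work.
HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(content_type, body, default="utf-8"):
    """Prefer the Content-Type header, then a <meta> tag, then a default."""
    match = HEADER_CHARSET.search(content_type or "")
    if match:
        return match.group(1).lower()
    # Fall back to a charset declared inside the HTML itself.
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(detect_charset("text/html; charset=ISO-8859-1", b""))
print(detect_charset("text/html", b'<meta charset="windows-1251">'))
```

The header is checked first because it is the cheapest to read; the ``<meta>`` lookup only runs when the header names no charset.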

By default, all stages reside in the <path_to_python_lib>/site-packages/shaman/src/analyzers/ folder.
You can create a custom stage and put it into a custom folder.
The configuration file has a parameter for this::

    custom_stage_dir = <custom_folder>

Shaman will also "see" any stages you put into this folder.
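
How shaman actually discovers stages in that folder is not described here. The sketch below shows one common way to pick up Python modules from a directory with ``importlib``. The function ``load_stage_modules`` and the rule of skipping files that start with an underscore are assumptions for illustration, not shaman's implementation.

```python
import importlib.util
import os


def load_stage_modules(stage_dir):
    """Import every .py file in stage_dir and return the modules by name.

    Hypothetical helper: it illustrates directory-based discovery only.
    """
    modules = {}
    for filename in sorted(os.listdir(stage_dir)):
        # Ignore non-Python files and private modules such as __init__.py.
        if not filename.endswith(".py") or filename.startswith("_"):
            continue
        name = filename[:-3]
        path = os.path.join(stage_dir, filename)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules[name] = module
    return modules
```

Calling ``load_stage_modules("<custom_folder>")`` would return a dictionary that maps each module name to the imported module, which a worker could then look up by the stage names listed in its configuration.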

To check that everything is working, run::

    echo "" | shaman -c <path_to_config> -i

More information about the package::
