
A multiprocessing application for downloading and analyzing the content of HTML pages.

Project description

This is the documentation for Shaman, a multiprocessing application that runs a chain of individual handlers against a single message.

The initial purpose was to create a tool that:
- makes it possible to download and analyze the content of HTML pages.
- is simple enough that new functionality can be added easily.
- is scalable (multiprocessing).
Actual usage may differ from this. Here are some spontaneous ideas:
- scanning a mongo collection and parsing documents in parallel
- parsing a lot of lines from multiple huge files, saving the results to any database (depending on the results)
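
The second idea can be illustrated with a minimal sketch that uses only the Python 3 standard library. It does not use shaman itself: `parse_line` and `parse_in_parallel` are hypothetical names, and the `key=value` line format is an assumption made for the example.

```python
import multiprocessing


def parse_line(line):
    """Hypothetical handler: split a 'key=value' line into a (key, value) tuple."""
    key, _, value = line.strip().partition("=")
    return key, value


def parse_in_parallel(lines, workers=2):
    """Distribute lines across worker processes and return the results in input order."""
    with multiprocessing.Pool(processes=workers) as pool:
        # chunksize batches lines so that each task carries more than one line.
        return pool.map(parse_line, lines, chunksize=1000)


if __name__ == "__main__":
    sample = ["a=1\n", "b=2\n", "c=3\n"]
    for key, value in parse_in_parallel(sample):
        print(key, value)
```

For huge files, an open file object can be passed as `lines`, although `pool.map` will still read the whole input before dispatching; `pool.imap` is the lazier alternative.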

There are three parts in the shaman library:
* stages (the actual processors, each of which implements some functionality)
* consumer (a worker that runs all the stages in a particular order)
* daemon (runs as many workers as needed; also used as a CLI instrument)
Within one worker, all stages are run in a particular order and share the same message object.
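
The relationship between stages and a consumer can be sketched roughly as follows. This is an illustration of the idea only, not shaman's actual API: the `Stage` and `Consumer` classes, the `process` and `handle` methods, and the message fields are all hypothetical.

```python
class Stage(object):
    """Hypothetical base class: a stage reads and updates the shared message."""

    def process(self, message):
        raise NotImplementedError


class StripUrl(Stage):
    def process(self, message):
        message["url"] = message["raw"].strip()


class DetectScheme(Stage):
    def process(self, message):
        # Relies on the "url" field written by the previous stage.
        message["scheme"] = message["url"].split("://", 1)[0]


class Consumer(object):
    """Hypothetical worker: runs every stage, in order, on one message."""

    def __init__(self, stages):
        self.stages = list(stages)

    def handle(self, raw):
        message = {"raw": raw}
        for stage in self.stages:
            stage.process(message)
        return message


consumer = Consumer([StripUrl(), DetectScheme()])
result = consumer.handle("http://google.ru\n")
print(result["url"], result["scheme"])
```

Because every stage works on the same message, the order matters: `DetectScheme` would fail if it ran before `StripUrl`.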

INSTALLATION:
pip install shaman

If everything is ok, you should be able to run:
shaman --help

It should display:

usage: shaman [-h] [-i | -d] -c CONFIGURATION [--drop_first DROP_FIRST]
              [-p PRINT_FIELDS [PRINT_FIELDS ...]]
              [-r REMOVE_FIELDS [REMOVE_FIELDS ...]]
              [--ignore_after IGNORE_AFTER]
              [{stop,start,restart,} [{stop,start,restart,} ...]]

Main shaman module. Use it to start|stop|restart daemon or start non-daemon
modes of shaman

positional arguments:
  {stop,start,restart,}
                        Command to daemon (default: )

optional arguments:
  -h, --help            show this help message and exit
  -i                    Use stdin input as main input (default: False)
  -d                    Daemonize main process (default: False)
  -c CONFIGURATION      Path to configuration file (default: None)
  --drop_first DROP_FIRST
                        drop first lines (default: 0)
  -p PRINT_FIELDS [PRINT_FIELDS ...], --print_fields PRINT_FIELDS [PRINT_FIELDS ...]
  -r REMOVE_FIELDS [REMOVE_FIELDS ...], --remove_fields REMOVE_FIELDS [REMOVE_FIELDS ...]
  --ignore_after IGNORE_AFTER

CONFIGURATION:

You may find an example configuration file in <path_to_python_lib>/site-packages/shaman/etc/crawler.config
It includes 4 stages:

- reading from stdin
- downloading the page
- detecting the charset
- printing the URL and charset
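
To give a feel for what a charset-detection stage might do, here is a minimal sketch using only the standard library. It is not the code of shaman's own stage: `detect_charset` is a hypothetical function, and checking the Content-Type header first and a `<meta>` tag second is an assumed strategy.

```python
import re

# Both patterns accept an optional quote before the charset name.
HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(body, content_type=""):
    """Return the charset from the Content-Type header, else from a <meta> tag, else None.

    body is the raw page as bytes; content_type is the header value as text.
    """
    match = HEADER_CHARSET.search(content_type)
    if match:
        return match.group(1).lower()
    # Only the beginning of the page is scanned for a <meta> declaration.
    match = META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


page = b'<html><head><meta charset="UTF-8"></head><body></body></html>'
print(detect_charset(page))
```

A production stage would probably also fall back to statistical detection when neither source declares a charset.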

By default, all stages reside in the <path_to_python_lib>/site-packages/shaman/src/analyzers/ folder.
You may create a custom stage and put it into a custom folder.
The configuration file has a parameter for this:

custom_stage_dir = <custom_folder>

If you put some stages into this folder, shaman will also "see" them.
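
How a tool might "see" modules in such a folder can be sketched with the standard library's importlib. This shows the general technique only and is an assumption about the mechanism, not shaman's real loader: `load_stage_modules` and the `my_stage.py` file are hypothetical.

```python
import importlib.util
import os
import tempfile


def load_stage_modules(stage_dir):
    """Import every public .py file in stage_dir and return the modules keyed by name."""
    modules = {}
    for filename in sorted(os.listdir(stage_dir)):
        if not filename.endswith(".py") or filename.startswith("_"):
            continue
        name = filename[:-3]
        path = os.path.join(stage_dir, filename)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules[name] = module
    return modules


# Demonstration with a temporary folder standing in for <custom_folder>.
with tempfile.TemporaryDirectory() as stage_dir:
    with open(os.path.join(stage_dir, "my_stage.py"), "w") as handle:
        handle.write("NAME = 'my_stage'\n")
    loaded = load_stage_modules(stage_dir)
    print(sorted(loaded))
```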

To check that everything is working, run:

echo "http://google.ru" | shaman -c <path_to_config> -i

More information about the package: `here
<http://shaman.readthedocs.io/en/latest/>`_.
Github: `<https://github.com/Landish145/shaman>`_.

Download files

Source Distribution

shaman-0.0.6.dev1.zip (134.0 kB)

Built Distribution

shaman-0.0.6.dev1-py2.py3-none-any.whl (130.4 kB, Python 2 and Python 3)
