

This is the documentation for Shaman: a multiprocessing application that combines separate, single-purpose handlers to run against one message.

The initial purpose was to create a tool that:

- makes it possible to download and analyze the content of HTML pages;
- is simple enough to add new functionality to;
- scales through multiprocessing.

Actual usage can go beyond that. Some other ideas:

- scanning a MongoDB collection and parsing its documents in parallel
- parsing a lot of lines from multiple huge files and saving the results to a database (depending on the results)
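As a rough illustration of the second idea, independent of shaman itself, a ``multiprocessing.Pool`` can fan line parsing out across worker processes (the ``key=value`` format and the function names here are made up for the example):

```python
import multiprocessing


def parse_line(line):
    """Toy parser: split a 'key=value' line into a (key, value) tuple."""
    key, _, value = line.strip().partition("=")
    return (key, value)


def parse_lines(lines, workers=4):
    """Parse many lines in parallel worker processes, keeping input order."""
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(parse_line, lines)


if __name__ == "__main__":
    lines = ["host=example.com", "port=8080", "scheme=http"]
    print(parse_lines(lines))  # [('host', 'example.com'), ('port', '8080'), ('scheme', 'http')]
```

In a real setup the lines would be read from the files and the results written to a database instead of printed.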

There are three parts in the shaman library:

* stages (the actual processors, each representing one piece of functionality)
* consumer (a worker that runs all stages in a particular order)
* daemon (runs as many workers as needed; also used as the CLI instrument)

Within one worker, all stages run in a particular order and share the same message object.
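The stage/consumer idea can be sketched in plain Python. This is a conceptual illustration only, not shaman's actual API; the ``Stage`` and ``Consumer`` names are invented here:

```python
class Stage:
    """One processor: receives the shared message dict and enriches it."""

    def handle(self, message):
        raise NotImplementedError


class AddGreeting(Stage):
    def handle(self, message):
        message["greeting"] = f"hello, {message['name']}"


class Upcase(Stage):
    def handle(self, message):
        message["greeting"] = message["greeting"].upper()


class Consumer:
    """Worker: runs all stages in order against the same message object."""

    def __init__(self, stages):
        self.stages = stages

    def process(self, message):
        for stage in self.stages:
            stage.handle(message)
        return message


msg = Consumer([AddGreeting(), Upcase()]).process({"name": "shaman"})
print(msg["greeting"])  # HELLO, SHAMAN
```

Because every stage mutates the same message, the order in which the consumer runs them matters: ``Upcase`` only works after ``AddGreeting`` has set the ``greeting`` key.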

INSTALLATION:
----------------------
Run the command::

    pip install shaman

If everything is OK, you should be able to run::

    shaman --help

It should display::

    usage: shaman [-h] [-i | -d] -c CONFIGURATION [--drop_first DROP_FIRST]
                  [-p PRINT_FIELDS [PRINT_FIELDS ...]]
                  [-r REMOVE_FIELDS [REMOVE_FIELDS ...]]
                  [--ignore_after IGNORE_AFTER]
                  [{stop,start,restart,} [{stop,start,restart,} ...]]

    Main shaman module. Use it to start|stop|restart daemon or start non-daemon
    modes of shaman

    positional arguments:
      {stop,start,restart,}
                            Command to daemon (default: )

    optional arguments:
      -h, --help            show this help message and exit
      -i                    Use stdin input as main input (default: False)
      -d                    Daemonize main process (default: False)
      -c CONFIGURATION      Path to configuration file (default: None)
      --drop_first DROP_FIRST
                            drop first lines (default: 0)
      -p PRINT_FIELDS [PRINT_FIELDS ...], --print_fields PRINT_FIELDS [PRINT_FIELDS ...]
      -r REMOVE_FIELDS [REMOVE_FIELDS ...], --remove_fields REMOVE_FIELDS [REMOVE_FIELDS ...]
      --ignore_after IGNORE_AFTER

CONFIGURATION:
---------------------------

You may find an example configuration file in <path_to_python_lib>/site-packages/shaman/etc/crawler.config.
It includes four stages::

    reading from stdin
    downloading the page
    detecting the charset
    printing the url and charset

By default, all stages reside in the <path_to_python_lib>/site-packages/shaman/src/analyzers/ folder.
You may create a custom stage and put it into your own folder.
There is a parameter for this in the configuration file::

    custom_stage_dir = <custom_folder>

If you put stages into this folder, shaman will also "see" them.
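The idea behind a stage directory can be shown with a generic ``importlib`` sketch. This is not shaman's actual loader; the ``load_stages`` function and the ``process`` convention are assumptions made for the example:

```python
import importlib.util
import pathlib
import tempfile


def load_stages(stage_dir):
    """Import every .py file in stage_dir and collect its `process` callable."""
    stages = []
    for path in sorted(pathlib.Path(stage_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if hasattr(module, "process"):
            stages.append(module.process)
    return stages


# Demonstration: write one stage file into a temporary "custom folder".
with tempfile.TemporaryDirectory() as custom_dir:
    stage_src = "def process(message):\n    message['seen'] = True\n"
    (pathlib.Path(custom_dir) / "mark_seen.py").write_text(stage_src)
    message = {}
    for process in load_stages(custom_dir):
        process(message)
    print(message)  # {'seen': True}
```

Sorting the file names gives the stages a deterministic order, which matters when later stages depend on keys that earlier ones set.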

To check that everything works, run::

    echo "http://google.ru" | shaman -c <path_to_config> -i

More information about the package::

    http://shaman.readthedocs.io/en/latest/

GitHub::

    https://github.com/Landish145/shaman
