Multiprocessing application to download and analyze the content of HTML pages.
This is the documentation for Shaman, a multiprocessing application that runs a chain of individual handlers against a single message.

The initial purpose was to create a tool that:

- makes it possible to download and analyze the content of HTML pages;
- is simple enough to extend with new functionality;
- is scalable (multiprocessing).

Actual usage may differ. Some other ideas:

- scanning a MongoDB collection and parsing documents in parallel;
- parsing many lines from multiple huge files and saving the results to any database (depending on the results).
The shaman library has three parts:

* stages (the actual processors, each of which implements some functionality)
* consumer (a worker that runs all the stages in a particular order)
* daemon (runs as many workers as needed; also used as a CLI tool)

Inside one worker, all stages run in a particular order and share the same message object.
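
The relationship between stages, consumer and daemon can be illustrated with a minimal sketch. It is not shaman's actual code: the stage functions, ``consume``, ``run_workers`` and the dict-based message are all names and shapes assumed here for illustration only.

```python
from multiprocessing import Pool


# Hypothetical stages: each one receives the shared message (a dict in this
# sketch), adds or changes fields, and returns it. Not shaman's real API.
def read_url(message):
    message["url"] = message["raw"].strip()
    return message


def detect_scheme(message):
    url = message["url"]
    message["scheme"] = url.split("://", 1)[0] if "://" in url else ""
    return message


def summarize(message):
    message["summary"] = "{} ({})".format(message["url"], message["scheme"])
    return message


# The order of this list is the order in which the stages run.
STAGES = [read_url, detect_scheme, summarize]


def consume(raw_line, stages=STAGES):
    """Consumer role: pass one message through every stage, in order."""
    message = {"raw": raw_line}
    for stage in stages:
        message = stage(message)
    return message


def run_workers(lines, workers=2):
    """Daemon-like role: spread incoming lines over several worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(consume, lines)
```

In a script, ``run_workers`` would typically be called under an ``if __name__ == "__main__":`` guard, for example ``run_workers(["http://google.ru\n"], workers=2)``. Each worker process builds its own message per input line, so stages never share state across workers.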
INSTALLATION:
----------------------
Run the command::

    pip install shaman

If everything is ok, you should be able to run::

    shaman --help

It should display::

    usage: shaman [-h] [-i | -d] -c CONFIGURATION [--drop_first DROP_FIRST]
                  [-p PRINT_FIELDS [PRINT_FIELDS ...]]
                  [-r REMOVE_FIELDS [REMOVE_FIELDS ...]]
                  [--ignore_after IGNORE_AFTER]
                  [{stop,start,restart,} [{stop,start,restart,} ...]]

    Main shaman module. Use it to start|stop|restart daemon or start non-daemon
    modes of shaman

    positional arguments:
      {stop,start,restart,}
                            Command to daemon (default: )

    optional arguments:
      -h, --help            show this help message and exit
      -i                    Use stdin input as main input (default: False)
      -d                    Daemonize main process (default: False)
      -c CONFIGURATION      Path to configuration file (default: None)
      --drop_first DROP_FIRST
                            drop first lines (default: 0)
      -p PRINT_FIELDS [PRINT_FIELDS ...], --print_fields PRINT_FIELDS [PRINT_FIELDS ...]
      -r REMOVE_FIELDS [REMOVE_FIELDS ...], --remove_fields REMOVE_FIELDS [REMOVE_FIELDS ...]
      --ignore_after IGNORE_AFTER

CONFIGURATION:
---------------------------
An example configuration file is located at <path_to_python_lib>/site-packages/shaman/etc/crawler.config

It includes 4 stages::

    reading from stdin
    downloading page
    detecting charset
    printing url, charset

By default, all stages reside in the <path_to_python_lib>/site-packages/shaman/src/analyzers/ folder.
You may create a custom stage and put it into a custom folder.
The configuration file has a parameter for this::

    custom_stage_dir = <custom_folder>

If you put stages into this folder, shaman will also "see" them.
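
How a tool might pick up stages from such a folder can be sketched with the standard ``importlib`` machinery. This is only an illustration of the general idea, not shaman's loader: the ``load_stages`` helper and the convention that each stage file exposes a ``process`` function are assumptions made for this example.

```python
import importlib.util
from pathlib import Path


def load_stages(stage_dir):
    """Import every *.py file in stage_dir and collect its `process` callable.

    The `process` name is a convention invented for this sketch; files
    without such a callable are skipped.
    """
    stages = {}
    for path in sorted(Path(stage_dir).glob("*.py")):
        # Build a module object directly from the file path.
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        func = getattr(module, "process", None)
        if callable(func):
            stages[path.stem] = func
    return stages
```

With this approach, dropping a new file into the folder is enough to make it available by its file name, which matches the "simple enough to extend" goal stated above.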

To check that everything works, run::

    echo "http://google.ru" | shaman -c <path_to_config> -i

More information about the package::

    http://shaman.readthedocs.io/en/latest/

GitHub::

    https://github.com/Landish145/shaman