ACT SCIO
act-scio2
Scio v2 is a reimplementation of Scio in Python3.
Scio uses tika to extract text from documents (PDF, HTML, DOC, etc).
The result is sent to the Scio Analyzer that extracts information using a combination of NLP (Natural Language Processing) and pattern matching.
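The pattern-matching half of that analysis can be illustrated with a minimal sketch. The regular expressions, the `PATTERNS` table, and the `extract_indicators` function below are assumptions made for illustration only. They are not the patterns or the API that Scio itself uses.

```python
import re

# Illustrative patterns only; Scio's own pattern set is not reproduced here.
PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def extract_indicators(text):
    """Return the sorted, de-duplicated matches for each pattern name."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    sample = (
        "Host 192.168.1.10 was exploited via CVE-2017-0144; "
        "dropper md5 a01161af9bbdfb7fe08d7d36128beb48."
    )
    for name, values in extract_indicators(sample).items():
        print(name, values)
```

Pattern matching of this kind works well for rigidly formatted values such as hashes and CVE identifiers. Free-text entities (sectors, locations, threat actor names) are where the NLP step would come in.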
Changelog
0.0.42
SCIO now supports setting the TLP on data upload, to annotate documents with a tlp tag. Documents downloaded by feeds default to TLP white, but this can be changed in the feed configuration.
Source code
The source code for the workers is available on GitHub.
Setup
To set up, first install from PyPI:
sudo pip3 install act-scio
You will also need to install beanstalkd. On debian/ubuntu you can run:
sudo apt install beanstalkd
Configure beanstalk to accept larger payloads with the -z option. For Red Hat-derived setups this can be configured in /etc/sysconfig/beanstalkd:
MAX_JOB_SIZE=-z 524288
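To show why the larger limit matters, here is a minimal sketch that checks whether a document would fit in a single beanstalk job. The function name `fits_in_job` is hypothetical. The 65535-byte figure is beanstalkd's stock default, and the sketch assumes a job body is accepted when it is no larger than the configured limit.

```python
# Limit configured above via "-z 524288" (bytes).
CONFIGURED_MAX_JOB_SIZE = 524288
# beanstalkd's default limit when -z is not given (bytes).
DEFAULT_MAX_JOB_SIZE = 65535


def fits_in_job(payload: bytes, max_job_size: int = CONFIGURED_MAX_JOB_SIZE) -> bool:
    """Return True if the payload is no larger than the job size limit."""
    return len(payload) <= max_job_size


if __name__ == "__main__":
    document = b"x" * 100_000  # a 100 kB document body
    print("fits with default limit:", fits_in_job(document, DEFAULT_MAX_JOB_SIZE))
    print("fits with -z 524288:", fits_in_job(document))
```

A job also carries whatever encoding and metadata the sender wraps around the document, so the usable document size is somewhat below the raw limit.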
You then need to install NLTK data files. A helper utility to do this is included:
scio-nltk-download
You will also need to create a default configuration:
scio-config user
API
To run the api, execute:
scio-api
This will set up the API on 127.0.0.1:3000. Use --port <PORT> and --host <IP> to listen on another port and/or another interface.
For documentation of the API endpoint see API.md.
Configuration
You can create a default configuration using this command (should be run as the user running scio):
scio-config user
Common configuration can be found under ~/.config/scio/etc/scio.ini
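Since the configuration is a plain .ini file, it can be inspected with Python's standard configparser. The following is a minimal sketch. The helper name `load_scio_config` is hypothetical, and no particular section or option names are assumed, because they depend on the generated configuration.

```python
import configparser
from pathlib import Path

# Default location mentioned above.
DEFAULT_CONFIG = Path.home() / ".config" / "scio" / "etc" / "scio.ini"


def load_scio_config(path=None):
    """Parse the ini file and return the ConfigParser object."""
    config_path = Path(path) if path is not None else DEFAULT_CONFIG
    # Interpolation is disabled so that literal '%' characters in values are kept.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(config_path):
        raise FileNotFoundError(f"No configuration found at {config_path}")
    return parser


if __name__ == "__main__":
    try:
        config = load_scio_config()
    except FileNotFoundError as err:
        print(err)
    else:
        # List whatever sections and options the generated file contains.
        for section in config.sections():
            print(section, dict(config[section]))
```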
Running Manually
Scio Tika Server
The Scio Tika server reads jobs from the beanstalk tube scio_doc and sends the extracted text to the tube scio_analyze.
The first time the server runs, it will download tika using maven. It will use a proxy if $https_proxy is set.
scio-tika-server
scio-tika-server uses tika-python, which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If it does not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" here: https://github.com/chrismattmann/tika-python. Currently only tested with tika-server version 2.7.0.
Scio Analyze Server
By default, the Scio Analyze Server reads jobs from the beanstalk tube scio_analyze.
scio-analyze
You can also read directly from stdin like this:
echo "The companies in the Bus; Financial, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=
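The hand-off between the two servers through the scio_doc and scio_analyze tubes can be sketched with in-memory queues standing in for beanstalk. The job layout (`filename`, `raw`, `text`) and the functions `extract_text`, `analyze`, `tika_step`, and `analyze_step` are hypothetical placeholders, not Scio's actual job format or code.

```python
import queue


def make_tubes():
    """In-memory stand-ins for the two beanstalk tubes."""
    return {"scio_doc": queue.Queue(), "scio_analyze": queue.Queue()}


def extract_text(raw: bytes) -> str:
    """Hypothetical placeholder for the Tika text extraction step."""
    return raw.decode("utf-8", errors="replace")


def analyze(text: str) -> dict:
    """Hypothetical placeholder for the NLP and pattern matching step."""
    return {"word_count": len(text.split())}


def tika_step(tubes) -> None:
    """Take one document job from scio_doc and forward its text to scio_analyze."""
    job = tubes["scio_doc"].get_nowait()
    tubes["scio_analyze"].put(
        {"filename": job["filename"], "text": extract_text(job["raw"])}
    )


def analyze_step(tubes) -> dict:
    """Take one text job from scio_analyze and return the analysis result."""
    job = tubes["scio_analyze"].get_nowait()
    return {"filename": job["filename"], **analyze(job["text"])}


if __name__ == "__main__":
    tubes = make_tubes()
    tubes["scio_doc"].put(
        {"filename": "report.txt", "raw": b"Aviation and automobile companies"}
    )
    tika_step(tubes)
    print(analyze_step(tubes))
```

Decoupling the two stages through tubes means each server can be restarted or scaled independently, and text can be fed to the analyzer directly, as the stdin example shows.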
Scio Submit
Submit a document (from a file or URI) to scio_api.
Example:
scio-submit \
--uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
--scio-baseuri http://localhost:3000/submit \
--tlp white
Running as a service
Systemd compatible service scripts can be found under examples/systemd.
To install:
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
scio-feed cron job
To continuously fetch new content from feeds, you can add scio-feeds to cron like this (make sure the directory $HOME/logs exists):
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1
# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
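For hosts where the cleanup is easier to run from Python than from find, the following sketch approximates the second cron entry. The function name `delete_old_logs` is hypothetical. Unlike find, it does not descend into subdirectories, and its age comparison is exact rather than rounded to whole minutes.

```python
import time
from pathlib import Path

SEVEN_DAYS_IN_MINUTES = 10080  # same threshold as -mmin +10080


def delete_old_logs(log_dir, max_age_minutes=SEVEN_DAYS_IN_MINUTES, now=None):
    """Delete scio-feed.log.* files older than the threshold; return their names."""
    current_time = time.time() if now is None else now
    cutoff = current_time - max_age_minutes * 60
    removed = []
    for path in sorted(Path(log_dir).glob("scio-feed.log.*")):
        # Only regular files whose last modification is before the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    log_dir = Path.home() / "logs"
    if log_dir.is_dir():
        print("removed:", delete_old_logs(log_dir))
```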
Local development
Use pip to install in local development mode. act-scio uses namespacing, so it is not compatible with setup.py install or setup.py develop.
In the repository, run:
pip3 install --user -e .