
Set of Python tools for the RATOM project


libratom


Python library and supporting utilities to parse and process PST and MBOX email sources.

This project is under development

Installation

Libratom requires Python 3.6 or newer, and can be installed via the Python Package Index (PyPI). Installing via pip will automatically install all required dependencies.

To install and test this software in a new Python virtual environment on Ubuntu 16.04 LTS or newer:

Make sure Python 3.6 or newer, python3-pip, and python3-venv are installed:

sudo apt install python3 python3-pip python3-venv

Create and activate a Python virtual environment:

python3 -m venv venv
source venv/bin/activate

Make sure pip is upgraded to the latest version:

pip install --upgrade pip

Install libratom:

pip install libratom

Entity extraction

Libratom provides a CLI with planned support for a range of email processing tasks. Currently, the CLI supports entity extraction from individual PST and mbox files, or directories containing one or more PST and mbox files.

To see available commands, type:

(venv) user@host:~$ ratom -h

To see detailed help for the entity extraction command, type:

(venv) user@host:~$ ratom entities -h

To run the extractor with default settings over a PST or mbox file, or a directory containing one or more PST and mbox files, type the following:

(venv) user@host:~$ ratom entities -p /path/to/PST-or-mbox-file-or-directory

Progress is displayed in a bar at the bottom of the window. To terminate a job early and shut down all workers, type Ctrl-C.

By default, the tool uses the spaCy en_core_web_sm model and starts as many concurrent jobs as there are virtual cores available. Entities are written to a sqlite3 file that is automatically named using the existing file or directory name and the current datetime stamp, and that has the following schema:

[Figure: RATOM database schema]

The schema contains 3 tables representing file information, message information and entity information.

In the entity table, text is the entity instance, label_ is the entity type, and filepath is the PST or mbox file associated with this entity. Full message and file information for each entity is also available through message_id and file_report_id respectively. Note that pff_identifier (a message ID specific to PST files) will not be populated for messages located in mbox files. Examples of how to query these tables can be found in the Interactive examples section near the end of this README.
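As a rough illustration of the entity table described above, the following sketch counts entities per type using Python's built-in sqlite3 module. The table name entity and the column types are assumptions based on the description in this README, and the in-memory database with made-up rows is only a stand-in so the example runs without real output. Check the schema of your own output file before relying on it.

```python
import sqlite3


def count_entities_by_label(conn):
    """Return (label, count) rows, most frequent label first."""
    # Assumed names: a table called "entity" with a "label_" column.
    query = (
        "SELECT label_, COUNT(*) AS n FROM entity "
        "GROUP BY label_ ORDER BY n DESC, label_ ASC"
    )
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a tiny in-memory stand-in with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity (text TEXT, label_ TEXT, filepath TEXT, "
        "message_id INTEGER, file_report_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?, ?, ?, ?)",
        [
            ("Chapel Hill", "GPE", "/data/sample.pst", 1, 1),
            ("UNC", "ORG", "/data/sample.pst", 1, 1),
            ("Raleigh", "GPE", "/data/sample.mbox", 2, 2),
        ],
    )
    return conn


demo = build_demo_db()
for label, n in count_entities_by_label(demo):
    print(label, n)
```

To run the same query against real output, replace build_demo_db() with a call to sqlite3.connect() that points at the sqlite3 file the tool produced.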

Advanced CLI uses

The CLI provides additional flags to tune performance, output location, and verbosity of the tool. Some example use cases are provided below.

To use a different entity model, use the --spacy-model flag. The following example directs the tool to use the multi-language model:

(venv) user@host:~$ ratom entities -p --spacy-model xx_ent_wiki_sm /path/to/PST-or-mbox-file-or-directory

To specify the number of jobs that may be run concurrently, use the -j flag. The following example sets the number of concurrent jobs to 2:

(venv) user@host:~$ ratom entities -p -j 2 /path/to/PST-or-mbox-file-or-directory

To change the name or location used for the sqlite3 output file, use the -o flag. Specifying a directory will result in the automatically named file being written to that path. Specifying a path that includes a filename will force the use of that filename. In the following example, the sqlite3 database will be named filename.db:

(venv) user@host:~$ ratom entities -p -o /path/to/directory/filename.db /path/to/PST-or-mbox-file-or-directory
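The -o rule described above (a directory receives an automatically named file, while a path with a filename is used as given) can be sketched in Python as follows. This is an illustration only, not libratom's own code: the function name resolve_output_path and the file name pattern are hypothetical, and the name the tool actually generates may differ.

```python
from datetime import datetime
from pathlib import Path


def resolve_output_path(out, source, now=None):
    """Mimic the -o rule described above (illustration, not libratom code)."""
    out, source = Path(out), Path(source)
    if out.is_dir():
        # Directory given: place an automatically named file inside it.
        # The name pattern below is hypothetical; the real tool's may differ.
        now = now or datetime.now()
        stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
        return out / f"{source.stem}_entities_{stamp}.sqlite3"
    # Path that is not an existing directory: treat it as the output filename.
    return out


print(resolve_output_path("/path/to/directory/filename.db", "/data/archive.pst"))
```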

To view more detailed output during the job (for example, if you encounter unexpected failures), you can increase the level of output verbosity with the -v flag. Additional v's increase verbosity. In the following example, we have increased verbosity to level 2:

(venv) user@host:~$ ratom entities -p -vv /path/to/PST-or-mbox-file-or-directory
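When the tool is driven from a script, the flags shown above can be combined programmatically. The sketch below assembles an argument list using only the flags documented in this README (-p, --spacy-model, -j, -o, -v). The helper name build_ratom_command is hypothetical, and the sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_ratom_command(path, jobs=None, spacy_model=None, out=None, verbosity=0):
    """Assemble a `ratom entities` argument list from the flags shown above."""
    cmd = ["ratom", "entities", "-p"]
    if spacy_model:
        cmd += ["--spacy-model", spacy_model]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    if out:
        cmd += ["-o", str(out)]
    if verbosity > 0:
        # Repeat "v" to raise verbosity, e.g. 2 -> "-vv".
        cmd.append("-" + "v" * verbosity)
    cmd.append(str(path))
    return cmd


cmd = build_ratom_command("/path/to/mail", jobs=2, verbosity=2)
# Quote each argument so the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from within the virtual environment where libratom is installed.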

Interactive examples

More usage documentation will appear here as the project matures. For now, you can try out some of the functionality in Jupyter notebooks we've prepared at:

https://github.com/libratom/ratom-notebooks

License(s)

Logos, documentation, and other non-software products of the RATOM team are distributed under the terms of Creative Commons 4.0 Attribution. Software items in RATOM repositories are distributed under the terms of the MIT License. See the LICENSE file for additional details.

© 2019, The University of North Carolina at Chapel Hill.

Development Team and Support

Developed by the RATOM team at the University of North Carolina at Chapel Hill.

See https://ratom.web.unc.edu for additional project details, staff bios, and news.
