texta-parsers
A Python package for file parsing.
The main class in the package is DocParser. The package also supports sophisticated parsing of emails, which is implemented in the class EmailParser. If you only need to parse emails, you can specify this with the parameter parse_only_extensions. It is also possible to use EmailParser independently, but then attachments will not be parsed.
Requirements
NB! Starting from version 3.0.0, only Elasticsearch 8 clusters are supported.
Most file types are parsed with Apache Tika. The other required tools are:
Tool | File Type |
---|---|
pst-utils | .pst |
digidoc-tool | .ddoc .bdoc .asics .asice |
rar-nonfree | .rar |
lxml | XML HTML |
Installation of required packages on Ubuntu/Debian:
sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y
sudo sh install-digidoc.sh
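After installing the system packages, it can help to confirm that the external tools are actually on the PATH. The following is a minimal sketch using only the standard library. The executable names in `ASSUMED_BINARIES` are assumptions (the packages listed above may install binaries under different names on your system), and `find_missing_tools` is a hypothetical helper, not part of texta-parsers.

```python
import shutil

# Assumed executable names for the packages listed above.
# Adjust these if your distribution installs the binaries under other names.
ASSUMED_BINARIES = {
    "pst-utils": "readpst",
    "digidoc-tool": "digidoc-tool",
    "rar": "rar",
}


def find_missing_tools(binaries=ASSUMED_BINARIES):
    """Return the package names whose assumed executable is not on PATH."""
    return sorted(
        package
        for package, executable in binaries.items()
        if shutil.which(executable) is None
    )


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the assumed executables were found; it does not verify their versions.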
Requires our custom version of Apache TIKA with relevant Tesseract language packs installed:
sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest
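Before parsing anything, you may want to check that the Tika container is reachable on the published port. The sketch below assumes the custom image exposes the standard Apache Tika Server `/version` endpoint on `localhost:9998`. That is an assumption about the image, and `tika_is_up` is a hypothetical helper rather than part of texta-parsers.

```python
import http.client
import urllib.error
import urllib.request


def tika_is_up(base_url="http://localhost:9998", timeout=3.0):
    """Return True if a GET to <base_url>/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False


if __name__ == "__main__":
    print("Tika reachable:", tika_is_up())
```

A `False` result right after `docker run` may simply mean the server is still starting, so retrying after a few seconds is reasonable.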
Installation
Base install (without MLP & Face Analyzer):
pip install texta-parsers
Install with MLP:
pip install texta-parsers[mlp]
Install with whole bundle:
pip install texta-parsers[mlp]
Testing
python -m pytest -rx -v tests
Description
DocParser
A file parser. Input can be either bytes or a path to the file as a string. See the user guide for more information. DocParser also includes EmailParser.
EmailParser
For parsing email messages and mailboxes. Supported file formats are Outlook Data File (.pst), mbox (.mbox) and EML (.eml). Can be used separately from DocParser but then attachments are not parsed. User guide can be found here and documentation here.
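To illustrate what separates the message itself from its attachments in an EML file, here is a small sketch that uses only Python's standard `email` module. It is not EmailParser and does not reflect its API or output format. `summarize_eml` and `build_sample` are hypothetical helpers, and the sample message is made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_eml(raw: bytes) -> dict:
    """Extract headers, body text and attachment names from raw EML bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg.get("subject", "")),
        "from": str(msg.get("from", "")),
        "body": body.get_content() if body is not None else "",
        # Attachments are only listed here; their contents are not parsed.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


def build_sample() -> bytes:
    """Build a made-up message with one attachment, serialized as EML bytes."""
    msg = EmailMessage()
    msg["Subject"] = "Report"
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="data.csv",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    print(summarize_eml(build_sample()))
```

In this sketch the attachment is only identified by name. Turning its bytes into text is the step that requires a document parser.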
Source Distribution
File details
Details for the file texta-parsers-3.0.0.tar.gz.
File metadata
- Download URL: texta-parsers-3.0.0.tar.gz
- Upload date:
- Size: 34.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 01a5f0705f04117743a68ae3ccf8f21c86a7459676c6ae2339f9dc5568e5bbea |
MD5 | 2cc2e79306780a13ac429cdd37675619 |
BLAKE2b-256 | bd8e2e50f416119fbec1196503d727b80663736b55e3b8aff2032db3cc243bb9 |
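A downloaded source distribution can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library. The local file path is an assumption (the file is expected in the current directory), and `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
import hmac

# SHA256 digest published above for texta-parsers-3.0.0.tar.gz.
EXPECTED_SHA256 = "01a5f0705f04117743a68ae3ccf8f21c86a7459676c6ae2339f9dc5568e5bbea"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected.lower())


if __name__ == "__main__":
    import os

    # Assumed location of the downloaded archive.
    archive = "texta-parsers-3.0.0.tar.gz"
    if os.path.exists(archive):
        print("digest OK" if matches(archive) else "digest MISMATCH")
    else:
        print(f"{archive} not found in the current directory")
```

pip can perform the same check during installation through its hash-checking mode, by pinning the digest in a requirements file with `--hash=sha256:...`.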