texta-parsers

A Python package for file parsing.

The main class in the package is DocParser. Email parsing is implemented in a separate class, EmailParser, which DocParser includes. To parse only emails with DocParser, restrict it with the parse_only_extensions parameter. EmailParser can also be used on its own, but attachments are then not parsed.

Requirements

NB! Starting from version 3.0.0, only Elasticsearch 8 clusters are supported.

Most file types are parsed with Tika. The following additional tools are required for specific file types:

Tool	File Type
pst-utils	.pst
digidoc-tool	.ddoc .bdoc .asics .asice
rar-nonfree	.rar
lxml	XML HTML
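
The table above can be read as an extension-to-tool lookup. The sketch below is illustrative only: required_tool is a hypothetical helper, not part of texta-parsers, and the .xml, .html and .htm entries for lxml are an assumption based on the "XML HTML" row.

```python
from pathlib import Path

# Extension -> external tool, copied from the requirements table.
# The .xml/.html/.htm entries are an assumption based on the "XML HTML" row.
EXTRA_TOOLS = {
    ".pst": "pst-utils",
    ".ddoc": "digidoc-tool",
    ".bdoc": "digidoc-tool",
    ".asics": "digidoc-tool",
    ".asice": "digidoc-tool",
    ".rar": "rar-nonfree",
    ".xml": "lxml",
    ".html": "lxml",
    ".htm": "lxml",
}


def required_tool(filename: str) -> str:
    """Return the tool a file type depends on, falling back to Tika."""
    suffix = Path(filename).suffix.lower()
    return EXTRA_TOOLS.get(suffix, "tika")
```

A lookup like this can help decide which system packages a deployment actually needs before installing all of them.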

Installation of required packages on Ubuntu/Debian:

sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y

sudo sh install-digidoc.sh

The package also requires a running instance of our custom Apache Tika server, which has the relevant Tesseract language packs installed:

sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest
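
Before parsing, it can be useful to confirm that the Tika container is answering on port 9998. The following is a minimal sketch using only the standard library. It assumes the custom image keeps the /version endpoint that the stock Apache Tika server exposes; tika_is_reachable is a hypothetical helper name, not part of texta-parsers.

```python
import urllib.request


def tika_is_reachable(base_url: str = "http://localhost:9998",
                      timeout: float = 3.0) -> bool:
    """Return True if a Tika server answers on its /version endpoint.

    Assumption: the server exposes GET /version, as stock Tika server does.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts and HTTP error responses.
        # ValueError covers a malformed URL.
        return False
```

Calling tika_is_reachable() with no arguments checks the address used by the docker command above.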

Installation

Base install (without MLP & Face Analyzer):

pip install texta-parsers

Install with MLP:

pip install texta-parsers[mlp]

Install the whole bundle:

pip install texta-parsers[mlp]

Testing

python -m pytest -rx -v tests

Description

DocParser

A file parser. Input can be given either as bytes or as a string containing the path to the file. See the user guide for more information. DocParser also includes EmailParser.
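
The "bytes or path" input convention can be illustrated with a small normalizing function. This is a sketch of the general pattern, not DocParser's actual code; load_bytes is a hypothetical name.

```python
from pathlib import Path
from typing import Union


def load_bytes(source: Union[bytes, bytearray, str]) -> bytes:
    """Return file content as bytes, whether given raw bytes or a path string."""
    if isinstance(source, (bytes, bytearray)):
        # Already file content: pass it through unchanged.
        return bytes(source)
    if isinstance(source, str):
        # A string is treated as a filesystem path.
        return Path(source).read_bytes()
    raise TypeError(f"expected bytes or a path string, got {type(source).__name__}")
```

Accepting both forms lets a caller parse a file from disk or content that is already in memory, such as an upload or an email attachment.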

EmailParser

Parses email messages and mailboxes. Supported file formats are Outlook Data File (.pst), mbox (.mbox) and EML (.eml). EmailParser can be used separately from DocParser, but attachments are then not parsed. A user guide and API documentation are available for this class.
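
To show what parsing an EML message involves, here is a sketch that uses Python's standard email module rather than EmailParser itself, whose API is not described here. The summarize_eml and build_sample_eml helpers and the returned field names are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_eml(raw: bytes) -> dict:
    """Extract headers, body text and attachment names from raw EML bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["subject"]),
        "from": str(msg["from"]),
        "body": body.get_content().strip() if body is not None else "",
        # Only attachment names are listed; the contents are not parsed here.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


def build_sample_eml() -> bytes:
    """Build a small message with one attachment for demonstration."""
    msg = EmailMessage()
    msg["Subject"] = "Quarterly report"
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content("See the attached file.")
    msg.add_attachment(b"col1,col2\n1,2\n", maintype="application",
                       subtype="octet-stream", filename="report.csv")
    return msg.as_bytes()
```

The sketch lists attachment names but does not parse their contents, which mirrors the limitation described above for standalone use of EmailParser.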

Project details

Release history

Version	Release date
3.0.0 (this version)	Sep 23, 2024
2.8.2	Nov 8, 2023
2.8.1	Aug 28, 2023
2.7.19	Feb 10, 2022
2.7.18	Jan 28, 2022
2.7.17	Jan 27, 2022
2.7.16	Jan 4, 2022
2.7.15	Jan 3, 2022
2.7.14	Jan 3, 2022
2.7.13	Dec 10, 2021
2.7.10	Mar 23, 2021
2.7.9	Mar 22, 2021
2.7.8	Mar 17, 2021
2.7.7	Mar 16, 2021
2.7.6	Mar 12, 2021
2.7.5	Mar 11, 2021
2.7.4	Mar 9, 2021
2.7.3	Mar 8, 2021
2.7.2	Mar 8, 2021
2.7.1	Mar 2, 2021
2.7.0	Feb 24, 2021
2.6.4	Feb 15, 2021
2.6.3	Feb 8, 2021
2.6.2	Feb 4, 2021
2.6.1	Feb 4, 2021
2.6.0	Feb 3, 2021
2.5.2	Feb 1, 2021
2.5.1	Jan 28, 2021
2.5.0	Jan 12, 2021
2.4.10	Dec 16, 2020
2.4.9	Nov 9, 2020
2.4.7	Nov 5, 2020
2.4.6	Nov 3, 2020
2.4.4	Oct 30, 2020
2.4.3	Oct 28, 2020
2.4.2	Oct 21, 2020
2.4.0	Oct 12, 2020
2.3.2	Sep 14, 2020
2.3.1	Sep 2, 2020
2.3.0	Aug 27, 2020
Download files

Source Distribution

texta-parsers-3.0.0.tar.gz (34.4 kB)

Uploaded Sep 23, 2024 Source

File details

Details for the file texta-parsers-3.0.0.tar.gz.

File metadata

Download URL: texta-parsers-3.0.0.tar.gz
Upload date: Sep 23, 2024
Size: 34.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for texta-parsers-3.0.0.tar.gz
Algorithm	Hash digest
SHA256	`01a5f0705f04117743a68ae3ccf8f21c86a7459676c6ae2339f9dc5568e5bbea`
MD5	`2cc2e79306780a13ac429cdd37675619`
BLAKE2b-256	`bd8e2e50f416119fbec1196503d727b80663736b55e3b8aff2032db3cc243bb9`
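
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses the standard hashlib module; sha256_of_file and matches_published_hash are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published for texta-parsers-3.0.0.tar.gz (from the table above).
EXPECTED_SHA256 = "01a5f0705f04117743a68ae3ccf8f21c86a7459676c6ae2339f9dc5568e5bbea"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("texta-parsers-3.0.0.tar.gz") should return True for an intact download of the 3.0.0 source distribution.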
