documentparser

A simple CLI tool that extracts all the text contained in a document.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 2.7

Project description

# Document Parser

A simple CLI tool that extracts all the text contained in a document.

Installation

Run the following commands before installing DocumentParser.

Debian/Ubuntu

sudo apt-get update
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
pip install docparser
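
As a rough sanity check after the system packages are installed, the sketch below reports which helper programs are still missing from PATH. The function name `missing_tools` is hypothetical, and the command names in the final line are assumptions about what the packages above provide (for example, `pdftotext` from poppler-utils); adjust the list to your system.

```shell
#!/bin/sh
# Print every command given as an argument that cannot be found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed command names for the packages installed above.
missing_tools antiword unrtf pdftotext pstotext tesseract sox swig \
    || echo "install the tools listed above before continuing" >&2
```

An empty output means every listed command was found.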

Mac OS X

brew install pkg-config poppler
brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig

Fedora / CentOS

Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because several dependencies are missing from the standard repositories. This is the hardest route, but in the end you'll be proud of yourself XD.

yum -y update
yum install python-pip

Required by the .docx parser which uses lxml via python-docx.

yum install libxml2 libxslt-devel libxml2-devel

Required by the .docx parser which uses lxml via python-docx.

yum install libxslt

Required by the .doc and .ps parsers.

wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
rpm -Uvh cert-forensics-tools-release*rpm
yum --enablerepo=forensics install antiword
yum --enablerepo=forensics install pstotext

Required by the .pdf parser.

yum install poppler-utils

Required by the .jpg, .png and .gif parsers.

cd /opt
yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel

Install AutoConf-Archive

wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm
rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm

Install Leptonica from Source

wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
tar -zxvf leptonica-1.75.3.tar.gz
cd leptonica-1.75.3
./autobuild
./configure
make
make install
cd ..

Install Tesseract from Source

wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
tar -zxvf 3.05.01.tar.gz
cd tesseract-3.05.01
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
make install
ldconfig
cd ..

Download and install tesseract language files

wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
mv *.traineddata /usr/local/share/tessdata
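
The repeated wget lines above all follow one URL pattern, so they can be generated from a list of language codes. This is only a sketch: the function name `traineddata_urls` and the variable `TESSDATA_BASE` are hypothetical, and the base URL is copied from the commands above.

```shell
#!/bin/sh
# Base URL taken from the wget commands above.
TESSDATA_BASE="https://github.com/tesseract-ocr/tessdata/raw/3.04.00"

# Print the .traineddata download URL for each language code given.
traineddata_urls() {
    for code in "$@"; do
        echo "$TESSDATA_BASE/$code.traineddata"
    done
}

# List the URLs for the languages used in this guide.
traineddata_urls ben eng hin tha osd

# To download instead of printing, feed the list to wget on stdin:
#   traineddata_urls ben eng hin tha osd | wget -i -
```

Adding another language is then a matter of appending its code to the argument list.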

Download Hindi Cube data

wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn
mv hin.* /usr/local/share/tessdata
ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest

Required by the .mp3 and .ogg parsers.

yum install sox
rm cert-forensics-tools-release-el7.rpm

Install textract without unsupported features

git clone https://github.com/deanmalmgren/textract.git
rm textract/requirements/python && cp requirements/textract/python textract/requirements/python
cd textract && chmod +x setup.py
python setup.py install
yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
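
Once the steps above have finished, it may be worth confirming that the interpreter can actually import textract. The sketch below is one way to do that; the helper name `check_import` is hypothetical, and `python` is assumed to be the Python 2.7 interpreter that ran `setup.py install`.

```shell
#!/bin/sh
# Print "ok" if MODULE ($2) imports under INTERPRETER ($1), otherwise "missing".
check_import() {
    if "$1" -c "import $2" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "missing"
    fi
}

# "python" is an assumption; substitute the interpreter you installed into.
echo "textract: $(check_import python textract)"
```

A result of "missing" usually means the install went to a different interpreter or one of the build steps failed.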

Release history

1.0a1 (pre-release), Jun 9, 2018

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

documentparser-1.0a1-py2-none-any.whl (3.5 kB)

Uploaded Jun 9, 2018 Python 2

File details

Details for the file documentparser-1.0a1-py2-none-any.whl.

File metadata

Download URL: documentparser-1.0a1-py2-none-any.whl
Upload date: Jun 9, 2018
Size: 3.5 kB
Tags: Python 2
Uploaded using Trusted Publishing? No

File hashes

Hashes for documentparser-1.0a1-py2-none-any.whl
Algorithm	Hash digest
SHA256	`93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12`
MD5	`b113aa4daf6e8fd47b15b27e1cfeb1b6`
BLAKE2b-256	`e3d76290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888`
