A simple CLI tool that extracts all the text contained in a document.
# Document Parser
Installation
Run the following commands before installing DocumentParser.
Debian/Ubuntu
- sudo apt-get update
- sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
- sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
- pip install docparser
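Once the packages above are installed, it can help to confirm that the command-line tools they provide are actually on the PATH. The following is a minimal sketch and not part of DocumentParser: `missing_tools` is a hypothetical helper name, and the binary names are an assumption about what the packages install (for example, `pdftotext` from poppler-utils).

```shell
#!/bin/sh
# Hypothetical helper: print every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Binary names assumed to be provided by the packages listed above.
missing=$(missing_tools antiword unrtf pdftotext pstotext tesseract sox ffmpeg swig)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    printf 'All external tools found.\n'
fi
```

If the script reports a missing tool, re-running the matching install command is usually the first thing to try.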
macOS
- brew install pkg-config poppler
- brew install --cask xquartz
- brew install poppler antiword unrtf tesseract swig
Fedora / CentOS
Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because some dependencies are missing. This is the hardest route, but in the end you'll be proud of yourself XD.
- yum -y update
- yum install python-pip
Required by the .docx parser which uses lxml via python-docx.
- yum install libxml2 libxslt-devel libxml2-devel
Required by the .docx parser which uses lxml via python-docx.
- yum install libxslt
Required by the .doc and .ps parser.
- wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
- rpm -Uvh cert-forensics-tools-release*rpm
- yum --enablerepo=forensics install antiword
- yum --enablerepo=forensics install pstotext
Required by the .pdf parser.
- yum install poppler-utils
Required by the .jpg, .png, and .gif parsers.
- cd /opt
- yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
Install AutoConf-Archive
- wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm
- rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm
Install Leptonica from Source
- wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
- tar -zxvf leptonica-1.75.3.tar.gz
- cd leptonica-1.75.3
- ./autobuild
- ./configure
- make
- make install
- cd ..
Install Tesseract from Source
- wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
- tar -zxvf 3.05.01.tar.gz
- cd tesseract-3.05.01
- ./autogen.sh
- PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
- LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
- make install
- ldconfig
- cd ..
Download and install tesseract language files
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
- mv *.traineddata /usr/local/share/tessdata
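The five downloads above differ only in the language code, so they could also be written as a loop. This is a sketch under the same assumptions as the commands above (the 3.04.00 tessdata tag and the /usr/local/share/tessdata target directory); `tessdata_url` and `fetch_tessdata` are hypothetical helper names, not part of Tesseract or DocumentParser.

```shell
#!/bin/sh
# Hypothetical helper: build the download URL for one language code,
# using the same tessdata 3.04.00 tag as the wget commands above.
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/3.04.00/%s.traineddata\n' "$1"
}

# Hypothetical helper: download each requested language into a target directory.
# Stops at the first failed download.
fetch_tessdata() {
    dest=$1
    shift
    for lang in "$@"; do
        wget -P "$dest" "$(tessdata_url "$lang")" || return 1
    done
}

# Usage (needs write access to the target directory):
# fetch_tessdata /usr/local/share/tessdata ben eng hin tha osd
```

Downloading straight into the target directory with `wget -P` avoids the separate `mv` step; adding another language then only means adding its code to the argument list.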
Download Hindi Cube data
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn
- mv hin.* /usr/local/share/tessdata
- ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest
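After the language files are in place, one way to check that Tesseract can see them is to inspect the output of `tesseract --list-langs`. The snippet below is a small sketch: `has_lang` is a hypothetical helper, and merging stderr into stdout is a precaution based on the assumption that some Tesseract versions print the list on stderr.

```shell
#!/bin/sh
# Hypothetical helper: read a language listing on stdin and succeed
# only if the given language code appears as a whole line.
has_lang() {
    grep -qx "$1"
}

# Usage, once tesseract is installed:
# tesseract --list-langs 2>&1 | has_lang hin && echo "hin is available"
```

The whole-line match (`-x`) keeps a code such as `hin` from matching inside a longer line of the listing.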
Required by .mp3 and .ogg parser
- yum install sox
- rm cert-forensics-tools-release-el7.rpm
Install textract without unsupported features
- rm textract/requirements/python && cp requirements/textract/python textract/requirements/python
- cd textract && chmod +x setup.py
- python setup.py install
- yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
Hashes for documentparser-1.0a1-py2-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12
MD5 | b113aa4daf6e8fd47b15b27e1cfeb1b6
BLAKE2b-256 | e3d76290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888