# Document Parser
A simple CLI tool that extracts all the text contained in a document.
Installation
Run the following commands before installing DocumentParser.
Debian/Ubuntu
- sudo apt-get update
- sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
- sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
- pip install docparser
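After the packages above are installed, a quick check that the command-line tools are actually on the PATH can save a failed run later. The following is a minimal sketch: `missing_tools` is a hypothetical helper (not part of DocumentParser), and the command names in the usage comment (for example, `pdftotext` for `poppler-utils`) are assumptions you may need to adjust.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DocumentParser):
# print every command from the argument list that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage, with assumed command names for the packages installed above:
# missing_tools antiword unrtf pdftotext pstotext tesseract sox ffmpeg swig
```

An empty output means every listed tool was found.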
macOS
- brew install pkg-config poppler
- brew install --cask xquartz
- brew install poppler antiword unrtf tesseract swig
Fedora / CentOS
Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because some dependencies are missing. This is the hardest route, but in the end you'll be proud of yourself XD.
- yum -y update
- yum install python-pip
Required by the .docx parser which uses lxml via python-docx.
- yum install libxml2 libxslt-devel libxml2-devel
Required by the .docx parser which uses lxml via python-docx.
- yum install libxslt
Required by the .doc and .ps parsers.
- wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
- rpm -Uvh cert-forensics-tools-release*rpm
- yum --enablerepo=forensics install antiword
- yum --enablerepo=forensics install pstotext
Required by the .pdf parser.
- yum install poppler-utils
Required by the .jpg, .png, and .gif parsers.
- cd /opt
- yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
Install AutoConf-Archive
- wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm
- rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm
Install Leptonica from Source
- wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
- tar -zxvf leptonica-1.75.3.tar.gz
- cd leptonica-1.75.3
- ./autobuild
- ./configure
- make
- make install
- cd ..
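Before moving on to Tesseract, it can help to confirm that the freshly built Leptonica is visible to pkg-config, since the Tesseract `./configure` step depends on finding it. This is a minimal sketch, assuming Leptonica registered itself under the pkg-config module name `lept` in `/usr/local/lib/pkgconfig` (the path used in the Tesseract step); `pkg_visible` and `report_pkg` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if pkg-config can find the named module
# when /usr/local/lib/pkgconfig is added to its search path.
pkg_visible() {
    PKG_CONFIG_PATH="/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}" \
        pkg-config --exists "$1" 2>/dev/null
}

# Print a one-line result for the module name given as the first argument.
report_pkg() {
    if pkg_visible "$1"; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# Usage (the module name "lept" is an assumption): report_pkg lept
```

If the module is reported as not found, re-running `make install` for Leptonica is the first thing to check.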
Install Tesseract from Source
- wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
- tar -zxvf 3.05.01.tar.gz
- cd tesseract-3.05.01
- ./autogen.sh
- PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
- LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
- make install
- ldconfig
- cd ..
Download and install tesseract language files
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
- mv *.traineddata /usr/local/share/tessdata
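The five downloads above differ only in the language code, so they can also be written as a loop. This is a minimal sketch: `tessdata_url` and `fetch_traineddata` are hypothetical helper names, and the URL pattern is simply the one used in the commands above.

```shell
#!/bin/sh
# Tag of the tessdata repository used in the commands above.
TESSDATA_TAG="3.04.00"

# Build the download URL for one language code (e.g. eng, hin).
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/%s/%s.traineddata\n' \
        "$TESSDATA_TAG" "$1"
}

# Download the listed languages into the current directory.
# Stops at the first failed download.
fetch_traineddata() {
    for lang in "$@"; do
        wget "$(tessdata_url "$lang")" || return 1
    done
}

# Usage (run manually): fetch_traineddata ben eng hin tha osd
```

Keeping the tag in one variable makes it easier to switch to a different tessdata release later.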
Download Hindi Cube data
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn
- mv hin.* /usr/local/share/tessdata
- ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest
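Once the data files are moved, it is worth checking that every expected language file actually landed in the tessdata directory. This is a minimal sketch, assuming the `/usr/local/share/tessdata` location used above; `missing_traineddata` is a hypothetical helper. If the `tesseract` binary is already installed, `tesseract --list-langs` gives a similar overview.

```shell
#!/bin/sh
# Hypothetical helper: print each language code whose .traineddata file
# is missing from the given directory.
# Usage: missing_traineddata DIR LANG...
missing_traineddata() {
    dir=$1
    shift
    for lang in "$@"; do
        [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
    done
}

# Usage: missing_traineddata /usr/local/share/tessdata ben eng hin tha osd
```

An empty output means all listed language files are present.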
Required by the .mp3 and .ogg parsers.
- yum install sox
- rm cert-forensics-tools-release-el7.rpm
Install textract without unsupported features
- rm textract/requirements/python && cp requirements/textract/python textract/requirements/python
- cd textract && chmod +x setup.py
- python setup.py install
- yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
File details
Details for the file documentparser-1.0a1-py2-none-any.whl.
File metadata
- Download URL: documentparser-1.0a1-py2-none-any.whl
- Upload date:
- Size: 3.5 kB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12 |
| MD5 | b113aa4daf6e8fd47b15b27e1cfeb1b6 |
| BLAKE2b-256 | e3d76290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888 |
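The digests above can be used to check a downloaded wheel before installing it. This is a minimal sketch, assuming either `sha256sum` (GNU coreutils) or `shasum` is available; `verify_sha256` is a hypothetical helper, not something the package provides.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE's SHA256 digest equals EXPECTED.
# Usage: verify_sha256 FILE EXPECTED
verify_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    else
        # Fallback for systems without GNU coreutils (e.g. macOS).
        actual=$(shasum -a 256 "$1" | cut -d ' ' -f 1)
    fi
    [ "$actual" = "$2" ]
}

# Usage with the SHA256 digest listed above:
# verify_sha256 documentparser-1.0a1-py2-none-any.whl \
#     93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12
```

The function returns a non-zero status on a mismatch, so it can be chained with `&&` in front of an install command.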