Document Parser

A simple CLI tool that extracts all the text contained in a document.

Installation

Run the following commands for your platform to install the required dependencies before installing DocumentParser.

Debian/Ubuntu

  • sudo apt-get update
  • sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
  • sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
  • pip install docparser
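
Once the packages are installed, it can help to confirm that the external programs are actually on the PATH. The following is a minimal Python sketch and not part of DocumentParser. The command names in `EXPECTED_TOOLS` are assumptions derived from the package names above (for example, `pdftotext` is assumed to come from poppler-utils), and `missing_tools` is a hypothetical helper. It relies on `shutil.which`, so it needs Python 3.3 or later even though the package itself targets Python 2.

```python
import shutil

# Assumed command names provided by the packages installed above.
EXPECTED_TOOLS = [
    "antiword", "unrtf", "pdftotext", "pstotext", "tesseract",
    "sox", "ffmpeg", "lame", "flac", "swig",
]


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing commands: " + ", ".join(missing))
    else:
        print("All expected commands were found.")
```

An empty result only means the commands were found, not that every parser will work; it is a quick sanity check before running the tool on real documents.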

macOS

  • brew install pkg-config poppler
  • brew install --cask xquartz
  • brew install poppler antiword unrtf tesseract swig

Fedora / CentOS

Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because some dependencies are missing. This is the hardest route, but in the end you'll be proud of yourself XD.

  • yum -y update
  • yum install python-pip

Required by the .docx parser which uses lxml via python-docx.

  • yum install libxml2 libxslt-devel libxml2-devel

Required by the .docx parser which uses lxml via python-docx.

  • yum install libxslt

Required by the .doc and .ps parsers.

Required by the .pdf parser.

  • yum install poppler-utils

Required by the .jpg, .png and .gif parsers.

  • cd /opt

  • yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel

Install AutoConf-Archive

Install Leptonica from Source

Install Tesseract from Source

  • wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
  • tar -zxvf 3.05.01.tar.gz
  • cd tesseract-3.05.01
  • ./autogen.sh
  • PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
  • LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
  • make install
  • ldconfig
  • cd ..
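
After `make install` and `ldconfig`, it is worth confirming that the freshly built binary is the one found on the PATH. The sketch below is an illustration under stated assumptions, not part of the original instructions: the helper names `parse_tesseract_version` and `installed_tesseract_version` are hypothetical, and the regular expression assumes the version banner starts with the word `tesseract` followed by the version number.

```python
import re
import subprocess


def parse_tesseract_version(output):
    """Extract the version string from `tesseract --version` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)+)", output)
    return match.group(1) if match else None


def installed_tesseract_version():
    """Run `tesseract --version` and return the reported version, or None."""
    try:
        result = subprocess.run(
            ["tesseract", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # The binary is not on the PATH (or cannot be executed).
        return None
    # Depending on the release, the banner goes to stdout or stderr, so check both.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

If the steps above succeeded, the reported version should match the 3.05.01 archive that was built; a different version suggests another Tesseract installation is shadowing it.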

Download and install tesseract language files

Download Hindi Cube data

Required by the .mp3 and .ogg parsers.

  • yum install sox
  • rm cert-forensics-tools-release-el7.rpm

Install textract without unsupported features

  • git clone https://github.com/deanmalmgren/textract.git

  • rm textract/requirements/python && cp requirements/textract/python textract/requirements/python

  • cd textract && chmod +x setup.py

  • python setup.py install

  • yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
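
With textract installed, extraction can be tried directly from Python. This is a minimal sketch, assuming the installation above succeeded. `textract.process` is textract's entry point and returns bytes; `extract_text` and its `process` parameter are hypothetical names added here so the wrapper can be exercised without textract, and this is not necessarily how DocumentParser itself calls the library.

```python
def extract_text(path, process=None, encoding="utf-8"):
    """Return the text of the document at `path` as a string."""
    if process is None:
        # Imported lazily so this module loads even when textract is absent.
        import textract
        process = textract.process
    raw = process(path)  # textract.process returns bytes
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(extract_text(sys.argv[1]))
```

Passing a file whose parser dependencies were skipped on this platform is expected to fail inside textract, which is a quick way to see which formats the reduced install actually supports.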

Built Distribution

documentparser-1.0a1-py2-none-any.whl (3.5 kB, Python 2)
