Document Parser

A simple CLI tool that extracts all the text contained in a document.

Installation

Run the following commands before installing DocumentParser.

Debian/Ubuntu

  • sudo apt-get update
  • sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
  • sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
  • pip install docparser
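
Once the packages above are installed, a quick check that the command-line tools are reachable can save some debugging later. The following is a minimal sketch, assuming Python 3.3 or newer for `shutil.which`; the list of executable names is an assumption about what each apt package provides and is not part of DocumentParser itself.

```python
import shutil

# Assumed executables provided by the apt packages listed above
# (for example, poppler-utils provides pdftotext). Adjust as needed.
EXPECTED_TOOLS = [
    "antiword", "unrtf", "pdftotext", "pstotext", "tesseract",
    "sox", "ffmpeg", "flac", "lame", "swig",
]


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(EXPECTED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All expected tools were found.")
```

An empty result only means the executables exist on PATH; it does not prove that each one works.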

macOS

  • brew install pkg-config poppler
  • brew install --cask xquartz
  • brew install poppler antiword unrtf tesseract swig

Fedora / CentOS

Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because some dependencies are missing. This is the hardest route, but in the end you'll be proud of yourself XD.

  • yum -y update
  • yum install python-pip

Required by the .docx parser which uses lxml via python-docx.

  • yum install libxml2 libxslt-devel libxml2-devel

Required by the .docx parser which uses lxml via python-docx.

  • yum install libxslt

Required by the .doc and .ps parsers.

Required by the .pdf parser.

  • yum install poppler-utils

Required by the .jpg, .png, and .gif parsers.

  • cd /opt

  • yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel

Install AutoConf-Archive

Install Leptonica from Source

Install Tesseract from Source

  • wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
  • tar -zxvf 3.05.01.tar.gz
  • cd tesseract-3.05.01
  • ./autogen.sh
  • PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
  • LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
  • make install
  • ldconfig
  • cd ..
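
After `make install` and `ldconfig`, it can be useful to confirm that the dynamic linker knows about the freshly built libraries. The sketch below uses `ctypes.util.find_library` from the standard library; the names `tesseract` and `lept` are assumptions about how the Tesseract and Leptonica shared libraries are named on your system.

```python
from ctypes.util import find_library


def locate_libraries(names):
    """Map each library name to what find_library reports, or None if not found."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    # "tesseract" and "lept" are assumed library names; adjust if yours differ.
    for name, found in locate_libraries(["tesseract", "lept"]).items():
        print(name + " -> " + (found if found else "not found"))
```

A `None` result suggests the library is not visible to the linker, which is worth ruling out before building anything that depends on it.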

Download and install tesseract language files

Download Hindi Cube data

Required by the .mp3 and .ogg parsers.

  • yum install sox
  • rm cert-forensics-tools-release-el7.rpm

Install textract without unsupported features

  • git clone https://github.com/deanmalmgren/textract.git

  • rm textract/requirements/python && cp requirements/textract/python textract/requirements/python

  • cd textract && chmod +x setup.py

  • python setup.py install

  • yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
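
With the steps above complete, a short import check can confirm that the Python side of the installation is in place. This is a minimal sketch that should run under Python 2.7 or Python 3; the module names `textract`, `docx` (the import name of python-docx), and `lxml` are assumptions about what a working setup needs.

```python
import importlib


def missing_modules(names):
    """Return the module names in `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed module list; extend it with whatever your parsers rely on.
    missing = missing_modules(["textract", "docx", "lxml"])
    if missing:
        print("Could not import: " + ", ".join(missing))
    else:
        print("All modules imported successfully.")
```

Run it with the same interpreter that you used for `python setup.py install`, since each interpreter has its own set of installed packages.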

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

documentparser-1.0a1-py2-none-any.whl (3.5 kB, Python 2)
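
The wheel file name above encodes which interpreters and platforms the file targets. As an illustration, the sketch below splits a wheel name into its dash-separated fields following the standard wheel naming convention; `parse_wheel_name` is a hypothetical helper written for this example, not part of DocumentParser or pip.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated fields."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[:-len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected number of fields in: " + filename)
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


if __name__ == "__main__":
    info = parse_wheel_name("documentparser-1.0a1-py2-none-any.whl")
    for key in ("name", "version", "build", "python", "abi", "platform"):
        print(key + ": " + str(info[key]))
```

For this release the Python tag is `py2`, the ABI tag is `none`, and the platform tag is `any`, which means a pure-Python wheel for Python 2 on any platform.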

Hashes for documentparser-1.0a1-py2-none-any.whl

Algorithm     Hash digest
SHA256        93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12
MD5           b113aa4daf6e8fd47b15b27e1cfeb1b6
BLAKE2b-256   e3d76290a56f1b37f90194b965fdd00dbdd0ecfaaf3779bc890ef58b57dc8888
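
If you download the wheel manually, you can compare its SHA256 digest against the value listed above. The following is a minimal sketch using `hashlib` from the standard library; the helper names and the local file path in the usage section are hypothetical.

```python
import hashlib

# SHA256 digest listed above for documentparser-1.0a1-py2-none-any.whl.
EXPECTED_SHA256 = "93137d65ee7193d4b8e5704dc4438ad31b9ec2a9954b5b648e9c8f1adcdc6f12"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded wheel; change it to your own path.
    wheel_path = "documentparser-1.0a1-py2-none-any.whl"
    try:
        print("Digest matches" if matches_expected(wheel_path) else "Digest does NOT match")
    except IOError:
        print("File not found: " + wheel_path)
```

A mismatch means the file differs from the one the digest was computed for, so it should not be installed.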
