OCR single or multiple files

KiwanOCR

This package takes a single PDF file or a list of PDF files and returns their content as a text file.

  • Requirements
  • Methods

Requirements

pip install or brew install

Make sure you have installed these dependencies:

  • brew install tesseract
  • brew install poppler
  • pip install pdf2image
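
A quick way to confirm that the two Homebrew dependencies are reachable is to look for their executables on the PATH. The sketch below assumes Tesseract installs a `tesseract` binary and that Poppler provides `pdftoppm` (one of the tools pdf2image calls). The helper name `missing_tools` is hypothetical and not part of KiwanOCR.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and poppler appear to be installed.")
```

An empty list means both executables were found; anything else names what still needs a brew install.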

import

Import the following:

  • from PIL import Image
  • import pytesseract ## python interface for tesseract
  • import os ## navigate, create directories
  • import shutil ## to delete the image folders with their imgs
  • from pdf2image import convert_from_path ## to turn pdf to image
  • import glob ## to glob files into a list
  • from pathlib import Path ## to specify path to your files
  • from natsort import natsorted, ns ## natural sorting
  • import re ## for regex
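
To show how these imports might fit together, here is a minimal sketch of the PDF-to-text path: render each page to an image, then OCR each image. It is not KiwanOCR's actual source. The function name `pdf_to_text` and its `render` and `read` hooks are hypothetical, while `convert_from_path(..., dpi=...)` and `pytesseract.image_to_string(..., lang=...)` are the real library calls. The sketch keeps page images in memory, whereas the os and shutil imports above suggest the package writes them to a temporary folder.

```python
def pdf_to_text(pdf_path, language="eng", resolution=300, render=None, read=None):
    """OCR every page of a PDF and return the page texts joined by newlines.

    `render` and `read` are hypothetical hooks so the flow can be tried
    without Tesseract installed; by default the real libraries are used.
    """
    if render is None:
        from pdf2image import convert_from_path  # requires poppler
        render = convert_from_path
    if read is None:
        import pytesseract  # requires the tesseract binary
        read = pytesseract.image_to_string

    pages = render(pdf_path, dpi=resolution)  # one PIL image per page
    return "\n".join(read(page, lang=language) for page in pages)
```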

Methods

Setup

  1. pip install kiwanocr
  2. from kiwano import ocr

OCR a single PDF

ocr.ocr_file(file_name, output_file_name, language, resolution)

Arguments

  • file_name: as a string
  • output_file_name: as a string
  • language: default is English (use tesseract --list-langs to retrieve language codes)
  • resolution: default is 300 dpi (use integer value between 100 and 1200)
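
The argument constraints above can be checked before any OCR work starts. This is a sketch only: `check_ocr_args` is a hypothetical helper, and using `"eng"` as the default language is an assumption based on Tesseract's code for English.

```python
def check_ocr_args(file_name, output_file_name, language="eng", resolution=300):
    """Validate the arguments described above and return them as a tuple."""
    if not isinstance(file_name, str) or not isinstance(output_file_name, str):
        raise TypeError("file_name and output_file_name must be strings")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(resolution, bool) or not isinstance(resolution, int):
        raise TypeError("resolution must be an integer")
    if not 100 <= resolution <= 1200:
        raise ValueError("resolution must be between 100 and 1200 dpi")
    return file_name, output_file_name, language, resolution
```

With valid values, the same arguments would then be passed on, for example `ocr.ocr_file("scan.pdf", "scan_text", "eng", 300)`; the file names here are placeholders.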

OCR a list of PDFs

ocr.ocr_files(list_name, output_file_name, language, resolution)

  • list_name: a list of PDF file names; this is the only difference from ocr_file, and the remaining arguments are the same
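
One way to build such a list is to glob a folder and sort the names naturally, so that `page2.pdf` comes before `page10.pdf`. The sketch uses only the standard library; `collect_pdfs` and `natural_key` are hypothetical names, and the natsort package listed under the imports offers `natsorted` for the same purpose.

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a name into text and integer chunks so digit runs sort numerically."""
    return [int(part) if part.isdecimal() else part.lower()
            for part in re.split(r"(\d+)", name)]


def collect_pdfs(folder):
    """Return the PDF paths in `folder` as strings, in natural order."""
    paths = Path(folder).glob("*.pdf")
    return [str(p) for p in sorted(paths, key=lambda p: natural_key(p.name))]
```

The resulting list could then be passed as `list_name` to `ocr.ocr_files`.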

Output

A .txt file is placed in an output folder.
