OCR single or multiple files

Project description

KiwanOCR

This package takes a single PDF file or a list of PDF files and returns their content as a text file.

  • Requirements
  • Methods

Requirements

pip install or brew install

Make sure you have installed these dependencies:

  • brew install tesseract
  • brew install poppler
  • pip install pdf2image
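
Before running any OCR, it can help to confirm that the dependencies above are actually reachable. The following is a minimal sketch, not part of KiwanOCR: the helper name missing_dependencies is hypothetical, and it assumes that poppler is detected through its pdftoppm binary and that the Python packages are importable as pdf2image and pytesseract.

```python
import importlib.util
import shutil


def missing_dependencies(binaries=("tesseract", "pdftoppm"),
                         modules=("pdf2image", "pytesseract")):
    """Return the names of command-line tools and Python modules that were not found.

    Hypothetical helper: the default names are assumptions based on the
    requirements list (pdftoppm is one of the tools shipped with poppler).
    """
    # shutil.which returns None when the executable is not on PATH
    missing = [name for name in binaries if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be located
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing
```

Calling missing_dependencies() with no arguments returns an empty list when everything was found, and otherwise the names that still need to be installed.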

import

Import the following:

  • from PIL import Image
  • import pytesseract ## python interface for tesseract
  • import os ## navigate, create directories
  • import shutil ## to delete the image folders with their imgs
  • from pdf2image import convert_from_path ## to turn pdf to image
  • import glob ## to glob files into a list
  • from pathlib import Path ## to specify path to your files
  • from natsort import natsorted, ns ## natural sorting
  • import re ## for regex
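
To illustrate how the two central imports may fit together, here is a rough sketch of a PDF-to-text step. It is not KiwanOCR's actual implementation: the function name pdf_to_text and its convert and recognize parameters are hypothetical, added only so the sketch can be tried without tesseract or poppler installed. The calls convert_from_path(path, dpi=...) and pytesseract.image_to_string(image, lang=...) are the real pdf2image and pytesseract functions.

```python
def pdf_to_text(pdf_path, language="eng", resolution=300,
                convert=None, recognize=None):
    """Render each PDF page to an image, OCR it, and join the page texts.

    Hypothetical sketch: `convert` and `recognize` default to the pdf2image
    and pytesseract functions but can be replaced by stand-ins.
    """
    if convert is None:
        # Imported lazily so the sketch loads even without pdf2image installed
        from pdf2image import convert_from_path as convert
    if recognize is None:
        import pytesseract
        recognize = pytesseract.image_to_string

    # One PIL image per page, rendered at the requested dpi
    pages = convert(str(pdf_path), dpi=resolution)
    # OCR the pages in order and separate them with a newline
    return "\n".join(recognize(page, lang=language) for page in pages)
```

With the real dependencies installed, pdf_to_text("scan.pdf") would return the recognized text as one string; writing it to a .txt file is left out of this sketch.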

Methods

Setup

  1. Install the package: pip install kiwanocr
  2. Import the module: from kiwano import ocr

OCR a single PDF

ocr.ocr_file(file_name, output_file_name, language, resolution)

Arguments

  • file_name: name of the PDF file, as a string
  • output_file_name: name of the output text file, as a string
  • language: default is English (use tesseract --list-langs to retrieve language codes)
  • resolution: default is 300 dpi (use integer value between 100 and 1200)
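
As an illustration of the arguments above, the sketch below wraps the documented call in a small function. The names validate_resolution and ocr_single are hypothetical, and "eng" is assumed to be the language code for the English default (it is tesseract's code for English). Whether output_file_name should include a .txt extension is not stated here, so the sketch passes it through unchanged.

```python
def validate_resolution(resolution=300):
    """Return `resolution` if it is an integer in the documented 100-1200 dpi range."""
    if isinstance(resolution, bool) or not isinstance(resolution, int):
        raise TypeError("resolution must be an integer")
    if not 100 <= resolution <= 1200:
        raise ValueError("resolution must be between 100 and 1200 dpi")
    return resolution


def ocr_single(file_name, output_file_name, language="eng", resolution=300):
    """Hypothetical wrapper around ocr.ocr_file with the argument order shown above."""
    dpi = validate_resolution(resolution)
    # Imported lazily; the import path follows the Setup steps
    from kiwano import ocr
    return ocr.ocr_file(file_name, output_file_name, language, dpi)
```

A call such as ocr_single("report.pdf", "report", "eng", 300) would then hand the checked values to ocr.ocr_file; the file names here are placeholders.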

OCR a list of PDFs

ocr.ocr_files(list_name, output_file_name, language, resolution)

  • list_name: a list of PDF file names; the remaining arguments are the same as for ocr_file
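
One way to build such a list is to collect every PDF in a folder in natural order, so that scan2.pdf comes before scan10.pdf. The sketch below is an assumption-laden example: collect_pdfs and natural_key are hypothetical helpers, and the natural ordering is done with re rather than natsort so that the snippet has no third-party dependencies.

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a name into text and number chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def collect_pdfs(folder):
    """Return the paths of all .pdf files in `folder` as strings, naturally sorted."""
    return sorted((str(path) for path in Path(folder).glob("*.pdf")),
                  key=natural_key)
```

The resulting list could then be passed on as list_name, for example ocr.ocr_files(collect_pdfs("scans"), "scans_text", "eng", 300), where the folder and output names are placeholders.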

Output

A .txt file is placed in an output folder.
