OCR single or multiple files
KiwanOCR
This package takes a single PDF file or a list of PDF files and returns their content as a text file.
- Requirements
- Methods
Requirements
pip install or brew install
Make sure you have installed these dependencies:
brew install tesseract
brew install poppler
pip install pdf2image
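To confirm that the dependencies above are in place, you can look for their executables and modules from Python. This is a minimal sketch using only the standard library. The function name check_dependencies is hypothetical, and the sketch assumes Poppler's pdftoppm binary is on your PATH, since that is one of the Poppler tools pdf2image relies on.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    found = {}
    # Command-line tools installed with brew: tesseract, and poppler's pdftoppm.
    for tool in ("tesseract", "pdftoppm"):
        found[tool] = shutil.which(tool) is not None
    # Python modules installed with pip (PIL is provided by the Pillow package).
    for module in ("pytesseract", "pdf2image", "PIL", "natsort"):
        found[module] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING can be installed with the brew or pip commands listed above.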
import
Import the following:
from PIL import Image
import pytesseract  ## Python interface for tesseract
import os  ## navigate, create directories
import shutil  ## to delete the image folders with their images
from pdf2image import convert_from_path  ## to turn PDF pages into images
import glob  ## to glob files into a list
from pathlib import Path  ## to specify the path to your files
from natsort import natsorted, ns  ## natural sorting
import re  ## for regex
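The sketch below shows how two of these imports typically fit together: pdf2image turns each PDF page into an image, and pytesseract reads the text from each image. It is an illustration under stated assumptions, not the package's actual implementation. The function name ocr_pdf_to_text and the to_images and to_text parameters are hypothetical; the two parameters exist only so the function can be tried with stand-ins when tesseract and poppler are not installed.

```python
def ocr_pdf_to_text(pdf_path, language="eng", resolution=300,
                    to_images=None, to_text=None):
    """Convert each page of a PDF to an image, OCR it, and return the joined text."""
    if to_images is None:
        # Imported lazily: pdf2image needs poppler installed.
        from pdf2image import convert_from_path
        to_images = convert_from_path
    if to_text is None:
        # Imported lazily: pytesseract needs the tesseract binary installed.
        import pytesseract
        to_text = pytesseract.image_to_string
    # One image per page, rendered at the requested dpi.
    pages = to_images(pdf_path, dpi=resolution)
    # OCR every page in order and join the page texts with newlines.
    return "\n".join(to_text(page, lang=language) for page in pages)
```

With the real libraries installed, calling ocr_pdf_to_text("scan.pdf") would use convert_from_path and pytesseract.image_to_string with their dpi and lang arguments.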
Methods
Setup
pip install kiwanocr
from kiwano import ocr
OCR a single PDF
ocr.ocr_file(file_name, output_file_name, language, resolution)
Arguments
- file_name: as a string
- output_file_name: as a string
- language: default is English (use tesseract --list-langs to retrieve language codes)
- resolution: default is 300 dpi (use an integer value between 100 and 1200)
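Before calling ocr_file, it can help to check the language and resolution arguments yourself. The helpers below are a hedged sketch and not part of the package: the names parse_language_codes, installed_languages, and check_resolution are hypothetical. The parser assumes the output of tesseract --list-langs starts with one header line followed by one language code per line.

```python
import subprocess


def parse_language_codes(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # The first non-empty line is assumed to be a header, so skip it.
    return lines[1:]


def installed_languages():
    """Run tesseract and return the language codes it reports."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Read stderr as a fallback in case the list is not written to stdout.
    return parse_language_codes(result.stdout or result.stderr)


def check_resolution(resolution):
    """Raise ValueError unless resolution is an integer from 100 to 1200 dpi."""
    if not isinstance(resolution, int) or not 100 <= resolution <= 1200:
        raise ValueError("resolution must be an integer between 100 and 1200")
    return resolution
```

For example, "eng" is the tesseract code for English, so a call could look like ocr.ocr_file("scan.pdf", "scan", "eng", check_resolution(300)), where the file names are placeholders.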
OCR a list of PDFs
ocr.ocr_files(list_name, output_file_name, language, resolution)
- list_name: a list of PDF file names, passed in place of the single file_name; the other arguments are the same as for ocr_file
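When you build the list of PDFs, the order of the file names may matter, and plain alphabetical sorting puts "scan10.pdf" before "scan2.pdf". The package imports natsort for natural sorting; the sketch below shows the same idea with only the standard library, so it is a stand-in for natsorted and not the package's own code. The names natural_key and natural_sorted and the file names are hypothetical.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'scan10' sorts after 'scan2'."""
    return [int(chunk) if chunk.isdecimal() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def natural_sorted(names):
    """Return the names sorted in natural (human) order."""
    return sorted(names, key=natural_key)


pdf_list = natural_sorted(["scan10.pdf", "scan2.pdf", "scan1.pdf"])
print(pdf_list)  # ['scan1.pdf', 'scan2.pdf', 'scan10.pdf']
```

The sorted pdf_list can then be passed as the first argument to ocr.ocr_files.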
Output
A .txt file is placed in an output folder.
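As an illustration of this output step, the sketch below writes OCR text to a .txt file inside a folder, creating the folder if it does not exist. It is not the package's code: the function name write_text_output is hypothetical, and the default folder name "output" and the way the file name is built from output_file_name are assumptions.

```python
from pathlib import Path


def write_text_output(text, output_file_name, folder="output"):
    """Write text to <folder>/<output_file_name>.txt and return the file path."""
    out_dir = Path(folder)
    # Create the folder (and any parents) if it is not there yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{output_file_name}.txt"
    out_path.write_text(text, encoding="utf-8")
    return out_path
```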