OCR while preserving document formatting and layout

🧾 ocralign

ocralign is an OCR utility built on top of Tesseract that preserves the layout and formatting of scanned documents. It supports both PDFs and images and outputs clean, structured text.


🔧 System Requirements

Before installing the Python package, you need to install some system dependencies required by pytesseract and pdf2image:

sudo apt update
sudo apt install -y tesseract-ocr
sudo apt install -y poppler-utils
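
Once the packages are installed, a quick check from Python can confirm that the executables are visible on PATH. This is a minimal sketch, not part of ocralign: the helper name `missing_tools` is hypothetical, and it assumes the relevant executables are `tesseract` (from tesseract-ocr) and `pdftoppm` (from poppler-utils).

```python
import shutil

# Assumed mapping of executable name -> apt package that provides it.
REQUIRED_TOOLS = {"tesseract": "tesseract-ocr", "pdftoppm": "poppler-utils"}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for exe, pkg in tools.items() if shutil.which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

If anything is reported missing, rerun the matching apt command above before installing the Python package.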

Installation

pip install ocralign

Usage example

from ocralign import process_pdf, process_image

# OCR a single image
print(process_image("./sample.png"))

# OCR a multi-page PDF (returns a list with the text of each page)
# layout options: "normalized", "absolute", "none".
#   Digital PDFs: "normalized" or "absolute" produce formatted output; "none" produces unformatted output.
#   Scanned (image) PDFs: "normalized" gives formatted output without absolute vertical line positioning;
#   "absolute" gives formatted output with absolute vertical line positioning; "none" is not supported.
texts = process_pdf(
    "./images-pdf.pdf",
    type="image",         # "image" for scanned PDFs, "digital" otherwise
    layout="normalized",
    add_marker=True,      # add page-boundary markers to the output
    dpi=300,
)

# OCR a PDF and write result to a file
process_pdf("./images-pdf.pdf", dpi=300, output_path="test.txt")
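
Because `process_pdf` is described above as returning one text per page, the result can also be split into one file per page. The sketch below is an illustration rather than an ocralign feature: `save_pages` is a hypothetical helper, and the sample list stands in for the real return value, which is assumed to be a list of strings.

```python
import tempfile
from pathlib import Path


def save_pages(texts, out_dir):
    """Write each page's text to out_dir/page_001.txt, page_002.txt, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(texts, start=1):
        path = out / f"page_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Stand-in for: texts = process_pdf("./images-pdf.pdf", dpi=300)
    texts = ["text of page one", "text of page two"]
    for path in save_pages(texts, tempfile.mkdtemp()):
        print(path)
```

Zero-padded file names keep the pages in order when the directory is listed alphabetically.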

Input image: [Sample OCR Input]

Extracted text:

Sample Tables                                                                                = Print

 Tables used in papers can be so simple that they are "informal" enough to be a sentence member and not
 require a caption, or they can be complex enough that they require spreadsheets spanning several pages.
 A table’s fundamental purpose should always be to visually simplify complex material, in particular when
 the table is designed to help the reader identify trends. Here, a simple table and a complex table are used
 to demonstrate how tables help writers to record and "visualize" information and data.


 Simple Table

 The simple table that follows, from a student's progress report to his advisor, represents how tables need
 not always be about data presentation. Here the rows and columns simply make it easy for the writer to
 present the necessary information with efficiency. This unnumbered and informal table, in effect, explains
 itself.




                     Plan for Weekly Progress for the Remainder of the Semester

      Week of     Contact Dr. Berinni for relevant literature suggestions.
      11/28       Read lit reviews from Vibrational Spectroscopy.
                  Research experimental methods used to test polyurethanes, including infrared (IR)
                  spectroscopy and nuclear magnetic resonance (NMR).

      Week of     Define specific ways that polyurethanes can be improved.
      12/5        Develop experimental plan.

      Week of     Create visual aids, depicting chemical reactions and experimental setups.
      12/12       Prepare draft of analytical report.

      Week of     Turn in copy of preliminary analytical report, to be expanded upon next semester.
      12/18





 Complex Table

 The following sample table is excerpted from a student's senior thesis about tests conducted on
 Pennsylvania coal. Note the specificity of the table’s caption. Also note the level of discussion following the
 table, and how the writer uses the data from the table to move toward an explanation of the trends that
 the table reveals.
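
The sample output keeps table columns aligned by padding with spaces. As a rough illustration of how word positions can be mapped onto a character grid, here is a minimal sketch. It is not ocralign's actual implementation: the function `render_layout`, its parameters, and the fixed `char_width` and `line_height` values are all assumptions made for the example.

```python
def render_layout(words, char_width=10, line_height=20):
    """Place (text, x, y) words on a character grid.

    x and y are the pixel coordinates of each word's top-left corner.
    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    rows = {}
    for text, x, y in words:
        row = round(y / line_height)
        rows.setdefault(row, []).append((round(x / char_width), text))

    output = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            # Keep at least one space between words that would collide.
            if line and col <= len(line):
                col = len(line) + 1
            # Pad with spaces up to the target column, then add the word.
            line = line.ljust(col) + text
        output.append(line)
    return "\n".join(output)


if __name__ == "__main__":
    words = [
        ("Week", 0, 0), ("of", 50, 0), ("Contact", 120, 0),
        ("11/28", 0, 20), ("Read", 120, 20),
    ]
    print(render_layout(words))
```

This sketch emits only rows that contain words, so vertical gaps between lines are not reproduced.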

Download files

Source Distribution

ocralign-0.1.3.tar.gz (21.4 kB)

Uploaded Source

Built Distribution

ocralign-0.1.3-py3-none-any.whl (20.5 kB)

Uploaded Python 3

File details

Details for the file ocralign-0.1.3.tar.gz.

File metadata

  • Download URL: ocralign-0.1.3.tar.gz
  • Size: 21.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ocralign-0.1.3.tar.gz
Algorithm     Hash digest
SHA256        3d16383876dbfc0bf76a37e0034a6cd12a04b146c487856b55857f8e8511af4b
MD5           86ef5f4a2b80eca8fb811e63e079d610
BLAKE2b-256   e9e36bcbe04afb8bf860937b543ebb1fa330130f865429bffc715ee3eaf90cd0

File details

Details for the file ocralign-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: ocralign-0.1.3-py3-none-any.whl
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ocralign-0.1.3-py3-none-any.whl
Algorithm     Hash digest
SHA256        bf3826eb8226e1acd4f92ba42fed686e4f244998c385d7151906ca493a10e5f1
MD5           5ceed6c84438465217ca42577263643a
BLAKE2b-256   f69ec9a068cff2a6b460b9031f8fa6a6964823284c2e299faccc8363e0ebaed2
