
A Python library for extracting text from various file formats. Supported formats: JPEG, PNG, PDF, DOCX, DOC, and TXT.


NT-TextLoader


Description

A Python module for extracting text content from various file types including PDFs, DOCX, DOC, text files, and images using Optical Character Recognition (OCR).

Installation Instructions

Before using this package, ensure you have installed the following system-level dependencies:

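1. On Linux

  • Tesseract OCR and LibreOffice (the leading ! runs the command from a notebook cell such as Jupyter or Colab; drop it in a regular terminal):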

    !apt install tesseract-ocr
    !apt install libtesseract-dev
    !apt-get --no-install-recommends install libreoffice -y
    !apt-get install -y libreoffice-java-common
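
After running the commands above, it can be useful to confirm that the system tools are actually reachable from Python. The following is a minimal sketch, not part of NT-TextFileLoader: the helper name find_missing_tools is hypothetical, and it assumes the executables are named tesseract and soffice (the usual names for Tesseract and LibreOffice on Linux, but your distribution may differ).

```python
import shutil


def find_missing_tools(tools=("tesseract", "soffice")):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper: the default executable names are an assumption
    about a typical Linux install, not something the package defines.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All system tools found.")
```

An empty result only means the executables are on PATH; it does not verify their versions or language data.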
    

2. On Windows

Steps for installing Tesseract on Windows:

  • 1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

  • 2. Install it to C:\Program Files (x86)\Tesseract-OCR

  • 3. Open a Command Prompt (inside your virtual environment) or an Anaconda Prompt.

  • 4. Run pip install pytesseract

To check that pytesseract can find the Tesseract executable, run the following in a Python prompt:

import pytesseract
print(pytesseract.get_tesseract_version())
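
On Windows, pytesseract may not find Tesseract if the install directory is not on PATH. In that case pytesseract lets you point it at the executable through pytesseract.pytesseract.tesseract_cmd. The sketch below assumes the install directory from step 2 above and that the executable is named tesseract.exe; the helper names tesseract_exe and configure_pytesseract are hypothetical and not part of this package.

```python
from pathlib import PureWindowsPath

# Install directory taken from step 2 of the Windows instructions.
INSTALL_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def tesseract_exe(install_dir=INSTALL_DIR):
    """Build the full path to tesseract.exe (file name is an assumption)."""
    return str(PureWindowsPath(install_dir) / "tesseract.exe")


def configure_pytesseract(install_dir=INSTALL_DIR):
    """Point pytesseract at the executable and return the detected version."""
    import pytesseract  # imported lazily so the helper above works without it

    pytesseract.pytesseract.tesseract_cmd = tesseract_exe(install_dir)
    return pytesseract.get_tesseract_version()
```

Calling configure_pytesseract() once at program start should be enough; it raises an error from pytesseract if the executable is not at that location.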

Installation

Install the package using pip:

pip install NT-TextFileLoader

Usage

from NT_TextFileLoader.text_loader import TextFileLoader

# Load text from a file
file_path = 'path/to/your/file'
extracted_text = TextFileLoader.load_text(file_path, min_text_length=50)
# If the extracted text is shorter than min_text_length (50 here), OCR is used to extract the text instead.
# Increase min_text_length to make the OCR fallback more likely.
print(extracted_text)
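
The min_text_length behaviour described in the comments above can be pictured as a simple threshold check. This is an illustrative sketch only, not the package's actual implementation: extract_with_fallback, direct_extract, and ocr_extract are hypothetical names, and comparing the raw string length is an assumption.

```python
def extract_with_fallback(direct_extract, ocr_extract, path, min_text_length=50):
    """Try direct extraction first; fall back to OCR if the result is too short.

    `direct_extract` and `ocr_extract` are hypothetical callables that take
    a file path and return a string.
    """
    text = direct_extract(path)
    if len(text) < min_text_length:
        # Too little text (for example a scanned PDF), so use OCR instead.
        return ocr_extract(path)
    return text


if __name__ == "__main__":
    # Stand-in extractors for demonstration purposes.
    short = lambda path: "abc"
    ocr = lambda path: "text recovered by OCR"
    print(extract_with_fallback(short, ocr, "scan.pdf", min_text_length=50))
```

Under this model, a larger min_text_length sends more files to OCR, which matches the advice in the usage comments.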

Supported File Types

  • PDF: Extracts text from PDF files.
  • DOCX: Extracts text from DOCX files.
  • DOC: Extracts text from legacy DOC files.
  • Text files: Loads text content from TXT files.
  • Images (JPG, PNG, JPEG, WEBP): Uses OCR to extract text from images.
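
To check in advance whether a file falls into one of the categories listed above, a caller could classify it by extension before calling the loader. This is a hedged sketch: the classify helper and the SUPPORTED mapping are hypothetical, and the package itself may detect file types differently.

```python
from pathlib import Path

# Extensions taken from the list above; category labels are illustrative.
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".webp": "image",
}


def classify(path):
    """Return the category for `path`, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return SUPPORTED[suffix]
```

The lookup is case-insensitive, so Report.PDF and report.pdf are treated the same way.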

Requirements

  • PyPDF2
  • python-docx
  • Pillow
  • pytesseract (For image-based text extraction)
  • langchain
  • unstructured
  • docx2txt
  • PyMuPDF
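
Several of these packages are imported under a name that differs from the name used with pip. The sketch below, which is not part of the package, checks which of them are importable in the current environment; the helper name missing_packages is hypothetical, and the mapping reflects the usual import names (docx for python-docx, PIL for Pillow, fitz for PyMuPDF).

```python
import importlib.util

# pip distribution name -> top-level import name
REQUIREMENTS = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "langchain": "langchain",
    "unstructured": "unstructured",
    "docx2txt": "docx2txt",
    "PyMuPDF": "fitz",
}


def missing_packages(requirements=REQUIREMENTS):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in requirements.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    print("Missing:", missing_packages() or "none")
```

This only tests that each module can be located; it does not import it or check its version.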

Contributions

Contributions, issues, and feature requests are welcome!

License

This project is licensed under the MIT License.
