A Python library for extracting text from various file formats. The supported file formats are JPG, JPEG, PNG, PDF, DOCX, DOC, and TXT.
NT-TextFileLoader
Description
A Python module for extracting text content from various file types including PDFs, DOCX, DOC, text files, and images using Optical Character Recognition (OCR).
Installation Instructions
Before using this package, ensure you have installed the following system-level dependencies:
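1. On Linux
Tesseract OCR and LibreOffice (the leading "!" is notebook-cell syntax, for example in Jupyter or Colab; omit it in a regular shell and add sudo if needed):
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get --no-install-recommends install libreoffice -y
!apt-get install -y libreoffice-java-common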
2. On Windows
Steps to install Tesseract on Windows:
1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.
2. Install it to C:\Program Files (x86)\Tesseract-OCR.
3. Open a command prompt (in your virtual environment, if you use one) or an Anaconda Prompt.
4. Run pip install pytesseract
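To check that pytesseract can be imported, type the following at the Python prompt (this confirms that the Python wrapper is installed, not the Tesseract binary itself):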
import pytesseract
print(pytesseract)
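The import check above only shows that the Python wrapper loads. As a rough sketch of how the system-level dependencies themselves could be checked, the snippet below looks for the executables on PATH using only the standard library. The helper names (`find_executable`, `check_dependencies`, `tesseract_version`) are hypothetical and not part of this package, and the LibreOffice launcher names `soffice` and `libreoffice` are the usual ones but are an assumption about your system.

```python
import shutil
import subprocess

# Executable names to look for. These are assumptions about a typical install:
# "soffice" and "libreoffice" are the usual LibreOffice launcher names.
DEPENDENCIES = {
    "tesseract": ["tesseract"],
    "libreoffice": ["soffice", "libreoffice"],
}


def find_executable(candidates):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_dependencies():
    """Map each dependency name to its resolved path, or None if not found."""
    return {dep: find_executable(names) for dep, names in DEPENDENCIES.items()}


def tesseract_version(path):
    """Return the first line printed by `<path> --version`, or None on failure."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    for dep, path in check_dependencies().items():
        print(f"{dep}: {path or 'NOT FOUND on PATH'}")
```

If Tesseract is installed but not on PATH, which can happen after the Windows installer, pytesseract can be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.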
Installation
Install the package using pip:
pip install NT-TextFileLoader
# You might also need to install the following Python packages
pip install PyPDF2
pip install python-docx
pip install docx2txt
pip install Pillow
pip install pytesseract
pip install langchain
pip install unstructured
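To see which of the packages listed above still need installing, a small check like the following may help. It is only a sketch: `missing_packages` is a hypothetical helper, not part of NT-TextFileLoader, and the mapping from pip names to import names reflects how these packages are normally imported (for example, python-docx is imported as `docx` and Pillow as `PIL`).

```python
import importlib.util

# pip package name -> top-level module name used in `import` statements.
PACKAGES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "docx2txt": "docx2txt",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "langchain": "langchain",
    "unstructured": "unstructured",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```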
Usage
from NT_TextFileLoader.text_loader import TextFileLoader
# Load text from a file
file_path = 'path/to/your/file'
extracted_text = TextFileLoader.load_text(file_path)
print(extracted_text)
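When several files need to be processed, the single call above can be wrapped in a loop that keeps going if one file fails. The sketch below uses a hypothetical helper, `load_many`, which takes the loader function as an argument so that it assumes nothing about TextFileLoader beyond the `load_text(file_path)` call shown above. Whether `load_text` raises an exception on unreadable input is an assumption; the README does not say.

```python
def load_many(paths, load_text):
    """Apply load_text to each path.

    Returns (texts, errors): two dicts keyed by path. `texts` holds the
    extracted text for files that loaded, `errors` holds a short message
    for files whose loader call raised an exception.
    """
    texts, errors = {}, {}
    for path in paths:
        try:
            texts[path] = load_text(path)
        except Exception as exc:  # keep processing the remaining files
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors


# Possible usage with this package (file names are placeholders):
# from NT_TextFileLoader.text_loader import TextFileLoader
# texts, errors = load_many(["report.pdf", "scan.png"], TextFileLoader.load_text)
# for path, message in errors.items():
#     print(f"Could not load {path}: {message}")
```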
Supported File Types
- PDF: Extracts text from PDF files.
- DOCX: Extracts text from DOCX files.
- DOC: Extracts text from legacy DOC files.
- Text files: Loads text content from TXT files.
- Images (JPG, PNG, JPEG, WEBP): Uses OCR to extract text from images.
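The list above can be turned into a quick pre-check that tells you which category a file falls into before you pass it to the loader. This is an illustrative sketch only: `file_category` and the extension table are hypothetical, built from the list above, and they do not describe how the package dispatches internally.

```python
from pathlib import Path

# Extension -> category, taken from the supported file types listed above.
CATEGORY_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".webp": "image",
}


def file_category(path):
    """Return the category for a path, or None if the extension is not listed."""
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["report.pdf", "Scan.JPG", "notes.md"]:
        print(name, "->", file_category(name) or "unsupported")
```

Matching is case-insensitive and looks only at the final extension, so a file such as `archive.tar.gz` is treated as unsupported.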
Requirements
- PyPDF2
- python-docx
- Pillow
- pytesseract (for image-based text extraction)
- langchain
Contributions
Contributions, issues, and feature requests are welcome!
License
This project is licensed under the MIT License - see the LICENSE file for details.