Tool to extract and store sentence embeddings to a fast and scalable vector db
Project description
pdf_ocr_txt
Applies OCR to a PDF file and extracts its content to a TXT file.
Assumes Tesseract and the Poppler tools (poppler-utils) are installed!
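Since the package relies on external programs, it can help to confirm they are on your PATH before running it. The following is a minimal sketch using only the standard library. The executable names `tesseract` and `pdftoppm` are assumptions (`pdftoppm` is one of the Poppler command-line tools; which of them this package actually calls is not stated here), and `missing_tools` is a hypothetical helper, not part of pdf_ocr_txt.

```python
import shutil

# Assumed executable names; adjust for your platform or installation.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty list from `missing_tools()` means every listed executable was found.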
USAGE
$ python3
>>> from pdf_ocr_txt.main import pdf_to_text
>>> pdf_to_text('my_file.pdf', 'my_output_directory')
See the result in my_output_directory/my_file.pdf.txt.
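To process a whole folder, the single call above can be wrapped in a loop. This is a sketch under stated assumptions: `convert_all` and `expected_output_path` are hypothetical helpers, not part of pdf_ocr_txt. The converter is passed in as a parameter, and the output naming (`<output directory>/<PDF file name>.txt`) simply follows the example above.

```python
from pathlib import Path


def expected_output_path(pdf_path, out_dir):
    """Where the TXT file is expected, following the naming in the usage
    example: <out_dir>/<pdf file name>.txt."""
    return Path(out_dir) / (Path(pdf_path).name + ".txt")


def convert_all(pdf_dir, out_dir, convert):
    """Call convert(pdf, out_dir) for every *.pdf file in pdf_dir.

    `convert` is any callable taking (pdf_path, output_directory),
    for example pdf_to_text. Returns the expected output paths.
    """
    outputs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        convert(str(pdf), str(out_dir))
        outputs.append(expected_output_path(pdf, out_dir))
    return outputs
```

With the package installed, this would be called as `convert_all('my_pdfs', 'my_output_directory', pdf_to_text)`, where `my_pdfs` is a placeholder folder name. Whether `pdf_to_text` creates the output directory itself is not stated here, so you may need to create it first.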
Enjoy, Paul Tarau May 2024
Source Distribution: pdf_ocr_txt-0.5.2.tar.gz (3.1 kB)