A simple tool for text extraction from PDF, EPUB, TXT, and DOCX files
extractText
A simple tool for extracting text from PDF, EPUB, TXT, and DOCX files. This library was primarily developed for personal use in various NLP-related projects.
Parsers used:
pdfplumber, pytesseract, PyPDF2, pdf2image and PIL for PDF processing
ebooklib, bs4 for EPUB
docx for DOCX
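The list above can be read as a lookup from file extension to parser libraries. The following is a minimal sketch of that idea only: `PARSERS_BY_EXTENSION` and `parsers_for` are hypothetical names introduced for illustration, not part of text-extra, and the library's actual dispatch logic is not shown in this description.

```python
from pathlib import Path

# Hypothetical table mirroring the parser list above; not text-extra's API.
PARSERS_BY_EXTENSION = {
    ".pdf": ["pdfplumber", "pytesseract", "PyPDF2", "pdf2image", "PIL"],
    ".epub": ["ebooklib", "bs4"],
    ".docx": ["docx"],
    ".txt": [],  # no parser is listed for TXT in the description
}


def parsers_for(file_path):
    """Return the parser libraries listed for this file's extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in PARSERS_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return PARSERS_BY_EXTENSION[suffix]
```

Matching on the lowercased suffix means `report.PDF` and `report.pdf` are treated alike; anything outside the four listed formats raises a `ValueError` rather than being guessed at.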
Installation
Install text-extra using pip:
pip install text-extra
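To confirm that the install succeeded, the standard library can report the installed version of a distribution. This is a small sketch, assuming Python 3.8 or later; `installed_version` is a hypothetical helper name, not something text-extra provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if text-extra is installed, otherwise None.
print(installed_version("text-extra"))
```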
Usage
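# extract_text comes from text-extra; its import line is not shown in this description.
# file_path is the path to a PDF, EPUB, TXT, or DOCX file.
extracted_text = extract_text(file_path)
# The result is either a single string or a dict mapping keys to text.
if isinstance(extracted_text, dict):
    for key, value in extracted_text.items():
        print(f"--- {key} ---\n{value}\n")
else:
    print(extracted_text)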
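Because the result may be either a string or a dict, downstream NLP code often wants a single string in both cases. The sketch below shows one way to normalize it, assuming the dict values are strings; `flatten_extracted` is a hypothetical helper, not part of text-extra, and what the dict keys represent is not specified in this description.

```python
def flatten_extracted(extracted):
    """Join a str-or-dict extraction result into one string."""
    if isinstance(extracted, dict):
        # Keep each key as a visible header so sections stay distinguishable.
        return "\n\n".join(
            f"--- {key} ---\n{value}" for key, value in extracted.items()
        )
    return extracted
```

A plain string passes through unchanged, so the helper can be applied to any result without checking its type first.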