Python package that spits out text from your document files!

Project description

THANK YOU FOR USING TEXTSPITTER!!

I created this little app to help me process documents from folder sets and batches. Instead of inspecting each file to work out its type, I thought it would be more prudent to read the file name and route it to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work because of damn Poppler. So after 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
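The routing idea described above can be sketched roughly as follows. This is only an illustration under assumptions, not TextSpitter's actual internals: the names `read_txt`, `read_docx`, `read_pdf`, `HANDLERS`, and `pick_extractor` are all hypothetical, and the docx/pdf handlers are left as placeholders.

```python
from pathlib import Path


def read_txt(path):
    """Read a plain-text file, replacing any bytes that fail to decode."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def read_docx(path):
    """Placeholder (hypothetical): a real handler would call a docx reader."""
    raise NotImplementedError("docx extraction is not part of this sketch")


def read_pdf(path):
    """Placeholder (hypothetical): a real handler would call a PDF reader."""
    raise NotImplementedError("pdf extraction is not part of this sketch")


# Map each supported file suffix to the function that handles it.
HANDLERS = {
    ".txt": read_txt,
    ".docx": read_docx,
    ".pdf": read_pdf,
}


def pick_extractor(path):
    """Choose a handler from the file name alone, ignoring letter case."""
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return HANDLERS[suffix]
```

Keeping the suffix-to-function mapping in one dictionary means a new file type only needs one new entry and one new handler.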

This is my first Python module, so I hope I did this well!

Installation

  • Type pip install TextSpitter
  • OPTIONAL: type pip install PyMuPDF to install the Python-MuPDF engine for better text extraction fidelity (e.g., maintaining correct whitespace)
    • You will need to follow the instructions to make sure PyMuPDF's dependencies are installed on your system. Wheels and binaries are available for Windows, Linux, and macOS, though if you're on something less common like NetBSD, FreeBSD, or a specialty Linux distro, you may be SOL. Fortunately, package managers like Yum, Pkgin, Apt-Get and so forth should have packages available straight from the terminal.
    • For detailed instructions, please visit https://github.com/rk700/PyMuPDF and maybe give those guys some kudos, because they worked their tails off.
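Since PyMuPDF is optional, it can help to check whether it is importable before relying on it. A minimal sketch, assuming PyMuPDF is exposed under the import name `fitz` (the name it has historically used); `has_module` and `PYMUPDF_AVAILABLE` are hypothetical names for illustration and are not part of TextSpitter:

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: PyMuPDF installs a module that is importable as "fitz".
PYMUPDF_AVAILABLE = has_module("fitz")

if PYMUPDF_AVAILABLE:
    print("PyMuPDF found")
else:
    print("PyMuPDF not found; try: pip install PyMuPDF")
```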

Directions

This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.

from TextSpitter import TextSpitter as TS

# Folder that holds the documents to process
folder_loc = 'foo/bar/'

docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'

doc_tup = (docx_file, pdf_file, text_file)

# Extract the text of each file, then join everything into one string
raw_text_payload = [TS(filename=ele) for ele in doc_tup]
text = '\n'.join(raw_text_payload)
print(text)
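For whole folders, the list of file names does not have to be typed by hand. The sketch below uses only the standard library to gather candidate files; `SUPPORTED_SUFFIXES` and `collect_documents` are hypothetical names, and the suffix set is an assumption based on the three file types shown in the example above.

```python
from pathlib import Path

# Assumption: these are the file types worth passing to the extractor.
SUPPORTED_SUFFIXES = {".docx", ".pdf", ".txt"}


def collect_documents(folder):
    """Return sorted paths (as strings) of supported files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The returned strings can then be used in place of the hand-built tuple of file names in the example above.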

TO DOs

  • Spruce up documentation
  • Add stream functionality for s3-based file reading
  • Expand functionality to other file types
  • TBD

WANT TO CONTRIBUTE!?

OH MY GOD, PLEASE DO.

Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.

Thanks, everyone!

Download files

Source Distribution

TextSpitter-0.3.5a4.tar.gz (4.6 kB)

Uploaded Source

Built Distribution

TextSpitter-0.3.5a4-py3-none-any.whl (5.3 kB)

Uploaded Python 3

File details

Details for the file TextSpitter-0.3.5a4.tar.gz.

File metadata

  • Download URL: TextSpitter-0.3.5a4.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6

File hashes

Hashes for TextSpitter-0.3.5a4.tar.gz
Algorithm    Hash digest
SHA256       4e835ef0678875d442e16f30d7172fad412eb8de88b11ca4105a00a543f1ca6c
MD5          0864d3a2c21a1f52586d15e6e5e94b5f
BLAKE2b-256  e227b6ee307815e32cce46c9f754252a006aa98a4ed5e360fcf74141fefeb545
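
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. A minimal sketch; `sha256_of` and `matches_published` are hypothetical helper names, and the file path is whatever location you saved the download to:

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Pass the SHA256 value listed for the file as `expected_hex`; a result of False means the download does not match the published digest.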

File details

Details for the file TextSpitter-0.3.5a4-py3-none-any.whl.

File metadata

  • Download URL: TextSpitter-0.3.5a4-py3-none-any.whl
  • Upload date:
  • Size: 5.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6

File hashes

Hashes for TextSpitter-0.3.5a4-py3-none-any.whl
Algorithm    Hash digest
SHA256       9d88f6695220422b918f5a7a2c718d878197a1ebc8fa26bd36b10a04fd5dde5c
MD5          fdeaf0488059deaba222e9e3bff2dcab
BLAKE2b-256  a0ac2bbacae58bb12435fa5a31e931eb9d84c8f0dea44f0711ffd5e2305eb011
