Python package that spits out text from your document files!

Project description

THANK YOU FOR USING TEXTSPITTER!!

I created this little app to help me process documents from folder sets and batches. Instead of trying to detect each file's type and process it accordingly, I thought it would be more prudent to read the file names and route each file to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work because of damn Poppler. After 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
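
To make the routing idea concrete, here is a minimal sketch of dispatching on the file extension. It is not TextSpitter's actual source: the names `EXTRACTORS`, `read_txt`, and `extract_text` are hypothetical, and only plain-text files are handled so the example stays runnable without extra dependencies.

```python
from pathlib import Path


def read_txt(path):
    """Return the contents of a plain-text file."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def read_unsupported(path):
    """Fallback used when no extractor is registered for the extension."""
    raise ValueError(f"No extractor registered for {Path(path).suffix!r}")


# Hypothetical routing table: file extension -> extraction function.
# A real tool would add entries such as ".docx" and ".pdf" here.
EXTRACTORS = {
    ".txt": read_txt,
}


def extract_text(path):
    """Pick an extractor from the file name alone and run it."""
    suffix = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(suffix, read_unsupported)
    return extractor(path)
```

Supporting another format under this design means registering one more function in the table, which is the appeal of routing on names rather than sniffing file contents.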

This is my first Python module, so I hope I did this well!

Installation

  • Type pip install TextSpitter
  • OPTIONAL: type pip install PyMuPDF to install the Python-MuPDF engine for better text extraction fidelity (e.g., maintaining correct whitespace)
    • You will need to follow the instructions to ensure that PyMuPDF's dependencies install on your system. There are wheels and binaries available for Windows, Linux, and macOS, though if you're on something weird like NetBSD/FreeBSD/specialty Linux distros, you may be SOL. Fortunately, package managers like Yum, Pkgin, Apt-Get and so forth will have packages available straight from the terminal.
    • For detailed instructions, please visit https://github.com/rk700/PyMuPDF and maybe give those guys some kudos, because they worked their tails off.
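
If you want to confirm that the optional engine is actually importable after installing it, a quick check along these lines should do. This is a sketch, not part of TextSpitter: the helper name `pymupdf_available` is made up here, and it assumes PyMuPDF's historical import name `fitz` (newer releases also expose `pymupdf`).

```python
import importlib.util


def pymupdf_available():
    """Return True if a module named 'pymupdf' or 'fitz' can be found."""
    # Caveat: an unrelated PyPI package also installs a module named "fitz",
    # so a True result only shows that a module with that name is present.
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pymupdf", "fitz")
    )


print("PyMuPDF importable:", pymupdf_available())
```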

Directions

This module is designed to run as simply as possible. Just pass the file location in as a string argument, and get your text returned to you.

from TextSpitter import TextSpitter as TS
import sqlite3


folder_loc = 'foo/bar/'

# doc_file = folder_loc + 'file_thing.doc'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'

doc_tup = (docx_file, pdf_file, text_file)
# doc_tup = (doc_file, docx_file, pdf_file, text_file)

# SQL code to write to database
# (assumes a doc_contents table with a single text column already exists)
conn = sqlite3.connect('example_db')
c = conn.cursor()

# sqlite3 uses ? placeholders for parameters, not %s
STMNT = 'INSERT INTO doc_contents VALUES (?)'

# For Loop code to insert doc content into db
for ele in doc_tup:
    text = TS(ele)
    # One row per document; parameters are passed as a 1-tuple
    c.execute(STMNT, (text,))
    print('Done!  Wrote the following to db: %s' % text[:25])

# Save the inserts and close the connection
conn.commit()
conn.close()
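
The database step can also be tried on its own, without any document files. The sketch below uses an in-memory SQLite database and a stand-in function `fake_extract` (hypothetical, in place of the real extraction call); the two-column `doc_contents` schema is an assumption made for illustration, not something the package defines.

```python
import sqlite3


def fake_extract(path):
    """Hypothetical stand-in for the real text extraction call."""
    return f"text extracted from {path}"


# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Assumed schema: one row per document, storing its path and its text.
conn.execute("CREATE TABLE IF NOT EXISTS doc_contents (path TEXT, body TEXT)")

paths = ["foo/bar/file_thing.docx", "foo/bar/file_thing.pdf"]
rows = [(p, fake_extract(p)) for p in paths]

# Using the connection as a context manager commits on success
# and rolls back if an exception is raised.
with conn:
    conn.executemany(
        "INSERT INTO doc_contents (path, body) VALUES (?, ?)", rows
    )

count = conn.execute("SELECT COUNT(*) FROM doc_contents").fetchone()[0]
print("rows stored:", count)
conn.close()
```

Note that executemany expects a sequence of parameter tuples (one per row), while execute takes a single tuple.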

TO DOs

  • Push to GitHub
  • Remove .doc support due to the legacy format's extensive proprietary requirements
  • Spruce up documentation
  • Solicit feedback
  • Expand functionality to other file types
  • TBD

WANT TO CONTRIBUTE!?

OH MY GOD, PLEASE DO.

Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.

Thanks, everyone!

Download files

Source Distribution

TextSpitter-0.3.tar.gz (3.7 kB)

Uploaded Source

Built Distribution

TextSpitter-0.3-py3-none-any.whl (5.0 kB)

Uploaded Python 3

File details

Details for the file TextSpitter-0.3.tar.gz.

File metadata

  • Download URL: TextSpitter-0.3.tar.gz
  • Upload date:
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for TextSpitter-0.3.tar.gz
Algorithm Hash digest
SHA256 f07615970543cf5ea807837eb0a7ec58110c54776d7c9f9bcae4d14ea3ac33f8
MD5 a3c39be253f34568890a2ff86f7e985d
BLAKE2b-256 b36d79105006f37e4462f4b803e3b8ed3085332c2236ac7e340d0736f20af70a
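
If you download the source archive by hand, you can compare it against the SHA256 digest listed above. This is only a sketch: the helper names `sha256_of` and `matches_expected` are hypothetical, and it assumes you pass the path to wherever you saved the file.

```python
import hashlib

# SHA256 digest listed above for TextSpitter-0.3.tar.gz.
EXPECTED = "f07615970543cf5ea807837eb0a7ec58110c54776d7c9f9bcae4d14ea3ac33f8"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected("TextSpitter-0.3.tar.gz") checks a copy saved in the current directory.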

File details

Details for the file TextSpitter-0.3-py3-none-any.whl.

File metadata

  • Download URL: TextSpitter-0.3-py3-none-any.whl
  • Upload date:
  • Size: 5.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for TextSpitter-0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 4e2b8dfd8500fbc8cf1c67e175c4b51ebfa2164da60649d7852b09c7e44f8eb2
MD5 f5b1ad15b3d3601c67264ed33b740317
BLAKE2b-256 5257b18619bd07fce0a34a6135ea0056be7bd473755d2edf58114ec794d6d2ec
