Python package that spits out text from your document files!

Project description

THANK YOU FOR USING TEXTSPITTER!!

I created this little app to help me process documents from folder sets and batches. Instead of trying to detect each file's type and process it accordingly, I thought it would be more prudent to read the file names and route each file to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work because of damn Poppler. After 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.

This is my first Python module, so I hope I did this well!
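The routing idea described above (look at the file name, then hand the file to the matching extractor) can be sketched roughly as follows. This is not TextSpitter's actual implementation; `read_txt`, `EXTRACTORS`, and `extract_text` are hypothetical names made up for illustration, and only a plain-text reader from the standard library is wired in.

```python
from pathlib import Path


def read_txt(path):
    """Read a plain-text file, replacing bytes that are not valid UTF-8."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical routing table: file extension -> extraction function.
# Readers for .docx or .pdf would be registered here in the same way.
EXTRACTORS = {
    ".txt": read_txt,
}


def extract_text(path, extractors=EXTRACTORS):
    """Pick an extractor based on the file extension and return the text."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = extractors[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return extractor(path)
```

Keeping the mapping in a dictionary means a new file type only needs one new entry, and unsupported extensions fail with a clear error instead of being guessed at.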

Directions

This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.

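import sqlite3

# NOTE: the original listing spelled the class "TexSpitter"; "TextSpitter" is
# assumed here to match the package name.
from TextSpitter import TextSpitter as TS

folder_loc = 'foo/bar/'

# doc_file = folder_loc + 'file_thing.doc'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'

doc_tup = (docx_file, pdf_file, text_file)
# doc_tup = (doc_file, docx_file, pdf_file, text_file)

# Open (or create) the database and make sure the target table exists.
# The single "content" column is an assumption; adjust it to your schema.
conn = sqlite3.connect('example_db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS doc_contents (content TEXT)')

# sqlite3 uses "?" placeholders, not "%s"
STMNT = 'INSERT INTO doc_contents VALUES (?)'

# Extract the text of each document and insert it into the db
for ele in doc_tup:
    text = TS(ele)
    c.execute(STMNT, (text,))
    print('Done!  Wrote the following to db: %s' % text[:25])

# Save the inserts and release the connection
conn.commit()
conn.close()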
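For whole folders, the loop above could be wrapped in a small helper along these lines. This is only a sketch under stated assumptions: `store_folder_text`, the `folder_contents` table, and its two columns are hypothetical, and `extract` stands in for whatever callable turns a file path into text (for example, the `TS` import above, assuming it returns a string).

```python
import sqlite3
from pathlib import Path


def store_folder_text(folder, db_path, extract,
                      suffixes=(".docx", ".pdf", ".txt")):
    """Extract text from each matching file in `folder` and store it in SQLite.

    `extract` is any callable that takes a path string and returns text.
    Returns the number of rows written.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS folder_contents "
                "(file_name TEXT, content TEXT)"
            )
            rows = [
                (p.name, extract(str(p)))
                for p in sorted(Path(folder).iterdir())
                if p.is_file() and p.suffix.lower() in suffixes
            ]
            # "?" placeholders let sqlite3 handle quoting of the text safely
            conn.executemany(
                "INSERT INTO folder_contents (file_name, content) VALUES (?, ?)",
                rows,
            )
    finally:
        conn.close()
    return len(rows)
```

A call might look like `store_folder_text('foo/bar/', 'example_db', TS)`. Passing the extractor in as an argument keeps the database code independent of any particular extraction library.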

TO DOs

  • push to github
  • Remove .doc support due to the legacy format's extensive proprietary requirements
  • spruce up documentation
  • solicit feedback
  • expand functionality to other file types
  • TBD

WANT TO CONTRIBUTE!?

OH MY GOD, PLEASE DO.

Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.

Thanks, everyone!

Download files

Source Distribution

TextSpitter-0.2.post0.tar.gz (3.1 kB)

Uploaded Source

Built Distribution

TextSpitter-0.2.post0-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file TextSpitter-0.2.post0.tar.gz.

File metadata

  • Download URL: TextSpitter-0.2.post0.tar.gz
  • Upload date:
  • Size: 3.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for TextSpitter-0.2.post0.tar.gz
Algorithm Hash digest
SHA256 466b03b2ee4794606398860e4413121a4851fe808b3585b0d9bec87117ae09c4
MD5 680bbee604983457d7f3f93dfab03dbf
BLAKE2b-256 2c6c0168c32625263c807a857dfb4aeec6c9d3f09267089ab584e26cd316f51c
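The SHA256 digest listed above can be compared against a downloaded copy of the archive. A minimal sketch, assuming the file has been saved locally; `sha256_of` is a hypothetical helper name and the local path in the usage comment is an assumption.

```python
import hashlib

# SHA256 digest for TextSpitter-0.2.post0.tar.gz, copied from the listing above
EXPECTED_SHA256 = "466b03b2ee4794606398860e4413121a4851fe808b3585b0d9bec87117ae09c4"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the archive sits in the current directory):
# sha256_of("TextSpitter-0.2.post0.tar.gz") == EXPECTED_SHA256
```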

File details

Details for the file TextSpitter-0.2.post0-py3-none-any.whl.

File metadata

  • Download URL: TextSpitter-0.2.post0-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for TextSpitter-0.2.post0-py3-none-any.whl
Algorithm Hash digest
SHA256 0792f3e0603ca0b8a52363b8caafdd16fbb377565956ec5339fa64acd21e0370
MD5 7caf7ab6bea54c7fc28992cadaff95e8
BLAKE2b-256 0e3352d7d62826fc55f2852647ab643bd6473e7f548d17a69269bbc2a31362d3
