Python package that spits out text from your document files!
Project description
THANK YOU FOR USING TEXTSPITTER!!
I created this little app to help me process documents from folder sets and batches. Instead of trying to detect each file's type and process it accordingly, I thought it would be more prudent to read the file names and route each one to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work because of damn Poppler. After 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
This is my first Python module, so I hope I did this well!
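The routing-by-file-name idea described above can be sketched roughly as follows. This is only an illustration, not TextSpitter's actual source: the names `EXTRACTORS`, `read_txt`, `unsupported`, and `extract_text` are hypothetical, and only a plain-text handler is shown so that the sketch needs nothing beyond the standard library.

```python
import os


def read_txt(path):
    """Return the contents of a plain-text file."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def unsupported(path):
    """Fallback used when no extractor matches the file extension."""
    raise ValueError("No extractor registered for %r" % path)


# Hypothetical routing table: file extension -> extraction function.
# Handlers for .doc, .docx, and .pdf would be registered the same way.
EXTRACTORS = {".txt": read_txt}


def extract_text(path):
    """Pick an extractor from the file name's extension and run it."""
    ext = os.path.splitext(path)[1].lower()
    handler = EXTRACTORS.get(ext, unsupported)
    return handler(path)
```

Keeping the mapping in a dictionary means a new file type only needs one more entry, rather than another branch in an if/elif chain.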
Directions
This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.
from TextSpitter import TextSpitter as TS
import sqlite3
folder_loc = 'foo/bar/'
doc_file = folder_loc + 'file_thing.doc'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'
doc_tup = (doc_file, docx_file, pdf_file, text_file)
# SQL code to write to database
# (assumes example_db already has a doc_contents table with a single text column)
conn = sqlite3.connect('example_db')
c = conn.cursor()
STMNT = 'INSERT INTO doc_contents VALUES (?)'
# For Loop code to insert doc content into db
# (TS(ele) is assumed to return the extracted text as a str)
for ele in doc_tup:
    text = TS(ele)
    c.execute(STMNT, (text,))
    print('Done! Wrote the following to db: %s' % text[:25])
# Save the inserted rows and release the connection
conn.commit()
conn.close()
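The example above writes to a `doc_contents` table without showing how that table is created. Here is a minimal sketch of the database side only, assuming a single `content` text column (the column name and the `store_texts` helper are made up for illustration) and using an in-memory database so it runs without any files or TextSpitter itself:

```python
import sqlite3


def store_texts(conn, texts):
    """Insert each extracted text as one row and return the table's row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS doc_contents (content TEXT)")
    # executemany expects a sequence of parameter tuples, one tuple per row
    conn.executemany(
        "INSERT INTO doc_contents VALUES (?)",
        [(t,) for t in texts],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM doc_contents").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Stand-in strings; in practice these would be the texts returned by TS(...)
    print(store_texts(conn, ["first document text", "second document text"]))
    conn.close()
```

The `?` placeholder lets sqlite3 handle quoting, so document text containing quotes or other special characters is stored safely.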
TO DOs
- push to github
- spruce up documentation
- solicit feedback
- expand functionality to other file types
- TBD
WANT TO CONTRIBUTE!?
OH MY GOD, PLEASE DO.
Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.
Thanks, everyone!
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file TextSpitter-0.1.tar.gz.
File metadata
- Download URL: TextSpitter-0.1.tar.gz
- Upload date:
- Size: 3.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | b485e237ae6277f221eb0b388e324e000e7025a279c38e7a4dbcce3c486eea06
MD5 | 69465e2f7445c85936589f94abce5406
BLAKE2b-256 | d585b1172c767bee502c42228b83ec4fa02ecfc4791376ddcb9f853e42e000e0
File details
Details for the file TextSpitter-0.1-py3-none-any.whl.
File metadata
- Download URL: TextSpitter-0.1-py3-none-any.whl
- Upload date:
- Size: 4.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.5.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8fd56d4a6253f42b657dfd031872e21f430788307421be93f903f5f5f7995743
MD5 | 68b82359b97179d21a5c345438952c7d
BLAKE2b-256 | 0382a810934583fa585d52674f4f427a7a73262ff59209d50031c21d52a8e831
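If you want to check a downloaded file against the SHA256 digests listed above, something like the following should work. It is a small sketch using only the standard library; the helper name `sha256_of` is made up here.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of('TextSpitter-0.1.tar.gz')` should equal the SHA256 digest listed for the source distribution, assuming the file was saved under that name in the working directory.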