Extract text content from many filetypes.

Project description

Extract text content from many file types in pure Python. This package extracts plain text from many office file formats and depends on only three external libraries, all of them pure Python. After extraction you get a list of words with the most common stop words removed (stop word lists are available for English and German only, selected with lang='en' or lang='de').

Install with: pip install TExtractor

Usage:

>>> from textractor import TExtractor
>>> extractor = TExtractor()
>>> extractor.index('test.docx', lang='en')
['workflow_history', 'portal_workflow', 'review_history',
 'implementation', 'organizations', 'Illustrations', ...]
>>> extractor.index('test.pdf', lang='en')
['workflow_history', 'portal_workflow', 'review_history',
 'implementation', 'organizations', 'Illustrations', ...]
>>>
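The stop word stripping described above can be illustrated with a minimal sketch. This is not TExtractor's own code: the function name `index_words`, the tiny `STOP_WORDS` lists, and the longest-first ordering (chosen only because the sample output above happens to look that way) are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop word lists (assumption); the package's real lists
# are not shown in the project description.
STOP_WORDS = {
    "en": {"the", "a", "an", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "das", "und", "ist", "in", "zu", "ein"},
}


def index_words(text, lang="en"):
    """Return the unique words of `text`, longest first, without stop words."""
    stop = STOP_WORDS[lang]
    seen = set()
    unique = []
    for word in re.findall(r"\w+", text):
        key = word.lower()
        # Skip stop words and words already collected (case-insensitive).
        if key in stop or key in seen:
            continue
        seen.add(key)
        unique.append(word)
    # sorted() is stable, so words of equal length keep their input order.
    return sorted(unique, key=len, reverse=True)


print(index_words("The review_history of the portal_workflow is an implementation"))
# ['portal_workflow', 'review_history', 'implementation']
```

The sketch works on a plain string; in the package itself the text would first have to be extracted from the .docx or .pdf file.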

Download files

Source Distribution

TExtractor-0.1.2.tar.gz (10.2 kB)

Uploaded Source

Built Distribution

TExtractor-0.1.2-py3-none-any.whl (12.5 kB)

Uploaded Python 3

File details

Details for the file TExtractor-0.1.2.tar.gz.

File metadata

  • Download URL: TExtractor-0.1.2.tar.gz
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for TExtractor-0.1.2.tar.gz
Algorithm Hash digest
SHA256 cd4e5af2eb6d343815f83d6c900d9390ed6ea518071aeeb7a6b0224d8f9a0a20
MD5 20550312e85a00fd6b839023db79463f
BLAKE2b-256 deb9c94be3c965497db0b59e1a9715b5b7e75a919056f1bfb5adc8ea6a2a37b4
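One way to check a downloaded archive against the SHA256 digest listed above is sketched below using the standard-library `hashlib` module. The helper names `sha256_of` and `matches_published` are illustrative and not part of TExtractor.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

With the source distribution saved locally, `matches_published("TExtractor-0.1.2.tar.gz", "cd4e5af2eb6d343815f83d6c900d9390ed6ea518071aeeb7a6b0224d8f9a0a20")` should return True for an unmodified download.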

File details

Details for the file TExtractor-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: TExtractor-0.1.2-py3-none-any.whl
  • Size: 12.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for TExtractor-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a4c7e77e786731000035064217f2ef9fbd8e7116bd497c3d46151229535d6c5a
MD5 2e05d552001e183e88a473a05f6265b3
BLAKE2b-256 c346ee0f03fb43dc117bad87bc5e20d3d70a32b7f944021aa2e6a0000a724d39
