Extractify

Extractify is a command-line tool for converting documents in various formats (.pdf, .doc, .docx, .xlsx, .txt) to plain text. The tool works with both local directories and S3 buckets. For local directories, the tool creates a 'txt' subdirectory within the specified input directory and saves the plain text files with the same filenames but with a .txt extension. For S3 buckets, it saves the plain text files in a 'txt' folder under the specified prefix.

Installation

Install Extractify using pip:

pip install extractify

Usage

Locally

To use Extractify with a local directory, run the following command:

extractify <input_dir>

Replace <input_dir> with the path to the directory containing the documents you want to convert.

In S3

To use Extractify with an S3 bucket, run the following command:

extractify s3://bucket-name/prefix

Replace bucket-name and prefix with the appropriate values for your S3 bucket.
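As an illustration of how an address of this form breaks down into a bucket name and a prefix, here is a minimal sketch using only the Python standard library. The function name split_s3_uri is hypothetical and is not part of Extractify; the tool's own parsing may differ.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Hypothetical helper: split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    # The bucket name lands in the network-location part of the URI.
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 address: {uri!r}")
    # The prefix is the path without its leading slash; it may be empty.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://bucket-name/prefix")
print(bucket, prefix)
```

Rejecting anything without the s3 scheme is one way to tell a bucket address apart from a local directory path.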

Omit PDF files from the process

To skip PDF files during conversion, add the --omit-pdf flag:

extractify <input_dir or s3_bucket_address> --omit-pdf

Output

For a local directory, Extractify creates a 'txt' subdirectory inside the input directory and saves the plain text files there. For an S3 bucket, it saves them in a 'txt' folder under the specified prefix.
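The naming convention can be sketched as follows. This is an illustration only, not Extractify's code: the helper names are hypothetical, and it assumes that the original extension is replaced by .txt (report.pdf becomes report.txt) rather than appended to.

```python
from pathlib import Path, PurePosixPath


def local_output_path(input_dir: str, filename: str) -> Path:
    """Hypothetical helper: <input_dir>/<name>.<ext> -> <input_dir>/txt/<name>.txt."""
    return Path(input_dir) / "txt" / (Path(filename).stem + ".txt")


def s3_output_key(prefix: str, key: str) -> str:
    """Hypothetical helper: <prefix>/<name>.<ext> -> <prefix>/txt/<name>.txt."""
    # S3 keys always use forward slashes, so use POSIX path rules.
    name = PurePosixPath(key).stem + ".txt"
    return str(PurePosixPath(prefix) / "txt" / name)


print(local_output_path("docs", "report.pdf"))
print(s3_output_key("prefix", "prefix/report.docx"))
```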

Supported Formats

Extractify currently supports the following document formats:

  • .pdf
  • .doc
  • .docx
  • .xlsx
  • .txt
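To preview which files in a directory would be picked up, a filter along these lines can be used. It is a sketch under assumptions: select_files is a hypothetical name, the omit_pdf parameter mirrors the --omit-pdf flag, and matching extensions case-insensitively is a choice made here, not documented behavior of Extractify.

```python
from pathlib import Path

# Extensions listed in the Supported Formats section.
SUPPORTED = {".pdf", ".doc", ".docx", ".xlsx", ".txt"}


def select_files(names, omit_pdf=False):
    """Hypothetical helper: keep only names with a supported extension."""
    allowed = SUPPORTED - {".pdf"} if omit_pdf else SUPPORTED
    # Assumption: compare extensions case-insensitively.
    return [n for n in names if Path(n).suffix.lower() in allowed]


names = ["a.pdf", "b.DOCX", "c.png", "d.xlsx"]
print(select_files(names))
print(select_files(names, omit_pdf=True))
```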

Dependencies

Extractify requires the following Python libraries:

  • tika
  • openpyxl
  • argparse (included in the Python standard library)
  • tqdm
  • boto3

