Extractify

Extractify is a command-line tool for converting documents in various formats (.pdf, .doc, .docx, .xlsx, .txt) to plain text. The tool works with both local directories and S3 buckets. For local directories, the tool creates a 'txt' subdirectory within the specified input directory and saves the plain text files with the same filenames but with a .txt extension. For S3 buckets, it saves the plain text files in a 'txt' folder under the specified prefix.

Installation

Install Extractify using pip:

pip install extractify

Usage

Locally

To use Extractify with a local directory, run the following command:

extractify <input_dir>

Replace <input_dir> with the path to the directory containing the documents you want to convert.

In S3

To use Extractify with an S3 bucket, run the following command:

extractify s3://bucket-name/prefix

Replace bucket-name and prefix with the appropriate values for your S3 bucket.
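As an illustration of how an address of this form breaks down into a bucket name and a prefix, here is a minimal sketch using only the Python standard library. The helper name split_s3_uri is hypothetical and is not part of Extractify; how the tool itself parses the address is not documented here.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix).

    Hypothetical helper for illustration; not Extractify's own code.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path component starts with '/', which is not part of the prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://bucket-name/prefix")
print(bucket, prefix)  # prints "bucket-name prefix"
```

A plain local path such as /tmp/docs has no s3 scheme, so the helper raises ValueError, which is one simple way to tell the two kinds of input apart.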

Omit PDF files from the process

To skip PDF files during conversion, add the --omit-pdf flag:

extractify <input_dir or s3_bucket_address> --omit-pdf
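The effect of the flag can be pictured with the sketch below. It is an assumption about the behavior, not Extractify's actual source: the parser layout, the SUPPORTED set, and the select_files helper are all hypothetical names introduced for illustration.

```python
import argparse
from pathlib import PurePosixPath

# Extensions listed as supported in this README.
SUPPORTED = {".pdf", ".doc", ".docx", ".xlsx", ".txt"}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented command line.
    parser = argparse.ArgumentParser(prog="extractify")
    parser.add_argument("input", help="local directory or s3://bucket/prefix")
    parser.add_argument("--omit-pdf", action="store_true", help="skip .pdf files")
    return parser


def select_files(names, omit_pdf=False):
    """Keep only names with a supported extension, optionally dropping PDFs."""
    allowed = SUPPORTED - {".pdf"} if omit_pdf else SUPPORTED
    return [n for n in names if PurePosixPath(n).suffix.lower() in allowed]


args = build_parser().parse_args(["docs", "--omit-pdf"])
print(select_files(["a.pdf", "b.docx", "c.png"], omit_pdf=args.omit_pdf))
# prints ['b.docx']
```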

Output

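For a local directory, Extractify creates a 'txt' subdirectory within the input directory and saves the plain text files there. For an S3 bucket, the plain text files are saved in a 'txt' folder under the specified prefix.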
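To make the output layout concrete, the sketch below maps an input file name to its expected output location. It assumes the original extension is replaced by .txt (for example report.pdf becomes report.txt); the tool may instead append the extension. The function names are hypothetical and are not part of Extractify.

```python
from pathlib import Path, PurePosixPath


def local_output_path(input_dir: str, filename: str) -> Path:
    """Path of the text file for a document in a local input directory."""
    # Assumption: the source extension is replaced, not appended to.
    return Path(input_dir) / "txt" / (Path(filename).stem + ".txt")


def s3_output_key(prefix: str, key: str) -> str:
    """Object key of the text file for a document stored under an S3 prefix."""
    name = PurePosixPath(key).stem + ".txt"
    return str(PurePosixPath(prefix) / "txt" / name)


print(local_output_path("docs", "report.pdf").as_posix())  # prints docs/txt/report.txt
print(s3_output_key("prefix", "prefix/report.docx"))       # prints prefix/txt/report.txt
```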

Supported Formats

Extractify currently supports the following document formats:

  • .pdf
  • .doc
  • .docx
  • .xlsx
  • .txt

Dependencies

Extractify requires the following Python libraries:

  • tika
  • openpyxl
  • argparse (part of the Python standard library)
  • tqdm
  • boto3
