A multiformat text parser

A text parser that doesn't care about your file extensions

Parsa is a textract-based CLI text parser that supports multiple file formats. It takes any number of input files and writes the text extracted from each one to a .txt file in a directory of your choice, preserving the structure of the original text.
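
As a rough illustration of the idea, here is a minimal Python sketch of extracting text from one file and saving it as a .txt file. It is not parsa's actual code: `parse_to_txt` and its `extract` parameter are hypothetical names introduced for this example. By default it calls `textract.process`, which returns the extracted text as bytes; passing your own `extract` callable lets you try the sketch without textract installed.

```python
from pathlib import Path


def parse_to_txt(input_path, out_dir, extract=None):
    """Extract text from input_path and write it to <out_dir>/<stem>.txt.

    Hypothetical helper for illustration only, not part of parsa's API.
    """
    if extract is None:
        # textract is a third-party package; textract.process returns bytes.
        import textract
        extract = textract.process

    text = extract(str(input_path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Note: unlike parsa, this simple sketch overwrites an existing file.
    out_file = out_dir / (Path(input_path).stem + ".txt")
    out_file.write_bytes(text)
    return out_file
```

Something like `parse_to_txt("report.docx", "out")` would then be expected to produce `out/report.txt`, provided textract and the dependencies for that format are installed.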

Key features

  • Extends textract's functionality to handle multiple inputs and to save the output automatically
  • Takes an arbitrary number of inputs of different file types, and processes them all in the same way when supported
  • Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
  • Includes a naming system that never overwrites existing files; new output files are given a distinct, simple name instead
  • Supports over 20 of the most common formats (see Supported formats for more)
  • Preserves the structure of document file formats (.docx, .pdf, ...)
  • Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
  • Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
  • Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via --noprompt
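
The feature list does not spell out the exact naming scheme parsa uses to avoid overwriting files. The sketch below shows one common way to get that behaviour, assuming a numeric suffix (`report.txt`, `report_1.txt`, `report_2.txt`, ...). The function name `unique_output_path` and the suffix format are assumptions made for this example, not parsa's documented behaviour.

```python
from pathlib import Path


def unique_output_path(out_dir, stem, suffix=".txt"):
    """Return a path inside out_dir that does not exist yet.

    Hypothetical helper: the "_<n>" suffix is an assumed convention.
    """
    out_dir = Path(out_dir)
    candidate = out_dir / f"{stem}{suffix}"
    counter = 1
    # Keep trying numbered names until one is free.
    while candidate.exists():
        candidate = out_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```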

Supported formats

See this page from textract's documentation for a full list of the supported formats and their linked dependencies.

Installation

System requirements

  • Linux
  • Python 2.7 or any Python 3 version

Linux

Via pip:

$ pip install parsa

Or, if you prefer, you can install it from source:

# Clone the repository
$ git clone https://github.com/rdimaio/parsa

# Go into the parsa folder
$ cd parsa

# Install parsa
$ python setup.py install

Tests

$ python -m unittest discover tests

Usage

Single input

# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.

Multi input

# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.

Optional: custom output folder

# Custom output folder
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.
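
The default output locations described in the usage examples (parent folder for a single file, a `parsaoutput` subfolder for a folder, or whatever is passed via `-o`) can be summarised in a few lines of Python. This is only a sketch of the documented rules; `resolve_output_dir` is a hypothetical name and not necessarily how parsa implements it.

```python
from pathlib import Path

DEFAULT_FOLDER_NAME = "parsaoutput"  # folder name taken from the usage notes


def resolve_output_dir(input_path, output=None):
    """Pick the output folder following the rules described in the README.

    Hypothetical helper for illustration only.
    """
    input_path = Path(input_path)
    if output is not None:
        # An explicit -o/--output value always wins.
        return Path(output)
    if input_path.is_dir():
        # Folder input: write into <input_folder>/parsaoutput.
        return input_path / DEFAULT_FOLDER_NAME
    # File input: write next to the input file.
    return input_path.parent
```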

Optional: ignore files without an explicit extension

# Skip files without an explicit extension
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
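
To make the effect of `--noprompt` concrete, here is a small sketch of a recursive folder scan that skips files without an extension. The function name `iter_input_files` is hypothetical, and the prompting branch is deliberately left out: when `noprompt` is false, the sketch simply yields every file, whereas parsa would ask the user for the missing extension.

```python
from pathlib import Path


def iter_input_files(root, noprompt=False):
    """Yield every file under root, scanning subfolders as well.

    Hypothetical helper: with noprompt=True, files that have no
    extension (e.g. "syslog") are skipped.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        if noprompt and path.suffix == "":
            continue  # no explicit extension: ignore instead of prompting
        yield path
```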

Full help message

$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The default folder is:
                        (a) the input file's parent folder, if the input is a file, or
                        (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.
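
For readers who want to script around the CLI, the interface shown in the help message can be approximated with Python's standard `argparse` module. This is a sketch reconstructed from the help text above, not parsa's actual source; `build_parser` is a hypothetical name, and the shortened help strings are paraphrased.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options in the help message.

    Sketch only: reconstructed from the help text, not copied from parsa.
    """
    parser = argparse.ArgumentParser(
        prog="parsa",
        description="Textract-based text parser that supports most "
                    "text file extensions.",
    )
    parser.add_argument("input", help="input file or folder")
    parser.add_argument(
        "--noprompt", "-n", action="store_true",
        help="ignore files without an extension",
    )
    # nargs="?" matches the "[OUTPUT]" shown in the usage line.
    parser.add_argument(
        "--output", "-o", nargs="?", default=None,
        help="folder where the output files will be stored",
    )
    return parser
```

Calling `build_parser().parse_args(["docs", "-o", "out"])` yields a namespace with `input`, `noprompt` and `output` attributes that the rest of a script can consume.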

Related projects

  • parsa-gui - Graphical version of parsa (WIP)
  • xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
  • xparsa-gui - GUI for xparsa (WIP)

Contributing

Pull requests are welcome! If you would like to include/remove/change a major feature, please open an issue first.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source distribution: parsa-1.1.5.tar.gz (7.2 kB)
