
A multiformat text parser


A text parser that doesn't care about your file extensions


Parsa is a textract-based CLI text parser that supports multiple file formats. It takes any number of input files and writes the text extracted from each one to a .txt file in a directory of your choice, preserving the structure of the original text.

Key features

  • Extends textract's functionality to work with multiple inputs and to save the output automatically
  • Takes an arbitrary number of inputs of different file types and processes every supported file in the same way
  • Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
  • Includes a naming system that never overwrites existing files; a new output file gets a simple, distinct name instead
  • Supports over 20 of the most common formats (see Supported formats for more)
  • Preserves the structure of document file formats (.docx, .pdf, ...)
  • Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
  • Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
  • Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via --noprompt
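
The naming scheme itself is not documented here, so the following is only a minimal sketch of how overwriting could be avoided, assuming a numeric suffix is appended whenever the target name is taken. The helper name `unique_txt_path` and the `_1`, `_2` suffix style are hypothetical and may not match what parsa actually does.

```python
import os


def unique_txt_path(output_dir, stem):
    """Return a .txt path inside output_dir that does not exist yet.

    Hypothetical helper: the suffix format is an assumption, not parsa's
    documented behaviour.
    """
    candidate = os.path.join(output_dir, stem + ".txt")
    counter = 1
    # Keep trying stem_1.txt, stem_2.txt, ... until a free name is found.
    while os.path.exists(candidate):
        candidate = os.path.join(output_dir, "{}_{}.txt".format(stem, counter))
        counter += 1
    return candidate
```

The check-then-create approach is simple but not safe against two processes writing to the same folder at once; that trade-off seems acceptable for a single-user CLI.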

Supported formats

See this page from textract's documentation for a full list of the supported formats and their linked dependencies.

Installation

System requirements

  • Linux
  • Python 2.7 or any Python 3 version

Linux

Via pip:

$ pip install parsa

Or, if you prefer, you can install it from source:

# Clone the repository
$ git clone https://github.com/rdimaio/parsa

# Go into the parsa folder
$ cd parsa

# Install parsa
$ python setup.py install

Tests

$ python -m unittest discover tests

Usage

Single input

# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.
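
For readers curious about what the single-input case amounts to, here is a rough Python sketch, not parsa's actual source. It assumes the text comes from `textract.process` (which returns bytes) and that the result is written next to the input file. The function name `parse_to_txt` and its `extract` parameter are hypothetical; `extract` exists only so the sketch can be tried without textract installed.

```python
import os


def parse_to_txt(input_path, output_dir=None, extract=None):
    """Extract text from input_path and write it to <stem>.txt.

    Hypothetical helper. `extract` is any callable mapping a path to bytes;
    it defaults to textract.process when left as None.
    """
    if extract is None:
        import textract  # third-party dependency, imported lazily
        extract = textract.process
    if output_dir is None:
        # Default described above: the input file's parent folder.
        output_dir = os.path.dirname(os.path.abspath(input_path))
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    stem = os.path.splitext(os.path.basename(input_path))[0]
    output_path = os.path.join(output_dir, stem + ".txt")
    # Note: unlike parsa, this sketch overwrites an existing file.
    with open(output_path, "wb") as handle:
        handle.write(extract(input_path))
    return output_path
```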

Multi input

# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.
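
The folder case can be pictured with a short sketch, assuming a plain recursive walk and the default locations stated in the help message. The function names `default_output_dir` and `collect_inputs` are illustrative only and are not taken from parsa's code.

```python
import os


def default_output_dir(input_path):
    """Pick the default output folder for a file or folder input."""
    if os.path.isdir(input_path):
        # Folder input: a 'parsaoutput' folder inside the input folder.
        return os.path.join(input_path, "parsaoutput")
    # File input: the input file's parent folder.
    return os.path.dirname(os.path.abspath(input_path))


def collect_inputs(input_folder):
    """Return every file under input_folder, subfolders included, sorted."""
    found = []
    for root, _dirs, files in os.walk(input_folder):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)
```

A real implementation would probably also want to skip the `parsaoutput` folder on later runs so that earlier output is not parsed again; this sketch does not handle that.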

Optional: custom output folder

# Basic usage
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.

Optional: ignore files without an explicit extension

# Basic usage
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
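
What counts as "without an explicit extension" is not defined precisely here. One plausible reading, sketched below with the standard library's `os.path.splitext`, treats names such as `syslog` and dotfiles such as `.bashrc` as extensionless. Both function names are hypothetical.

```python
import os


def has_explicit_extension(path):
    """Return True if the file name carries an extension such as '.pdf'."""
    return os.path.splitext(os.path.basename(path))[1] != ""


def drop_extensionless(paths):
    """Keep only the paths a --noprompt style run would still process."""
    return [p for p in paths if has_explicit_extension(p)]
```

For example, `drop_extensionless(["docs/report.pdf", "var/log/syslog", ".bashrc"])` keeps only `docs/report.pdf`.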

Full help message

$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The default folder is:
                        (a) the input file's parent folder, if the input is a file, or
                        (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.
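
An interface with this shape can be reproduced with the standard library's `argparse`. The snippet below is a sketch inferred from the help message alone, not a copy of parsa's source; the function name `build_parser` and the shortened help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help message."""
    parser = argparse.ArgumentParser(
        prog="parsa",
        description="Textract-based text parser (sketch of the CLI only).",
    )
    parser.add_argument("input", help="input file or folder")
    parser.add_argument(
        "--noprompt", "-n",
        action="store_true",
        help="ignore files without an extension instead of prompting",
    )
    parser.add_argument(
        "--output", "-o",
        nargs="?",      # matches the '[OUTPUT]' shown in the usage line
        default=None,   # None means: fall back to the default folder
        help="folder where the output files will be stored",
    )
    return parser
```

Calling `build_parser().parse_args(["-n", "docs", "-o", "out"])` yields `input="docs"`, `noprompt=True` and `output="out"`.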

Related projects

  • parsa-gui - Graphical version of parsa (WIP)
  • xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
  • xparsa-gui - GUI for xparsa (WIP)

Contributing

Pull requests are welcome! If you would like to add, remove, or change a major feature, please open an issue first.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Files for parsa, version 1.1.5: parsa-1.1.5.tar.gz (7.2 kB), source distribution
