A multiformat text parser
A text parser that doesn't care about your file extensions
Key Features • Supported Formats • Installation • Usage • Related projects • Contributing • MIT License
Parsa is a textract-based CLI text parser that supports multiple file extensions. It takes any number of inputs, and outputs them to .txt files in a directory of choice, preserving the structure of the original text.
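The core idea described above (extract text from one file, write it to a matching .txt file) can be sketched roughly as follows. This is an illustrative sketch for Python 3, not parsa's actual source: the function name `parse_to_txt` and its `extract` parameter are hypothetical. The only real library call assumed is `textract.process`, which returns the extracted text as bytes.

```python
import os


def parse_to_txt(input_path, output_dir, extract=None):
    """Extract text from input_path and write it to a .txt file in output_dir.

    Hypothetical helper for illustration; parsa's real code may differ.
    """
    if extract is None:
        # Imported lazily so the sketch can be tried without textract installed.
        import textract
        extract = textract.process  # returns the extracted text as bytes
    text = extract(input_path)
    stem = os.path.splitext(os.path.basename(input_path))[0]
    os.makedirs(output_dir, exist_ok=True)  # exist_ok requires Python 3
    out_path = os.path.join(output_dir, stem + ".txt")
    # Note: this simple version overwrites an existing file of the same name.
    with open(out_path, "wb") as fh:
        fh.write(text)
    return out_path
```

Passing a different callable as `extract` makes it possible to try the flow without any of textract's system dependencies.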
Key features
- Extends textract's functionalities to work with multiple inputs and to automatically save the output
- Takes an arbitrary number of inputs of different filetypes, and processes them all equally when supported
- Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
- Includes a naming system that never overwrites existing files; an output whose name is already taken is given a new, simple name instead
- Supports over 20 of the most common formats (see Supported formats for more)
- Preserves the structure of document file formats (.docx, .pdf, ...)
- Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
- Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
- Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via --noprompt
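One way a naming system that avoids overwriting existing files might work is sketched below. The helper name `unique_output_path` and the numeric `_1`, `_2` suffix pattern are assumptions made for illustration; the exact names parsa generates are not documented here.

```python
import os


def unique_output_path(output_dir, stem, ext=".txt"):
    """Return a path inside output_dir that does not clash with an existing file.

    Hypothetical helper: the numeric-suffix pattern is an assumption,
    not necessarily the scheme parsa uses.
    """
    candidate = os.path.join(output_dir, stem + ext)
    counter = 1
    # Keep increasing the suffix until the name is free.
    while os.path.exists(candidate):
        candidate = os.path.join(output_dir, "{}_{}{}".format(stem, counter, ext))
        counter += 1
    return candidate
```

With this approach, parsing `report.pdf` twice into the same folder would leave both `report.txt` and `report_1.txt` in place.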
Supported formats
See this page from textract's documentation for a full list of the supported formats and their linked dependencies.
Installation
System requirements
- Linux
- Python 2.7/3.x (any Python 3 version)
Linux
Via pip:
$ pip install parsa
Or, if you prefer, you can install it from source:
# Clone the repository
$ git clone https://github.com/rdimaio/parsa
# Go into the parsa folder
$ cd parsa
# Install parsa
$ python setup.py install
Tests
$ python -m unittest discover tests
Usage
Single input
# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.
Multi input
# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.
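For readers curious how a folder input could be expanded into individual files, a recursive scan might look roughly like this. It is only a sketch: `collect_input_files` is a hypothetical name, and skipping folders named `parsaoutput` is an assumption about how earlier results would be kept out of a new run.

```python
import os


def collect_input_files(input_dir, output_dirname="parsaoutput"):
    """Return every file under input_dir, scanning subfolders as well."""
    found = []
    for root, dirs, files in os.walk(input_dir):
        # Assumption: skip any folder named like the output folder so that
        # previously generated .txt files are not parsed again.
        dirs[:] = [d for d in dirs if d != output_dirname]
        for name in sorted(files):
            found.append(os.path.join(root, name))
    return found
```

Assigning to `dirs[:]` in place is what tells `os.walk` not to descend into the pruned folders.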
Optional: custom output folder
# Basic usage
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.
Optional: ignore files without an explicit extension
# Basic usage
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
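The difference between prompting and --noprompt can be illustrated with a small sketch. The function `resolve_extension` and its `ask` parameter are hypothetical, the prompt text is invented, and the code assumes Python 3 (where `input` returns a string).

```python
import os


def resolve_extension(path, noprompt=False, ask=input):
    """Return the extension to parse `path` with, or None to skip the file.

    Hypothetical helper; parsa's real prompt and behaviour may differ.
    """
    ext = os.path.splitext(path)[1]
    if ext:
        return ext
    if noprompt:
        # --noprompt: ignore files without an explicit extension.
        return None
    answer = ask("Extension for {} (blank to skip): ".format(path)).strip()
    if not answer:
        return None
    # Accept both "pdf" and ".pdf".
    return answer if answer.startswith(".") else "." + answer
```

Injecting `ask` keeps the interactive part replaceable, which makes the logic easy to exercise in tests.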
Full help message
$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The default folder is:
                        (a) the input file's parent folder, if the input is a file, or
                        (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.
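The options and the default-folder rules (a) and (b) from the help message can be approximated with the standard `argparse` module. This is a sketch rather than parsa's actual source: the function names `build_parser` and `resolve_output_dir` are hypothetical and the help strings are shortened.

```python
import argparse
import os


def build_parser():
    """Build a parser with the same options as the help message above."""
    parser = argparse.ArgumentParser(prog="parsa")
    parser.add_argument("input", help="input file or folder")
    parser.add_argument("--noprompt", "-n", action="store_true",
                        help="ignore files without an extension")
    parser.add_argument("--output", "-o", nargs="?", default=None,
                        help="folder where the output files will be stored")
    return parser


def resolve_output_dir(input_path, output=None):
    """Apply default-folder rules (a) and (b) when no output folder is given."""
    if output:
        return output
    if os.path.isdir(input_path):
        # (b) folder input: use a 'parsaoutput' folder inside it.
        return os.path.join(input_path, "parsaoutput")
    # (a) file input: use its parent folder; dirname of a bare filename
    # is "", so fall back to the current folder.
    return os.path.dirname(input_path) or "."
```

A caller would combine the two as `args = build_parser().parse_args()` followed by `resolve_output_dir(args.input, args.output)`.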
Related projects
- parsa-gui - Graphical version of parsa (WIP)
- xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
- xparsa-gui - GUI for xparsa (WIP)
Contributing
Pull requests are welcome! If you would like to include/remove/change a major feature, please open an issue first.
License
This project is licensed under the MIT License - see the LICENSE file for details.