A multiformat text parser
A text parser that doesn't care about your file extensions
Key Features • Supported Formats • Installation • Usage • Related projects • Contributing • MIT License
Parsa is a textract-based CLI text parser that supports multiple file extensions. It takes any number of inputs, and outputs them to .txt files in a directory of choice, preserving the structure of the original text.
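As a rough illustration of the idea (not parsa's actual source), the sketch below wraps textract's `process` function, which returns the extracted text as bytes, and writes the result to a `.txt` file. The function name `parse_to_txt` and the injectable `extract` parameter are assumptions made for this example. Unlike parsa, this sketch overwrites an existing output file.

```python
from pathlib import Path
from typing import Callable, Optional


def parse_to_txt(
    input_path: str,
    output_dir: str,
    extract: Optional[Callable[[str], bytes]] = None,
) -> Path:
    """Extract text from input_path and write it to <output_dir>/<stem>.txt.

    Hypothetical helper: `extract` defaults to textract.process, but any
    callable that maps a file path to bytes can be passed instead.
    """
    if extract is None:
        # Imported lazily so the sketch can be tried without textract installed.
        import textract

        extract = textract.process  # returns the extracted text as bytes

    source = Path(input_path)
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Note: this simple version overwrites an existing file with the same name.
    target = target_dir / (source.stem + ".txt")
    target.write_bytes(extract(str(source)))
    return target
```

Calling `parse_to_txt("path/to/input_file", "path/to/output_folder")` would then rely on textract and its format-specific dependencies being installed.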
- Extends textract's functionalities to work with multiple inputs and to automatically save the output
- Takes an arbitrary number of inputs of different filetypes, and processes them all equally when supported
- Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
- Includes a naming system that always avoids overwriting existing files, instead naming new files in a simple manner
- Supports over 20 of the most common formats (see Supported formats for more)
- Preserves the structure of document file formats (.docx, .pdf, ...)
- Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
- Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
- Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via
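The non-overwriting naming system mentioned in the list above could work along these lines. This is a minimal sketch under assumptions: the helper name `unique_output_path` and the `_1`, `_2`, ... suffix scheme are invented for illustration, and parsa's real naming convention may differ.

```python
from pathlib import Path


def unique_output_path(output_dir: str, stem: str, suffix: str = ".txt") -> Path:
    """Return a path inside output_dir that does not clash with an existing file.

    Assumed scheme (hypothetical): report.txt, report_1.txt, report_2.txt, ...
    """
    directory = Path(output_dir)
    candidate = directory / (stem + suffix)
    counter = 1
    # Keep bumping the numeric suffix until an unused name is found.
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

The function only picks a name; the caller is still responsible for creating the file.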
See this page from textract's documentation for a full list of the supported formats and their linked dependencies.
- Python 2.7/3.x (any Python 3 version)
$ pip install parsa
Or, if you prefer, you can install it from source:
# Clone the repository
$ git clone https://github.com/rdimaio/parsa
# Go into the parsa folder
$ cd parsa
# Install parsa
$ python setup.py install
$ python -m unittest discover tests
# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.

# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.
Optional: custom output folder
# Basic usage
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.
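The output location rules described so far (an explicit output folder if given, otherwise the input file's parent folder, or a `parsaoutput` folder inside an input folder) can be sketched as a small function. The name `resolve_output_dir` is hypothetical and this is not taken from parsa's source.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(input_path: str, output: Optional[str] = None) -> Path:
    """Pick the folder where the .txt output would be stored (illustrative only)."""
    if output is not None:
        # An explicit output folder always wins.
        return Path(output)

    source = Path(input_path)
    if source.is_dir():
        # Folder input: use a 'parsaoutput' folder inside the input folder.
        return source / "parsaoutput"
    # File input: use the file's parent folder.
    return source.parent
```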
Optional: ignore files without an explicit extension
# Basic usage
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
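To make the effect of `--noprompt` on a folder input more concrete, here is a hedged sketch of how extensionless files might be filtered while walking a folder recursively. The function `iter_input_files` is an assumption for this example. It treats dotfiles such as `.bashrc` as having no extension, which is how `os.path.splitext` behaves, and parsa may decide this differently.

```python
import os
from typing import Iterator


def iter_input_files(folder: str, noprompt: bool = False) -> Iterator[str]:
    """Yield every file under folder, optionally skipping extensionless files."""
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            has_extension = os.path.splitext(name)[1] != ""
            if noprompt and not has_extension:
                # With noprompt, files without an extension are ignored
                # instead of asking the user what their format is.
                continue
            yield os.path.join(root, name)
```

Without `noprompt`, the extensionless files are still yielded, and a caller would then have to ask the user for their extension.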
Full help message
$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The
                        default folder is: (a) the input file's parent folder,
                        if the input is a file, or (b) a folder named
                        'parsaoutput' located in the input folder, if the
                        input is a folder.
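For readers curious how a command line like this can be declared, the following is a minimal `argparse` sketch that accepts the same arguments as the help message. It is a reconstruction from the help text, not parsa's actual code, and the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the arguments listed in the help message."""
    parser = argparse.ArgumentParser(
        prog="parsa",
        description="Textract-based text parser that supports most text file extensions.",
    )
    parser.add_argument("input", help="input file or folder")
    parser.add_argument(
        "--noprompt", "-n",
        action="store_true",
        help="ignore files without an extension and don't prompt for one",
    )
    parser.add_argument(
        "--output", "-o",
        nargs="?",  # shown as [OUTPUT] in the usage line
        default=None,
        help="folder where the output files will be stored",
    )
    return parser
```

For example, `build_parser().parse_args(["-n", "-o", "out", "docs"])` yields `noprompt=True`, `output="out"` and `input="docs"`.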
- parsa-gui - Graphical version of parsa (WIP)
- xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
- xparsa-gui - GUI for xparsa (WIP)
Pull requests are welcome! If you would like to include/remove/change a major feature, please open an issue first.
This project is licensed under the MIT License - see the LICENSE file for details.