off2txt: extract text from Office files

Extracts ASCII/Unicode text from Office files to separate files.

Useful if you have a document containing two languages (e.g. English and Chinese) and you want to separate the languages into text files for further processing and analysis.

Supports the Office Open XML file formats: docx, pptx, and xlsx.

Word and PowerPoint files are extracted to text files. Excel files are extracted to CSV files, with columns preserved.

Can be used to make a CSV file from Excel without opening Excel.
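
For a sense of what extraction from an Open XML file involves, here is a minimal sketch using only the Python standard library. It is not off2txt's own code. It relies on the fact that a docx file is a zip archive whose main text lives in word/document.xml; the function names docx_paragraphs and make_sample_docx are hypothetical, and the sample builder creates a stripped-down archive for demonstration only, not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> descendants.
    for para in root.iter("{%s}p" % W_NS):
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs


def make_sample_docx(paragraph_texts):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % text for text in paragraph_texts
    )
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_sample_docx(["Hello", "你好"])
paragraphs = docx_paragraphs(sample)
```

Real documents also carry text in headers, footers, and tables, which this sketch does not treat specially.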

Examples

Extract ASCII and Unicode Text From a Word Document

$ off2txt -s word.docx

The above will make two files: word-ascii.txt and word-unicode.txt

Extract ASCII and Unicode Text From an Excel Document

$ off2txt -s excel.xlsx

The above will make two files: excel-ascii.csv and excel-unicode.csv
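
The naming pattern in the two examples can be sketched in Python as follows. This is an illustration inferred from the examples and the option descriptions, not the tool's actual code. The function name split_output_names is hypothetical, and falling back to txt for unrecognised input extensions is an assumption.

```python
import os

# Default extensions as described in the usage text:
# txt for Word and PowerPoint, csv for Excel.
DEFAULT_EXTENSIONS = {".docx": "txt", ".pptx": "txt", ".xlsx": "csv"}


def split_output_names(input_path, ascii_id="ascii", unicode_id="unicode",
                       extension=None):
    """Return (ascii_name, unicode_name) for an input file when splitting."""
    stem, input_ext = os.path.splitext(input_path)
    if extension is None:
        # Assumption: unknown input types fall back to txt.
        extension = DEFAULT_EXTENSIONS.get(input_ext.lower(), "txt")
    ascii_name = "%s-%s.%s" % (stem, ascii_id, extension)
    unicode_name = "%s-%s.%s" % (stem, unicode_id, extension)
    return ascii_name, unicode_name


word_names = split_output_names("word.docx")
excel_names = split_output_names("excel.xlsx")
```

The ascii_id, unicode_id, and extension parameters stand in for the -a, -u, and -e options.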

Notes

If an extracted file would be empty, it is not created.

Excel is different: because columns are preserved, you may get a CSV file of empty columns. A cell is written to the ASCII file if it contains only ASCII characters; otherwise it is written to the Unicode file.
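
The cell-by-cell rule for Excel can be sketched like this. It is an illustration under stated assumptions, not off2txt's implementation: the names is_ascii, split_rows, and to_csv are hypothetical, and leaving an empty cell in the file that does not receive the value is an assumption based on the statement that columns are preserved.

```python
import csv
import io


def is_ascii(text):
    """True when every character in text is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def split_rows(rows):
    """Split rows of cell strings into (ascii_rows, unicode_rows)."""
    ascii_rows, unicode_rows = [], []
    for row in rows:
        # Keep one slot per column in both outputs so columns stay aligned.
        ascii_rows.append([cell if is_ascii(cell) else "" for cell in row])
        unicode_rows.append(["" if is_ascii(cell) else cell for cell in row])
    return ascii_rows, unicode_rows


def to_csv(rows):
    """Serialise rows to CSV text with newline line endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sheet = [["name", "名字"], ["Alice", "爱丽丝"]]
ascii_rows, unicode_rows = split_rows(sheet)
ascii_csv = to_csv(ascii_rows)
```

With a sheet that contains only ASCII cells, unicode_rows in this sketch holds nothing but empty strings, which corresponds to the CSV file of empty columns mentioned above.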

Usage

usage: off2txt [options] File [File ...]

off2txt: extract ASCII/Unicode text from Office files to separate files

positional arguments:
  File                  Files to extract from

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --debug               Turn on debug logging.
  --debug-log FILE      Save debug logging to FILE.
  -a EXTENSION, --ascii EXTENSION
                        Identifier to append to input file name to make ASCII
                        output file name when splitting Unicode and ASCII
                        text. Default ascii.
  -d DIRECTORY, --directory DIRECTORY
                        Save extracted text to DIRECTORY. Ignored if the -o
                        option is given.
  -e EXTENSION, --extension EXTENSION
                        Extension to use for extracted text files. Default for
                        Word and PowerPoint is txt. Default for Excel is csv.
  -o FILE, --output FILE
                        Save extracted text to FILE. If not given, the output
                        file is named the same as the input file but with a
                        txt extension. The extension can be changed with the
                        -e option. Files are opened in append mode unless the
                        -X option is given.
  -s, --split           Split ASCII and Unicode text into two separate files.
                        Unicode files are named by adding -unicode before the
                        file extension. The Unicode identifier can be changed
                        with the -u option.
  -u EXTENSION, --unicode EXTENSION
                        Identifier to append to input file name to make
                        Unicode output file name when splitting Unicode and
                        ASCII text. Default unicode.
  -A, --suppress-file-access-errors
                        Do not print file/directory access errors.
  -X, --overwrite-output-files
                        Truncate output files before writing.
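
The difference between the default append mode and the -X option can be sketched in Python as below. This is only an illustration of append versus truncate semantics, not the tool's code; the function name write_extracted is hypothetical and UTF-8 output encoding is an assumption.

```python
def write_extracted(path, text, overwrite=False):
    """Write text to path: append by default, truncate first when overwrite is set."""
    mode = "w" if overwrite else "a"
    # Assumption: output is written as UTF-8.
    with open(path, mode, encoding="utf-8") as handle:
        handle.write(text)
```

Because appending is the default, running the same extraction twice into one output file would leave two copies of the text unless the file is truncated first.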
