Arbitrary transliterations on Microsoft Office documents

convertextract

========

Extract text from documents and apply find/replace based on arbitrary correspondences. This library is a fork of the Textract library by Dean Malmgren (https://github.com/deanmalmgren/textract).

Documentation

Installation

To install, you must have Python 3.4+ and pip installed.

pip install convertextract

Depending on your operating system, some additional libraries must be installed to support certain file formats. Visit http://textract.readthedocs.org/en/latest/installation.html for documentation.

Basic CLI Use

Some basic Textract functions are preserved. Please visit http://textract.readthedocs.org for documentation.

Converting a file based on pre-existing Mappings in the G2P library

Under the hood, convertextract uses the g2p library (https://github.com/roedoejet/g2p) to do conversions. Many mappings are available through that library. For a list of all possible mappings, please visit https://g2pstudio-herokuapp.com/api/v1/langs.

For this type of call, convertextract requires three arguments:

  1. A file containing text to convert (as of Version 1.0.4, this includes .pptx, .docx, .xlsx, and .txt)
  2. A code corresponding to the input language of the text.
  3. A code corresponding to the desired output language of the text.

Running the command:

convertextract path/to/foo.docx -il eng-ipa -ol eng-arpabet

will produce a new file, path/to/foo_converted.docx, with the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping between English IPA (eng-ipa) and English Arpabet (eng-arpabet).
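The output naming described above can be sketched in Python. This is only an illustration of the foo.docx to foo_converted.docx pattern stated in this README; converted_path is a hypothetical helper, not part of convertextract.

```python
from pathlib import Path


def converted_path(source):
    """Return the path where, per this README, the converted copy is written.

    Hypothetical helper: it only mirrors the "<name>_converted<ext>" pattern.
    """
    path = Path(source)
    # Keep the directory and extension, append "_converted" to the base name.
    return path.with_name(f"{path.stem}_converted{path.suffix}")


print(converted_path("path/to/foo.docx"))
```

A helper like this can be handy in scripts that need to locate the converted file after running the command.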

Converting a file based on custom mapping

If the mapping you want is not supported by g2p, you should make a pull request there to have it included! Otherwise, you can use a custom file.

Running the command:

convertextract path/to/foo.docx -m path/to/rules.csv

will produce a new file, path/to/foo_converted.docx, with the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping at path/to/rules.csv.

Creating an .xlsx/.csv/.psv/.tsv correspondence sheet

Your correspondence sheet must be set up as follows:

in out
aa å
oe ø
ae æ

This correspondence sheet would replace all instances of aa, oe, or ae in a given file with å, ø, or æ respectively. Do not include headers like "replace with" or "find" in the sheet.
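If you prefer to generate the sheet from a script, a minimal sketch with Python's standard csv module could look like the following. It assumes that the "in out" row in the table above is only a column label and is not written to the file, in line with the note about headers; write_sheet and ROWS are hypothetical names.

```python
import csv
import io

# The three correspondences from the table above: (find, replace with).
ROWS = [("aa", "å"), ("oe", "ø"), ("ae", "æ")]


def write_sheet(rows, handle):
    """Write one correspondence per line, with no header row (assumption)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_sheet(ROWS, buffer)
sheet_text = buffer.getvalue()
print(sheet_text)
```

To write a real file instead, open it with open("rules.csv", "w", newline="", encoding="utf-8") and pass the handle to write_sheet.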

Supported conversions

As of Version 3.0, any mappings that are valid in the g2p library are supported. Here are a few:

  • Heiltsuk Doulos Font -> Unicode
convertextract path/to/foo.docx -il hei -ol hei-doulos
  • Heiltsuk Times Font -> Unicode
convertextract path/to/foo.docx -il hei -ol hei-times
  • Tsilhqot'in Doulos Font -> Unicode
convertextract path/to/foo.docx -il clc -ol clc-doulos
  • Navajo Times Font -> Unicode
convertextract path/to/foo.docx -il nav -ol nav-times
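To convert several files with the same mapping, the commands above can be assembled from a script. The sketch below only uses the command name and the -il, -ol, and -m options shown in this README; build_command and convert_all are hypothetical helpers, and convert_all assumes the convertextract command is available on your PATH.

```python
import subprocess


def build_command(path, input_lang=None, output_lang=None, mapping=None):
    """Build the argument list for one convertextract call."""
    command = ["convertextract", str(path)]
    if mapping is not None:
        # Custom correspondence sheet, as in "-m path/to/rules.csv".
        command += ["-m", str(mapping)]
    elif input_lang and output_lang:
        # Pre-existing g2p mapping, as in "-il hei -ol hei-doulos".
        command += ["-il", input_lang, "-ol", output_lang]
    else:
        raise ValueError("give either a mapping file or both language codes")
    return command


def convert_all(paths, **options):
    """Run the command once per file; raises if a call exits with an error."""
    for path in paths:
        subprocess.run(build_command(path, **options), check=True)


print(build_command("path/to/foo.docx", "hei", "hei-doulos"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.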

Using Regular Expressions

As of Version 1.5, regular expressions are supported. If you do not need context-sensitive conversions, you can omit the context columns. If you do, set up your correspondence sheet as follows:

in    out   context_before   context_after
aa    å     [k,d]            $
aa    æ     t                $
aa    a:

For more information on how g2p actually processes mappings, please visit https://github.com/roedoejet/g2p.
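One way to read the context columns is as regular-expression lookarounds: context_before must match immediately before the input and context_after immediately after it. The sketch below illustrates that reading with Python's re module. It is not how g2p itself is implemented; rule_to_pattern and apply_rules are hypothetical names, and applying the rules one after another in sheet order is an assumption.

```python
import re

# The three rules from the sheet above; missing contexts are left out.
RULES = [
    {"in": "aa", "out": "å", "context_before": "[k,d]", "context_after": "$"},
    {"in": "aa", "out": "æ", "context_before": "t", "context_after": "$"},
    {"in": "aa", "out": "a:"},
]


def rule_to_pattern(rule):
    """Turn one rule into a regex with lookbehind/lookahead contexts."""
    before = rule.get("context_before", "")
    after = rule.get("context_after", "")
    pattern = re.escape(rule["in"])
    if before:
        # Python lookbehinds must be fixed-width; re.error is raised otherwise.
        pattern = f"(?<={before})" + pattern
    if after:
        pattern = pattern + f"(?={after})"
    return re.compile(pattern)


def apply_rules(text, rules):
    """Apply each rule in order to the output of the previous one."""
    for rule in rules:
        replacement = rule["out"]
        # A function replacement keeps the output literal (no backslash escapes).
        text = rule_to_pattern(rule).sub(lambda match, r=replacement: r, text)
    return text


for word in ("kaa", "taa", "baa"):
    print(word, "->", apply_rules(word, RULES))
```

Because the rules run in sequence here, the output of an earlier rule could be matched again by a later one; the example rules avoid this, but a real mapping may need more care.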

Use as Python package

You can also call the package from a Python script. The call returns the converted text without formatting, and running it still creates a foo_converted.docx file.

import convertextract
text = convertextract.process('foo.docx', mapping='bar.xlsx')

To convert a string directly in Python, without reading a file, use process_text.

import convertextract
text = convertextract.process_text('test', mapping=[{'in': 't', 'out': 'p', 'context_before': '^', 'context_after': 'e'}])
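When the rules already live in a script as simple rows, the list of dictionaries passed as mapping can be built programmatically. The sketch below uses the four keys from the process_text example above; rows_to_mapping is a hypothetical helper, and filling missing contexts with empty strings is an assumption about what the library accepts.

```python
# Keys taken from the process_text example in this README.
KEYS = ("in", "out", "context_before", "context_after")


def rows_to_mapping(rows):
    """Convert (in, out[, context_before[, context_after]]) rows to dicts."""
    mapping = []
    for row in rows:
        # Pad short rows so every rule has all four keys (assumed to be valid).
        padded = list(row) + [""] * (len(KEYS) - len(row))
        mapping.append(dict(zip(KEYS, padded)))
    return mapping


rules = rows_to_mapping([("t", "p", "^", "e"), ("aa", "a:")])
print(rules)
```

The resulting list has the same shape as the mapping argument in the example above and could be passed to process_text in the same way.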
