Arbitrary transliterations on Microsoft Office documents
convertextract
========
Extract and find/replace text based on arbitrary correspondences. This library is a fork of the Textract library by Dean Malmgren: https://github.com/deanmalmgren/textract
Documentation
Installation
To install, you must have Python 3.4+ and pip installed.
pip install convertextract
Some source libraries need to be installed for different operating systems to support various file formats. Visit http://textract.readthedocs.org/en/latest/installation.html for documentation.
Basic CLI Use
Some basic Textract functions are preserved. Please visit http://textract.readthedocs.org for documentation.
Converting a file based on pre-existing Mappings in the G2P library
Under the hood, convertextract uses the [g2p](https://github.com/roedoejet/g2p) library to do conversions. There are many mappings available through that library. For a list of all possible mappings, please visit https://g2pstudio-herokuapp.com/api/v1/langs.
For this type of call, convertextract requires three arguments:
- A file containing text to convert (as of Version 1.0.4, this includes .pptx, .docx, .xlsx, and .txt)
- A code corresponding to the input language of the text.
- A code corresponding to the desired output language of the text.
Running the command:
convertextract path/to/foo.docx -il eng-ipa -ol eng-arpabet
will produce a new file, path/to/foo_converted.docx, which contains the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping between English IPA (eng-ipa) and English Arpabet (eng-arpabet).
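The output naming described above can be sketched in a few lines. This is only an illustration of the `foo.docx` to `foo_converted.docx` pattern; `converted_path` is a hypothetical helper, not a function exported by convertextract.

```python
from pathlib import PurePosixPath

def converted_path(source):
    """Return the source path with '_converted' added before the extension."""
    p = PurePosixPath(source)
    # Keep the folder and extension; only the file stem changes.
    return p.with_name(f"{p.stem}_converted{p.suffix}")

print(converted_path("path/to/foo.docx"))  # path/to/foo_converted.docx
```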
Converting a file based on custom mapping
If the mapping you want is not supported by g2p, you should make a pull request there to have it included! Otherwise, you can use a custom file.
Running the command:
convertextract path/to/foo.docx -m path/to/rules.csv
will produce a new file, path/to/foo_converted.docx, which contains the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping at path/to/rules.csv.
Creating an .xlsx/.csv/.psv/.tsv correspondence sheet
Your correspondence sheet must be set up as follows:
in | out |
---|---|
aa | å |
oe | ø |
ae | æ |
This correspondence sheet would replace all instances of aa, oe, or ae in a given file with å, ø, or æ respectively. Do not include headers like "replace with" or "find".
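To make the effect of such a sheet concrete, here is a minimal sketch of plain find/replace driven by CSV rows. It assumes a two-column sheet with one pair per row and no header row, and it applies the rows in sheet order. The names `SHEET`, `load_rules`, and `apply_rules` are hypothetical, and the real conversion is done by g2p, which may order and apply rules differently.

```python
import csv
import io

# Hypothetical sheet contents: one "in,out" pair per row, no header row.
SHEET = "aa,å\noe,ø\nae,æ\n"

def load_rules(sheet_text):
    """Parse CSV text into a list of (find, replace) pairs."""
    reader = csv.reader(io.StringIO(sheet_text))
    return [(row[0], row[1]) for row in reader if row]

def apply_rules(text, rules):
    """Replace every occurrence of each 'in' string with its 'out' string."""
    for find, replace in rules:
        text = text.replace(find, replace)
    return text

rules = load_rules(SHEET)
print(apply_rules("blaabaer og oel", rules))  # blåbær og øl
```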
Supported conversions
As of Version 3.0, any mappings that are valid in the g2p library are supported. Here are a few:
- Heiltsuk Doulos Font -> Unicode
convertextract path/to/foo.docx -il hei -ol hei-doulos
- Heiltsuk Times Font -> Unicode
convertextract path/to/foo.docx -il hei -ol hei-times
- Tsilhqot'in Doulos Font -> Unicode
convertextract path/to/foo.docx -il clc -ol clc-doulos
- Navajo Times Font -> Unicode
convertextract path/to/foo.docx -il nav -ol nav-times
Using Regular Expressions
As of Version 1.5, there is support for Regular Expressions. If you do not need context-sensitive conversions, you can leave them out. If you do, set up your correspondence sheet as follows:
in | out | context_before | context_after |
---|---|---|---|
aa | å | [k,d] | $ |
aa | æ | t | $ |
aa | a: | | |
For more information on how g2p mappings are actually processed, please visit https://github.com/roedoejet/g2p.
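One way to picture the context columns in the sheet above is as a lookbehind and a lookahead wrapped around the `in` string. The sketch below uses Python's `re` module with the three example rows, treats `in` as a literal string, and applies the rows in sheet order. It is a simplified illustration under those assumptions, not g2p's actual processing, and `compile_rule` and `convert` are hypothetical names.

```python
import re

# Rows from the example sheet: (in, out, context_before, context_after).
ROWS = [
    ("aa", "å", "[k,d]", "$"),
    ("aa", "æ", "t", "$"),
    ("aa", "a:", "", ""),
]

def compile_rule(find, before, after):
    """Build a pattern matching `find` only between the given contexts."""
    pattern = re.escape(find)
    if before:
        # Lookbehind: Python's re requires this to be fixed-width.
        pattern = f"(?<={before})" + pattern
    if after:
        # Lookahead: the context is checked but not consumed.
        pattern = pattern + f"(?={after})"
    return re.compile(pattern)

def convert(text, rows=ROWS):
    """Apply each row in order; later rows see the output of earlier ones."""
    for find, out, before, after in rows:
        text = compile_rule(find, before, after).sub(lambda m: out, text)
    return text

for word in ["kaa", "taa", "kaal"]:
    print(word, "->", convert(word))
```

With these rows, a word-final aa after k or d becomes å, a word-final aa after t becomes æ, and any remaining aa falls through to the context-free rule.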
Use as Python package
You can also use the package in a Python script. The call returns the converted text without formatting, and running the script will still create a foo_converted.docx file.
import convertextract
text = convertextract.process('foo.docx', mapping='bar.xlsx')
You can also use convertextract to convert plain text directly in Python using process_text.
import convertextract
text = convertextract.process_text('test', mapping=[{'in': 't', 'out': 'p', 'context_before': '^', 'context_after': 'e'}])
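The mapping passed to process_text above is a list of dicts keyed by `in`, `out`, `context_before`, and `context_after`. If your rules start out as simple rows, a small helper can build that list. This is a sketch only: `rows_to_mapping` is a hypothetical helper, and padding missing context cells with empty strings is an assumption that should be checked against the g2p documentation.

```python
# Keys taken from the process_text example above.
KEYS = ("in", "out", "context_before", "context_after")

def rows_to_mapping(rows):
    """Turn sheet-style rows into a list of dicts, padding missing cells with ''."""
    mapping = []
    for row in rows:
        padded = list(row) + [""] * (len(KEYS) - len(row))
        mapping.append(dict(zip(KEYS, padded)))
    return mapping

mapping = rows_to_mapping([("t", "p", "^", "e"), ("aa", "a:")])
print(mapping)
```

The resulting list has the same shape as the `mapping` argument in the example call, so it could be passed in its place.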