Arbitrary transliterations on Microsoft Office documents

Project description

convertextract

========

Extract and find/replace text based on arbitrary correspondences. This library is a fork from the Textract library by Dean Malmgren. https://github.com/deanmalmgren/textract

Documentation

Installation

To install, you must have Python 3.4+ and pip installed.

pip install convertextract

Some source libraries need to be installed for different operating systems to support various file formats. Visit http://textract.readthedocs.org/en/latest/installation.html for documentation.

=========

Basic CLI Use

Some basic Textract functions are preserved. Please visit http://textract.readthedocs.org for documentation.

Converting a file based on pre-existing Mappings in the G2P library

Under the hood, convertextract uses the g2p library to do conversions. There are many mappings available through that library. For a list of all possible mappings, please visit https://g2pstudio-herokuapp.com/api/v1/langs.

For this type of call, convertextract requires three arguments:

A file containing text to convert (as of Version 1.0.4, this includes .pptx, .docx, .xlsx, and .txt)
A code corresponding to the input language of the text.
A code corresponding to the desired output language of the text.

Running the command:

convertextract path/to/foo.docx -il eng-ipa -ol eng-arpabet

Will produce a new file path/to/foo_converted.docx which will contain the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping between English IPA (eng-ipa) and English Arpabet (eng-arpabet).

Converting a file based on custom mapping

If the mapping you want is not supported by g2p, you should make a pull request there to have it included! Otherwise, you can use a custom file.

Running the command:

convertextract path/to/foo.docx -m path/to/rules.csv

Creating an .xlsx/.csv/.psv/.tsv correspondence sheet

Your correspondence sheet must be set up as follows:

in	out
aa	å
oe	ø
ae	æ

Here, this correspondence sheet (do not include headers like "replace with" or "find") would replace all instances of aa, oe, or ae in a given file with å, ø, or æ respectively.

Supported conversions

As of Version 3.0, any mappings that are valid in the g2p library are supported. Here are a few:

Heiltsuk Doulos Font -> Unicode

convertextract path/to/foo.docx -il hei -ol hei-doulos

Heiltsuk Times Font -> Unicode

convertextract path/to/foo.docx -il hei -ol hei-times

Tsilhqot'in Doulos Font -> Unicode

convertextract path/to/foo.docx -il clc -ol clc-doulos

Navajo Times Font -> Unicode

convertextract path/to/foo.docx -il nav -ol nav-times

Using Regular Expressions

As of Version 1.5, there is support for Regular Expressions. If you do not need to use context-sensitive conversions, you do not need to include them. However, if you do, you should set up your correspondence sheet as follows:

in	out	context_before	context_after
aa	å	[k,d]	$
aa	æ	t	$
aa	a:

For more information on how the g2p is acutally processed, please visit https://github.com/roedoejet/g2p.

Use as Python package

You can use the package in a Python script, which returns converted text, but without formatting. Running the script will still create a foo_converted.docx file.

import convertextract
text = convertextract.process('foo.docx', mapping='bar.xlsx')

You can also use convertextract to just convert text in Python using process_text.

import convertextract
text = convertextract.process_text('test', mapping=[{'in': 't', 'out': 'p', 'context_before': '^', 'context_after': 'e'}])

Use with GUI (Graphical User Interface)

Convertextract can also run in a GUI (for Mac 10.14.6 or higher only)

Installing the GUI

To download the app, go to https://github.com/roedoejet/convertextract/releases and select the most recent version.

download

Click to unzip the file, and then right-click and select open. You must right-click or the Mac permissions will not allow you to open the app. unzip open

Using the GUI

The Convertextract app has four arguments:

A file containing text to convert (As of version 3.2.2 .csv, .psv, .tsv, .doc, .docx, .txt, .eaf, .json, .pptx, .html, .xls, .xlsx are supported).
A code that specifies the desired encoding, for example UTF-8.
A code corresponding to the input language of the text.
A code corresponding to the desired output language of the text.

There is also the option for custom g2p lookup tables if your mapping is not already in the g2p library.

gui

The GUI will produce a new file path/to/foo_converted.docx which will contain the same content as path/to/foo.docx but with find/replace performed for all correspondences listed in the mapping. The file format will remain the same as the input file.

Citation

If you use convertextract in published work, please cite it. To cite this work, please use the following (APA):

Pine, A., & Turin, M. (2018). Seeing the Heiltsuk orthography from font encoding through to Unicode: A case study using convertextract. In Proceedings of the LREC 2018 Workshop “CCURL 2018–Sustaining knowledge diversity in the digital age” (pp. 27-30). European Language Resources Association.

or BibTex:

@inproceedings{pine2018convertextract,
  title={{Seeing the Heiltsuk orthography from font encoding through to Unicode: A case study using convertextract}},
  author={Pine, Aidan and Turin, Mark},
  booktitle={{Proceedings of the LREC 2018 Workshop “CCURL 2018--Sustaining knowledge diversity in the digital age”}},
  pages={27--30},
  year={2018},
  organization={{European Language Resources Association}}
}

Project details

Release history Release notifications | RSS feed

This version

3.2.17

Oct 13, 2022

3.2.16

May 19, 2021

3.2.15

May 14, 2021

3.2.14

Mar 16, 2021

3.2.13

Mar 12, 2021

3.2.12

Nov 17, 2020

3.2.11

Nov 17, 2020

3.2.10

Nov 2, 2020

3.2.9

Sep 9, 2020

3.2.8

Aug 14, 2020

3.2.7

Aug 12, 2020

3.2.2

Aug 7, 2020

3.2.1

Aug 7, 2020

3.2.0

Aug 7, 2020

3.1.2

Aug 7, 2020

3.1.1

Jul 24, 2020

3.1.0

Mar 23, 2020

3.0.0

Feb 16, 2020

2.5.0

Aug 7, 2019

2.0.2

Feb 7, 2019

2.0.1

Jan 11, 2019

2.0

Jul 24, 2018

1.6.1

Jun 5, 2018

1.6

May 7, 2018

1.5.1

Apr 30, 2018

1.5

Mar 22, 2018

1.3

Jan 16, 2018

1.2.2

Sep 27, 2017

1.2.1

Jun 18, 2017

1.2

Jun 18, 2017

1.1

Jan 22, 2017

1.0.9

Nov 27, 2016

1.0.8

Nov 27, 2016

1.0.7

Oct 28, 2016

1.0.6

Oct 21, 2016

1.0.5

Oct 21, 2016

1.0.4

Oct 14, 2016

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

convertextract-3.2.17.tar.gz (18.3 kB view details)

Uploaded Oct 13, 2022 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

convertextract-3.2.17-py3-none-any.whl (21.5 kB view details)

Uploaded Oct 13, 2022 Python 3

File details

Details for the file convertextract-3.2.17.tar.gz.

File metadata

Download URL: convertextract-3.2.17.tar.gz
Upload date: Oct 13, 2022
Size: 18.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for convertextract-3.2.17.tar.gz
Algorithm	Hash digest
SHA256	`c245c60054461c93b6f1421cd890136484604199254a8e98e56f49db8cbb485b`
MD5	`490ffa9eae86060f5821bb6b2ec55d6a`
BLAKE2b-256	`ac81ac9c10c436fb40581a79929b8e5b3ebc6075b9b9fd8a0677adfa0d3208d7`

See more details on using hashes here.

File details

Details for the file convertextract-3.2.17-py3-none-any.whl.

File metadata

Download URL: convertextract-3.2.17-py3-none-any.whl
Upload date: Oct 13, 2022
Size: 21.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for convertextract-3.2.17-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e2b23f41378a55f5c08d33eedbbbe83ddcd3f3c24eb4f2428550591895cfa9c4`
MD5	`5aab6b80e9f7365c8bb35139bcb1b79a`
BLAKE2b-256	`acfc58088d5c256e8fcc15dc94d0d8c029e674465efdb933096d515c83a2c6c7`

See more details on using hashes here.

convertextract 3.2.17

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

convertextract

Documentation

Installation

Basic CLI Use

Converting a file based on pre-existing Mappings in the G2P library

Converting a file based on custom mapping

Creating an .xlsx/.csv/.psv/.tsv correspondence sheet

Supported conversions

Using Regular Expressions

Use as Python package

Use with GUI (Graphical User Interface)

Installing the GUI

Using the GUI

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes