
A Python module that exposes text for modification in multiple file types.


ExposeText

Expose the text in a document for modification.



Disclaimer: This is a prototype. Do not use it for anything critical.

What is ExposeText?

Dealing with document file formats can be quite painful. Often, code must be written that is specific to one file format. We wrote ExposeText with the goal of making document modification as simple as changing Python strings. A slice of the original document can be assigned new content directly, using the character indices of the extracted text, while the document's original formatting is kept.

We published a blog post about ExposeText on Medium.

Supported Formats

ExposeText has prototype-level support for the following file formats:

  • .txt
    • By default, the encoding is assumed to be UTF-8.
    • You can install chardet (pip install chardet) to detect the encoding automatically.
  • .html
    • You can pass an HTML snippet, an HTML body, or a complete HTML document. If you pass a complete HTML document, all text outside the body is ignored.
    • The output file will always be encoded in UTF-8.
  • .docx
    • Only text within <w:t> tags (the tags that hold the document's text) is exposed. For example, the mailto link of an e-mail address is not exposed.
  • .pdf
    • By default, text in PDFs can only be replaced with characters that already occur in the file (fonts are stored economically in PDF files).
    • If you install the additional dependencies Poppler (pdftohtml) and wkhtmltopdf, the PDF is re-rendered and the restriction on usable characters no longer applies.
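
The default PDF restriction above can be checked roughly before an edit is made. The following is a minimal sketch, not part of ExposeText: `unsupported_chars` is a hypothetical helper, and it assumes that the characters in the extracted text are a reasonable approximation of the characters available in the file's fonts.

```python
def unsupported_chars(extracted_text: str, replacement: str) -> set:
    """Return the characters of `replacement` that never occur in `extracted_text`."""
    # Assumption: every character that can be rendered shows up in the extracted text.
    available = set(extracted_text)
    return {ch for ch in replacement if ch not in available}


text = "This is the content as string."
print(unsupported_chars(text, "nice"))  # set(): n, i, c and e all occur in the text
print(unsupported_chars(text, "new"))   # {'w'}: there is no "w" in the text
```

An empty result suggests the replacement is likely to work without re-rendering; a non-empty one points to the characters that may be missing.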

Usage

ExposeText supports files as well as binary data objects. Depending on your use case, you can use one of the following interfaces to make modifications.

Installation

expose-text can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

pip install expose-text

Slicing API

The slicing API applies each alteration immediately.

Exposing and modifying text inside a file:

>>> from expose_text import FileWrapper
>>>
>>> wrapper = FileWrapper("myfile.docx")
>>> wrapper.text
'This is the content as string.'

>>> wrapper[12:19] = "new content"
>>> wrapper.text
'This is the new content as string.'

>>> wrapper[33] = "!"  # note that you have to use the updated index here
>>> wrapper.text
'This is the new content as string!'

>>> wrapper.save("newfile.docx")
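
The index shift noted in the example above can be reproduced on a plain Python string. This is only an illustrative sketch of the arithmetic; `assign_slice` is a hypothetical helper and says nothing about how ExposeText works internally.

```python
def assign_slice(text: str, start: int, end: int, new: str):
    """Replace text[start:end] with `new`; return the new text and the index shift."""
    updated = text[:start] + new + text[end:]
    shift = len(new) - (end - start)
    return updated, shift


text = "This is the content as string."
text, shift = assign_slice(text, 12, 19, "new content")
# "content" (7 characters) became "new content" (11 characters), so shift == 4
# and the final "." moved from index 29 to index 29 + shift == 33.
text, _ = assign_slice(text, 29 + shift, 30 + shift, "!")
print(text)  # This is the new content as string!
```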

If you want to work directly with binary data you have to pass the file format:

>>> from expose_text import BinaryWrapper
>>>
>>> wrapper = BinaryWrapper(my_bytes, ".docx")
>>> wrapper.text
'This is the content as string.'

>>> wrapper[12:19] = "new content"
>>> wrapper.text
'This is the new content as string.'

>>> wrapper.bytes  # get the modified file as bytes
b'...'
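
When the bytes come from a file on disk, the format string can be taken from the file name. A small sketch, assuming the extension matches one of the supported formats listed above; `read_bytes_and_format` is a hypothetical helper, not part of ExposeText.

```python
from pathlib import Path


def read_bytes_and_format(path):
    """Return the file's raw bytes and its extension, e.g. (b"...", ".docx")."""
    file_path = Path(path)
    return file_path.read_bytes(), file_path.suffix


# Intended use together with the wrapper shown above:
#   my_bytes, file_format = read_bytes_and_format("myfile.docx")
#   wrapper = BinaryWrapper(my_bytes, file_format)
```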

Functional API

With the functional API, you can queue several alterations based on the initial indices and then apply them together.

>>> wrapper.text
'This is the content as string.'

>>> wrapper.add_alter(12, 19, "new content")
>>> wrapper.add_alter(29, 30, "!")
>>> wrapper.apply_alters()
>>> wrapper.text
'This is the new content as string!'
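
One way to see why queued alterations can all use the initial indices is to apply them from the end of the text towards the start. The sketch below does this on a plain string. It assumes the alterations do not overlap, `apply_alterations` is a hypothetical helper, and it is not necessarily how ExposeText implements apply_alters().

```python
def apply_alterations(text: str, alterations):
    """Apply (start, end, new_text) alterations that all refer to the original text."""
    # Work from the end of the text towards the start so that the indices of
    # alterations further to the left are still valid when they are applied.
    for start, end, new_text in sorted(alterations, reverse=True):
        text = text[:start] + new_text + text[end:]
    return text


queued = [(12, 19, "new content"), (29, 30, "!")]
print(apply_alterations("This is the content as string.", queued))
# This is the new content as string!
```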

Development

Install requirements

You can install all (production and development) requirements using:

pip install -r requirements.txt

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=expose_text

Testing in Docker

You can also run the tests in a Docker container:

docker build -t expose-text .
docker run expose-text

How to contact us

For usage questions, bugs, or suggestions, please file a GitHub issue. If you would like to contribute or have other questions, please email hello@openredact.org.

License

MIT License
