Finds and highlights text in documents

Highlight text in documents

txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.

Current file formats supported:

  • pdf

Installation

The easiest way to install is via pip and PyPI.

pip install txtmarker

Python 3.9+ is supported. Using a Python virtual environment is recommended.

txtmarker can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/txtmarker

Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

Notebooks

  • Introducing txtmarker: Overview of the functionality provided by txtmarker
  • Highlighting with Transformers: AI-driven highlighting with Transformers

Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

Create a new highlighter

Creates a new highlighter instance.

from txtmarker.factory import Factory
highlighter = Factory.create("pdf")

extension

extension: string

Type of highlighter to create (e.g. "pdf", currently the only supported format)

Optional constructor arguments:

formatter

formatter: callable

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.

chunks

chunks: int

Splits queries into multiple chunks. This is designed for very long text matches.
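
A minimal sketch of how the optional arguments might be supplied. The formatter here is a hypothetical cleanup function that only collapses whitespace. Passing formatter and chunks as keyword arguments to Factory.create, and the chunk count of 4, are assumptions for illustration rather than documented usage.

```python
import re

def formatter(text):
    """Hypothetical cleanup function: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def create_highlighter():
    # Imported inside the function so the formatter can be tried without txtmarker installed
    from txtmarker.factory import Factory

    # Assumption: the optional arguments are accepted as keywords by Factory.create
    return Factory.create("pdf", formatter=formatter, chunks=4)
```

Since the same formatter is described as applying to both queries and input text, keeping it simple and deterministic helps the two sides stay comparable.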

Page text

Extracts page text from infile and returns it as a generator. This enables analysis of the text exactly as it will appear to the highlighter.

highlighter.pages("input.pdf")

infile

infile: string

Full path to input file
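
As a rough illustration of analyzing page text before highlighting, the sketch below reports which pages mention a term. It assumes each item yielded by pages() is the text of one page as a string. The helper names pages_containing and scan are hypothetical and not part of txtmarker.

```python
def pages_containing(pages, term):
    """Return 1-based page numbers whose text contains term (case-insensitive)."""
    term = term.lower()
    return [number for number, text in enumerate(pages, 1) if term in text.lower()]

def scan(infile, term):
    # Imported inside the function so the helper above works without txtmarker installed
    from txtmarker.factory import Factory

    highlighter = Factory.create("pdf")

    # Assumption: pages() yields one string of text per page
    return pages_containing(highlighter.pages(infile), term)
```

Because pages() returns a generator, the text is consumed one page at a time and the whole document does not need to be held in memory.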

Highlight text

Highlights text in the input file using the provided annotations. The annotated file is written to outfile.

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])

infile

infile: string

Full path to input file

outfile

outfile: string

Full path to output file, i.e. the highlighted file

highlights

highlights: list of (string, string|regex)

List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. When matching a literal string, escape any regular expression special characters first (e.g. with re.escape).
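
The difference between literal and pattern matching can be sketched as follows. The literal helper, the sample strings and the run function are hypothetical. Passing a regular expression as a pattern string (rather than a compiled object) is an assumption based on the advice to escape literal text.

```python
import re

def literal(name, text):
    """Build a highlight pair that matches text exactly, with special characters escaped."""
    return (name, re.escape(text))

highlights = [
    # Parentheses, "$" and "." would otherwise be read as regular expression syntax
    literal("price", "Total cost (USD): $4.50"),
    # A regular expression pattern: any four-digit year starting with 19 or 20
    ("year", r"\b(?:19|20)\d{2}\b"),
    # The name may be None
    (None, re.escape("C++")),
]

def run(infile, outfile):
    # Imported inside the function so the list above can be built without txtmarker installed
    from txtmarker.factory import Factory

    highlighter = Factory.create("pdf")
    highlighter.highlight(infile, outfile, highlights)
```

Without re.escape, a string such as "C++" is not a valid pattern at all, and "$4.50" would not match the text it appears to describe.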
