Finds and highlights text in documents
txtmarker: Highlight text in documents
txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document and creates a modified version with the highlights embedded.
Current file formats supported:
Installation
The easiest way to install is via pip and PyPI:
pip install txtmarker
You can also install txtmarker directly from GitHub. Using a Python Virtual Environment is recommended.
pip install git+https://github.com/neuml/txtmarker
Python 3.6+ is supported
Examples
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
Notebooks
Notebook | Description |
---|---|
Introducing txtmarker | Overview of the functionality provided by txtmarker |
Highlighting with Transformers | AI-driven highlighting with Transformers |
Configuration
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
Create a new highlighter
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
extension
extension: string
Type of highlighter to create (e.g. pdf)
Optional constructor arguments:
formatter
formatter: callable
Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
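A formatter can be any callable that takes a string and returns a cleaned string. The sketch below is a hypothetical example: the name `clean` and its cleanup rules are assumptions, not part of txtmarker. It replaces symbols outside a small allowlist with spaces and collapses whitespace. How the callable is handed to `Factory.create` is not shown in this section, so the final call is left as a comment and assumes a `formatter` keyword argument.

```python
import re

def clean(text):
    """Hypothetical formatter: keep word characters and basic punctuation, collapse whitespace."""
    # Replace anything that is not a word character, whitespace or basic punctuation with a space
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean("Results*\n\n(see   Table 2) are significant"))

# Assumed usage, not confirmed by this section:
# highlighter = Factory.create("pdf", formatter=clean)
```

Because the same formatter is applied to both the queries and the input text, the two sides stay comparable after cleanup.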
chunks
chunks: int
Splits queries into multiple chunks. This is designed for very long text matches.
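The idea behind chunking can be illustrated without the library. The sketch below is not txtmarker's implementation: `split_chunks` is a hypothetical helper that splits a long query into a given number of word-based pieces, each of which could then be matched on its own.

```python
def split_chunks(text, chunks):
    """Hypothetical helper: split text into at most `chunks` word-based pieces of similar size."""
    words = text.split()
    if chunks < 1 or not words:
        return [text]
    # Ceiling division so every word lands in some chunk
    size = -(-len(words) // chunks)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

query = "a very long passage that may wrap across lines or pages in the source document"
for piece in split_chunks(query, 3):
    print(piece)
```

Shorter pieces are more likely to match when a long passage is interrupted by line breaks, page breaks or stray symbols in the source document.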
Highlight text
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
infile
infile: string
Full path to input file
outfile
outfile: string
Full path to output file, i.e. the highlighted file
highlights
highlights: list of (string, string|regex)
List of highlight elements. Each pair has a name (can be None) and text value. The text can either be a string or a regular expression.
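Because each entry is a (name, text) pair, the list can be assembled and sanity-checked before any document is opened. The sketch below is an illustration only: the sample text, the names and the `preview` helper are hypothetical. The regular expression is written as a compiled pattern, which is an assumption; this section does not say whether txtmarker expects a compiled pattern or a pattern string.

```python
import re

# Hypothetical sample standing in for text extracted from a document
sample = "Revenue grew 12% in 2020. Net income was flat."

highlights = [
    ("Growth", "Revenue grew 12%"),              # plain string
    (None, re.compile(r"Net income was \w+")),   # regular expression, name left as None
    ("Missing", "Operating margin"),             # does not occur in the sample
]

def preview(highlights, text):
    """Hypothetical helper: report which highlight entries occur in text."""
    found = []
    for name, query in highlights:
        if hasattr(query, "search"):
            # Compiled patterns are searched as-is
            match = query.search(text)
        else:
            # Plain strings are matched literally
            match = re.search(re.escape(query), text)
        found.append((name, match is not None))
    return found

print(preview(highlights, sample))
```

The same list would then be passed as the third argument to `highlighter.highlight`.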