
LLM-powered text linter


textaur

textaur takes an input document (plain text or PDF), extracts the text, prompts an LLM to lint, reformat, and clean it up, and saves the result.

It automagically turns this noisy OCR scan:

48

49
50
50a
ASL

51
AS2
52

REVISED 11/1/2017
-- 33-36.

David dives headfirst onto the small bed. He grabs the
pillow and hugs it tightly. A single tear runs down
his cheek.
DISSOLVE TO:
OMIT 43
OMIT 49
OMIT 50
OMIT SOA
EXT. HOTEL - MORNING ASL
David and Steve walk out of the hotel onto the sunny
street. David looks tired.
STEVE

It's not that big a city, David.

I'll bet there's an arcade at

every corner.
The boys look up the block then turn and look down
the other way.

STEVE
Let's try the next street.
cur To:

OMIT 51
OMIT AS2

into this properly formatted screenplay:

David dives headfirst onto the small bed. He grabs the pillow and hugs it tightly. A single tear runs down his cheek.

DISSOLVE TO:

EXT. HOTEL - MORNING

David and Steve walk out of the hotel onto the sunny street. David looks tired.

STEVE
It's not that big a city, David. I bet there's an arcade at every corner.

The boys look up the block then turn and look down the other way.

STEVE
Let's try the next street.

CUT TO:

The textaur pipeline was originally created to reformat messy text into proper Fountain screenplay format, but the same process (with a much simpler linting prompt) works for general text. textaur also lets you supply your own linting prompt if you have a special use case.

Installation

pip install textaur 

OpenAI API Key

textaur needs an OpenAI API key to lint text with the LLM. (Text extraction works without it.) Set the key in your environment:

export OPENAI_API_KEY="very secret key"
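
A quick pre-flight check can catch a missing key before a long extraction run. The sketch below only inspects the environment; the helper name has_openai_key is made up for illustration and is not part of textaur.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` is any mapping of environment variables; it defaults to
    os.environ. This helper is hypothetical, not a textaur API.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is not set; only extraction (--no-lint) will work.")
```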

Optional Dependencies

If you want to use optical character recognition (OCR) to extract text from PDFs, the tesseract and poppler binaries must be installed on your system. They are not bundled with textaur, so you'll need to install them separately:

# macOS
brew install tesseract poppler

# Ubuntu/Debian
sudo apt install tesseract-ocr poppler-utils

# Windows
lol

OCR is noticeably slower than direct text extraction, but it is usually necessary if you're starting with a scanned file. If you're not sure whether you need OCR, extract the text without linting it by passing the --no-lint flag, then review that output manually to see whether it's usable.
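
Before running with --ocr, it can help to confirm that the OCR binaries are actually on your PATH. This is a minimal sketch, not something textaur ships: the function name is hypothetical, and checking for pdftoppm (one of the Poppler command-line utilities) is an assumption about which Poppler executable matters.

```python
import shutil

# Executables assumed to be needed for OCR: `tesseract` (the OCR engine)
# and `pdftoppm` (a Poppler utility that renders PDF pages to images).
# Which Poppler tool textaur actually calls is an assumption here.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_ocr_binaries(which=shutil.which):
    """Return the names of required binaries that cannot be found on PATH."""
    return [name for name in REQUIRED_BINARIES if which(name) is None]


if __name__ == "__main__":
    missing = missing_ocr_binaries()
    if missing:
        print("Missing OCR binaries:", ", ".join(missing))
    else:
        print("tesseract and poppler look installed.")
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real PATH.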

Usage

To extract and lint general text from a PDF or text file:

textaur ./path/to/my.pdf

This extracts the text and saves it to ./path/to/my_extracted_text.txt, then lints the text and saves the result to ./path/to/my_linted_text.txt. The extracted text is saved as a separate file only if the input is a PDF, not if it's plain text.
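
The default naming described above can be sketched as follows. This mirrors the README's example rather than textaur's actual code: the function name is hypothetical, and applying the same _linted_text.txt pattern to plain-text inputs is an assumption.

```python
from pathlib import Path


def default_output_paths(input_path):
    """Return (extracted_path, linted_path) for a given input file.

    extracted_path is None for non-PDF inputs, since the README says the
    extracted text is only saved separately when the input is a PDF.
    """
    src = Path(input_path)
    linted = src.with_name(f"{src.stem}_linted_text.txt")
    extracted = None
    if src.suffix.lower() == ".pdf":
        extracted = src.with_name(f"{src.stem}_extracted_text.txt")
    return extracted, linted


if __name__ == "__main__":
    extracted, linted = default_output_paths("./path/to/my.pdf")
    print(extracted)
    print(linted)
```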

If it's a screenplay and you want to use OCR:

textaur ./path/to/my_scanned_screenplay.pdf --mode screenplay --ocr

Options

  • -m, --mode <text|t|screenplay|sp>: Type of input text (generic or screenplay). Defaults to generic if omitted.
  • -o, --output <file>: Save linted output to this file instead of the default path
  • --extracted-text <file>: Save extracted, unlinted text to this file instead of the default path
  • --ocr: Use optical character recognition to extract text if the input is a PDF (off by default; without it, textaur tries to pull the text directly out of the PDF)
  • --no-lint: Extract and save text only, without AI linting
  • --prompt <file>: File to use as a custom AI linting prompt
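
If you call textaur from a script, the options above map onto an argument list like the one built below. This is only a sketch: build_textaur_command is a hypothetical helper, and it assumes the textaur executable is on your PATH; the flags themselves are the ones listed above.

```python
def build_textaur_command(input_path, mode=None, output=None,
                          extracted_text=None, ocr=False, lint=True,
                          prompt=None):
    """Build the argv list for a textaur invocation (nothing is executed)."""
    cmd = ["textaur", str(input_path)]
    if mode:
        cmd += ["--mode", mode]            # text, t, screenplay, or sp
    if output:
        cmd += ["--output", str(output)]
    if extracted_text:
        cmd += ["--extracted-text", str(extracted_text)]
    if ocr:
        cmd.append("--ocr")
    if not lint:
        cmd.append("--no-lint")
    if prompt:
        cmd += ["--prompt", str(prompt)]
    return cmd


if __name__ == "__main__":
    cmd = build_textaur_command("my_scanned_screenplay.pdf",
                                mode="screenplay", ocr=True)
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the list separately from running it keeps the command easy to log or inspect before any API calls are made.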

Additional Notes

  • Input files must exist and be readable.
  • Output and extracted text files are created in the same directory as the input unless you specify otherwise with --output or --extracted-text.

TODO

  • Support more LLMs
  • More prompts for other kinds of text input/formatting
  • Set of evals for existing and new prompts
  • Config to set:
    • Default mode (text, screenplay, other text types)
    • Preferred LLM once supported
    • LLM API key

