CLI tool to strip ChatGPT-specific markers from text

stripgpt

CLI (and tiny library) to scrub ChatGPT / LLM conversation artifacts from text files or streams.

It removes:

  • Private Use Area span markers used by ChatGPT export (U+E200 / U+E201) and the text inside them
  • Any remaining private-use characters (Unicode category Co)
  • Zero‑width & directionality control characters (ZWSP, ZWNJ, ZWJ, LRM, RLM, LRE, RLE, PDF, LRO, RLO, WJ, LRI, RLI, FSI, PDI)
  • (Optional, --kill-bare) "bare" leftover tokens like turn2search5 and line-range snippets like L10-L42
  • (On by default; disable with --no-normalize) Whitespace normalization: collapses runs of spaces / tabs, removes trailing spaces, trims both ends
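
To see which characters in a piece of text fall into the categories listed above, Python's unicodedata module is enough. The sketch below is not part of stripgpt: list_hidden is a hypothetical helper that reports private-use (Co) and format (Cf) characters. Cf is broader than the list above (it also covers the soft hyphen and the BOM, for example), so treat this as a diagnostic aid, not as a description of what stripgpt removes.

```python
import unicodedata


def list_hidden(text):
    """Return (index, code point, name) for private-use and format characters."""
    hits = []
    for i, ch in enumerate(text):
        # Co = private use, Cf = format (zero-width and bidi controls live here)
        if unicodedata.category(ch) in ("Co", "Cf"):
            # Private-use characters have no name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


for hit in list_hidden("a\u200bb\ue200"):
    print(hit)
```

An empty result means none of these characters are present in the input.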

Why?

Copying / exporting LLM answers often smuggles in hidden marker & control characters that pollute diffs and source control. stripgpt makes cleaning them automatic and scriptable.

Features

  • Stream or file mode (stdin→stdout or specified files)
  • In‑place editing with optional backup suffix
  • Conservative defaults (whitespace normalized unless --no-normalize)
  • Optional removal of leftover token artifacts
  • Simple Python API: from stripgpt import clean_text
  • Tested on Python 3.12 (minimum supported)
  • CI workflow already configured (GitHub Actions)

Installation

Editable (development) install:

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

From PyPI:

pip install stripgpt

Command Line Usage

Read from stdin / write to stdout (the example uses the macOS clipboard tools pbpaste and pbcopy):

pbpaste | stripgpt | pbcopy

Clean one or more files (output to stdout):

stripgpt session.md > clean.md
stripgpt file1.txt file2.txt > merged-clean.txt

In place (overwrite):

stripgpt -i session.md

In place with backup:

stripgpt -i --backup-suffix .bak session.md
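
What in-place editing with a backup typically amounts to can be sketched in Python. This is an assumption about the behaviour, not stripgpt's code: rewrite_in_place and its transform argument are hypothetical, and the sketch assumes the suffix is appended to the full file name (session.md becomes session.md.bak).

```python
import shutil
from pathlib import Path


def rewrite_in_place(path, transform, backup_suffix=None, encoding="utf-8"):
    """Apply transform to a file's text, optionally keeping a backup copy."""
    path = Path(path)
    original = path.read_text(encoding=encoding)
    if backup_suffix:
        # Assumed naming: append the suffix to the whole file name.
        shutil.copy2(path, path.with_name(path.name + backup_suffix))
    path.write_text(transform(original), encoding=encoding)
```

The backup is written before the original is overwritten, so a failure during the rewrite still leaves an untouched copy on disk.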

Remove bare tokens & line ranges too:

stripgpt --kill-bare transcript.txt > scrubbed.txt

Preserve original whitespace:

stripgpt --no-normalize notes.txt > cleaned.txt

Specify encoding (default utf-8):

stripgpt --encoding latin-1 legacy.txt > legacy-clean.txt

Detection only (no modification) – JSON report per input:

stripgpt --detect file1.txt file2.txt
# or
cat text.md | stripgpt --detect

Example output:

{"pua_spans":1,"bare_tokens":2,"zero_width":3,"file":"file1.txt"}
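
A report like this is easy to consume from a script, for instance to fail a check when any input still contains artifacts. The sketch below assumes one JSON object per line with the keys shown in the example; flag_dirty is a hypothetical helper, and the fallback name for a report without a "file" key is also an assumption.

```python
import json


def flag_dirty(report_text):
    """Return the names of inputs whose report has any nonzero count."""
    dirty = []
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue
        report = json.loads(line)
        # Assumed: reports for stdin may carry no "file" key.
        name = report.pop("file", "<stdin>")
        if any(isinstance(v, int) and v > 0 for v in report.values()):
            dirty.append(name)
    return dirty


example = '{"pua_spans":1,"bare_tokens":2,"zero_width":3,"file":"file1.txt"}'
print(flag_dirty(example))  # ['file1.txt']
```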

Help:

stripgpt -h

Exit Codes

Code  Meaning
0     Success
1     Unhandled / runtime error (message on stderr)

Library API

from stripgpt import clean_text

text = "Hello\u200b world"  # sample input containing a zero-width space
cleaned = clean_text(text, kill_bare=True, normalize=True)

Signature:

clean_text(txt: str, *, kill_bare: bool, normalize: bool) -> str

Parameters:

  • kill_bare: remove tokens like turn12search5 and ranges L10-L20
  • normalize: collapse repeated spaces / tabs, strip trailing & leading whitespace

How It Works

  1. Remove any span starting with U+E200 and ending with U+E201 (non-greedy), including enclosed text
  2. Strip any remaining private-use characters (category Co)
  3. Remove zero-width & bidi control characters
  4. Optionally remove bare token artifacts & line ranges
  5. Optionally normalize whitespace

All regexes are compiled at import time; for typical file sizes, performance is I/O bound.
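
The five steps can be sketched in a few lines of Python. This is an illustrative reconstruction from the description above, not stripgpt's actual source: clean_sketch, the pattern names, and the exact bare-token and line-range regexes are assumptions.

```python
import re

# Patterns compiled once at import. All of them are assumptions, not stripgpt's own.
PUA_SPAN = re.compile("\ue200.*?\ue201", re.DOTALL)  # non-greedy span
PRIVATE_USE = re.compile(
    "[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]"  # category Co
)
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060\u2066-\u2069]")
BARE_TOKEN = re.compile(r"\bturn\d+[a-z]+\d+\b")  # e.g. turn2search5
LINE_RANGE = re.compile(r"\bL\d+-L\d+\b")         # e.g. L10-L42
SPACE_RUNS = re.compile(r"[ \t]+")
TRAILING = re.compile(r"[ \t]+$", re.MULTILINE)


def clean_sketch(txt: str, *, kill_bare: bool, normalize: bool) -> str:
    txt = PUA_SPAN.sub("", txt)        # 1. marker spans and their contents
    txt = PRIVATE_USE.sub("", txt)     # 2. leftover private-use characters
    txt = INVISIBLE.sub("", txt)       # 3. zero-width and bidi controls
    if kill_bare:                      # 4. optional bare tokens / line ranges
        txt = BARE_TOKEN.sub("", txt)
        txt = LINE_RANGE.sub("", txt)
    if normalize:                      # 5. optional whitespace cleanup
        txt = SPACE_RUNS.sub(" ", txt)
        txt = TRAILING.sub("", txt)
        txt = txt.strip()
    return txt


sample = "Hello\u200b world \ue200cite\ue202turn2search5\ue201 see turn2search5 L10-L42  \n"
print(clean_sketch(sample, kill_bare=True, normalize=True))  # Hello world see
```

Running the span removal first matters: tokens enclosed by U+E200 / U+E201 disappear with the span, so the bare-token pass only has to deal with leftovers outside any marker pair.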

Development

Requires Python 3.12.

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest -q

Or via tox:

tox

Publishing (manual)

Requires build and twine (install via pip install build twine).

python -m build
twine check dist/*
twine upload dist/*  # set TWINE_USERNAME=__token__ and TWINE_PASSWORD to your PyPI API token, or enter credentials when prompted

Or use the provided GitHub Actions workflow (add PYPI_API_TOKEN secret).

Run the CLI as a module (works once the editable install is in place):

python -m stripgpt --help

Continuous Integration

GitHub Actions workflow (.github/workflows/ci.yml) runs tests on Python 3.12.

Suggested Enhancements

  • Streaming (line-by-line) processing to reduce memory
  • Coverage & badge
  • Pre-commit hook config
  • Removal statistics / summary report
  • Additional token pattern detection

Troubleshooting

Issue                  Hint
File unchanged         Use -i for in-place or redirect stdout to a file
Hidden chars remain    Inspect with hexdump -C or a Unicode viewer; open an issue with samples
Encoding errors        Pass --encoding matching the source file
"No tests ran" in CI   Ensure tests/ present & pytest.ini unchanged

Safety

Pair -i with --backup-suffix (for example .bak) during first runs so the original file is kept.

License

MIT License. See LICENSE file.

Acknowledgements

Inspired by persistent invisible marker annoyances in exported ChatGPT conversations.


Happy clean diffs!

Download files

Source distribution: stripgpt-0.2.0.tar.gz (8.5 kB)
Built distribution: stripgpt-0.2.0-py3-none-any.whl (7.8 kB, Python 3)
