CLI tool to strip ChatGPT-specific markers from text
stripgpt
CLI (and tiny library) to scrub ChatGPT / LLM conversation artifacts from text files or streams.
It removes:
- Private Use Area span markers used by ChatGPT export (U+E200 / U+E201) and the text inside them
- Any remaining private-use characters (Unicode category Co)
- Zero-width & directionality control characters (ZWSP, ZWNJ, ZWJ, LRM, RLM, LRE, RLE, PDF, LRO, RLO, WJ, LRI, RLI, FSI, PDI)
- (Optional) "bare" leftover tokens like turn2search5 and line range snippets like L10-L42
- (Optional) Normalizes whitespace (collapses runs of spaces / tabs, removes trailing space, trims ends)
Why?
Copying / exporting LLM answers often smuggles in hidden marker & control characters that pollute diffs and source control. stripgpt makes cleaning them automatic and scriptable.
Features
- Stream or file mode (stdin→stdout or specified files)
- In‑place editing with optional backup suffix
- Conservative defaults (whitespace normalized unless --no-normalize)
- Optional removal of leftover token artifacts
- Simple Python API: from stripgpt import clean_text
- Tested on Python 3.12 (minimum supported)
- CI workflow already configured (GitHub Actions)
Installation
Editable (development) install:
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
Once published to PyPI:
pip install stripgpt
Command Line Usage
Read from stdin / write to stdout:
pbpaste | stripgpt | pbcopy
Clean one or more files (output to stdout):
stripgpt session.md > clean.md
stripgpt file1.txt file2.txt > merged-clean.txt
In place (overwrite):
stripgpt -i session.md
In place with backup:
stripgpt -i --backup-suffix .bak session.md
Remove bare tokens & line ranges too:
stripgpt --kill-bare transcript.txt > scrubbed.txt
Preserve original whitespace:
stripgpt --no-normalize notes.txt > cleaned.txt
Specify encoding (default utf-8):
stripgpt --encoding latin-1 legacy.txt > legacy-clean.txt
Detection only (no modification) – JSON report per input:
stripgpt --detect file1.txt file2.txt
# or
cat text.md | stripgpt --detect
Example output:
{"pua_spans":1,"bare_tokens":2,"zero_width":3,"file":"file1.txt"}
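The report can be consumed from a script, for example to fail a pre-merge check when any input still contains artifacts. The sketch below assumes the tool prints one JSON object per line, as the example output suggests. The helper names and the `<stdin>` label are hypothetical, not part of stripgpt.

```python
import json

# Counter keys taken from the example output above.
COUNTER_KEYS = ("pua_spans", "bare_tokens", "zero_width")


def parse_detect_report(output: str) -> list[dict]:
    """Parse --detect output, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def dirty_files(reports: list[dict]) -> list[str]:
    """Return the inputs whose report has at least one non-zero counter."""
    return [
        report.get("file", "<stdin>")  # hypothetical label when no file key is present
        for report in reports
        if any(report.get(key, 0) for key in COUNTER_KEYS)
    ]


sample = '{"pua_spans":1,"bare_tokens":2,"zero_width":3,"file":"file1.txt"}\n'
print(dirty_files(parse_detect_report(sample)))
```

The output string would typically come from running stripgpt --detect through subprocess and reading its stdout.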
Help:
stripgpt -h
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Unhandled / runtime error (message on stderr) |
Library API
from stripgpt import clean_text
cleaned = clean_text(text, kill_bare=True, normalize=True)
Signature:
clean_text(txt: str, *, kill_bare: bool, normalize: bool) -> str
Parameters:
- kill_bare: remove tokens like turn12search5 and ranges like L10-L20
- normalize: collapse repeated spaces / tabs, strip trailing & leading whitespace
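For batch jobs, the library call can be wrapped in a small directory walker. The following is a minimal sketch: `clean_tree` and its parameters are hypothetical names, and the cleaning function is passed in as a callable so the example does not depend on how stripgpt is installed.

```python
from pathlib import Path
from typing import Callable


def clean_tree(
    root: Path,
    clean: Callable[[str], str],
    pattern: str = "*.md",
    encoding: str = "utf-8",
) -> list[Path]:
    """Rewrite every file under root matching pattern; return the files that changed."""
    changed = []
    for path in sorted(root.rglob(pattern)):
        original = path.read_text(encoding=encoding)
        cleaned = clean(original)
        if cleaned != original:
            # Only touch files whose content actually differs.
            path.write_text(cleaned, encoding=encoding)
            changed.append(path)
    return changed
```

With stripgpt installed, `functools.partial(clean_text, kill_bare=True, normalize=True)` could be passed as the `clean` argument.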
How It Works
- Remove any span starting with U+E200 and ending with U+E201 (non-greedy), including enclosed text
- Strip any remaining private-use characters (category Co)
- Remove zero-width & bidi control characters
- Optionally remove bare token artifacts & line ranges
- Optionally normalize whitespace
All regexes are compiled once at import time, so performance is I/O bound for typical file sizes.
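The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's actual source: the patterns for bare tokens and line ranges are guesses based on the examples in this README, and `sketch_clean` is a hypothetical name.

```python
import re
import unicodedata

# Step 1: U+E200 ... U+E201 spans, non-greedy, including the enclosed text.
PUA_SPAN = re.compile("\ue200.*?\ue201", re.DOTALL)
# Step 3: ZWSP, ZWNJ, ZWJ, LRM, RLM / LRE, RLE, PDF, LRO, RLO / WJ / LRI, RLI, FSI, PDI.
ZERO_WIDTH_BIDI = re.compile("[\u200b-\u200f\u202a-\u202e\u2060\u2066-\u2069]")
# Step 4 (assumed shapes): tokens like turn2search5 and ranges like L10-L42.
BARE_TOKEN = re.compile(r"\bturn\d+[a-z]+\d+\b")
LINE_RANGE = re.compile(r"\bL\d+-L\d+\b")


def sketch_clean(txt: str, *, kill_bare: bool = False, normalize: bool = True) -> str:
    txt = PUA_SPAN.sub("", txt)
    # Step 2: drop any leftover private-use characters (category Co).
    txt = "".join(ch for ch in txt if unicodedata.category(ch) != "Co")
    txt = ZERO_WIDTH_BIDI.sub("", txt)
    if kill_bare:
        txt = BARE_TOKEN.sub("", txt)
        txt = LINE_RANGE.sub("", txt)
    if normalize:
        # Step 5: collapse runs of spaces / tabs, drop trailing space, trim ends.
        txt = re.sub(r"[ \t]+", " ", txt)
        txt = re.sub(r"[ \t]+$", "", txt, flags=re.MULTILINE)
        txt = txt.strip()
    return txt
```

The order matters in this sketch: spans are removed before stray private-use characters so the enclosed text goes with its markers, and whitespace is normalized last so gaps left by removed tokens are collapsed.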
Development
Requires Python 3.12.
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
pytest -q
Or via tox:
tox
Publishing (manual)
Requires build and twine (install via pip install build twine).
python -m build
twine check dist/*
twine upload dist/* # set PYPI_TOKEN or enter credentials
Or use the provided GitHub Actions workflow (add PYPI_API_TOKEN secret).
Run the CLI locally as a module (an editable install already works):
python -m stripgpt --help
Continuous Integration
GitHub Actions workflow (.github/workflows/ci.yml) runs tests on Python 3.12.
Suggested Enhancements
- Streaming (line-by-line) processing to reduce memory
- Coverage & badge
- Pre-commit hook config
- Removal statistics / summary report
- Additional token pattern detection
Troubleshooting
| Issue | Hint |
|---|---|
| File unchanged | Use -i for in-place or redirect stdout to a file |
| Hidden chars remain | Inspect with hexdump -C or a Unicode viewer; open an issue with samples |
| Encoding errors | Pass --encoding matching the source file |
| "No tests ran" in CI | Ensure tests/ present & pytest.ini unchanged |
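When hidden characters seem to remain, a short script can list them by code point, as an alternative to reading hexdump output. This sketch uses only the standard library. `suspicious_chars` is a hypothetical helper, and it flags every private-use (Co) and format (Cf) character, which is a broader set than the list stripgpt removes (it also catches, for example, the soft hyphen and the byte order mark).

```python
import unicodedata


def suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, code point, name) for private-use and format characters."""
    hits = []
    for offset, ch in enumerate(text):
        if unicodedata.category(ch) in {"Co", "Cf"}:
            # Private-use characters have no Unicode name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((offset, f"U+{ord(ch):04X}", name))
    return hits


for hit in suspicious_chars("a\u200bb\ue200"):
    print(hit)
```

Attaching this kind of listing to an issue makes it easier to see which characters were missed.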
Safety
Use --backup-suffix during first runs for peace of mind.
License
MIT License. See LICENSE file.
Acknowledgements
Inspired by persistent invisible marker annoyances in exported ChatGPT conversations.
Happy clean diffs!